View Issue Details

ID: 0001222
Project: bareos-core
Category: director
View Status: public
Last Update: 2023-07-18 15:52
Reporter: a.key
Assigned To: bruno-at-bareos
Priority: high
Severity: crash
Reproducibility: always
Status: closed
Resolution: unable to reproduce
Platform: Linux
OS: CentOS
OS Version: 7
Product Version: 19.2.6
Summary: 0001222: Director crashes with SIGSEGV when the maximum number of threads (connections) is reached
Description: The director process crashes with a segmentation fault (signal 11) when a large number of clients connect simultaneously. The only message the director logs is:

06-Apr 13:49 director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.

 
Steps To Reproduce: Configure a large number of clients (more than 50) with "Client initiated connections".
Because the connection "wave" may arrive with a random delay after the director starts, reduce the maximum number of connections from the default of 30 to a lower value. In my case, with 95 clients, reducing Maximum Connections to 29 crashes the director every time the wave of connections comes in.

Director { # define myself
  Name = director.example.com
  QueryFile = "/usr/lib/bareos/scripts/query.sql"
  Password = "<PASSWORD>" # Console password
  Messages = Daemon
  Auditing = yes

  Maximum Connections = 29

  #TLS Enable = yes
  #TLS Require = yes
  #TLS Verify Peer = yes
  #
  TLS DH File = /etc/bareos/tls/dh2048.pem
  TLS CA Certificate File = /etc/bareos/tls/ca.pem
  TLS Key = /etc/bareos/tls/dir.key
  TLS Certificate = /etc/bareos/tls/dir.crt
}

Start the director daemon in the foreground with debugging enabled:
bareos-dir -u bareos -v -f -d 999 -dt
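
If 95 real clients are not at hand, the connection wave might be approximated with plain TCP connections. The following is only a sketch under assumptions: the function name `open_connection_wave` is made up for illustration, and it assumes that idle TCP connections which never authenticate still occupy a connection slot in the director, which this report does not confirm.

```python
import socket
import threading
import time


def open_connection_wave(host, port, count, hold_seconds=1.0):
    """Open `count` TCP connections at roughly the same time and hold them.

    Returns the number of connections that were established.
    Hypothetical helper, not part of Bareos.
    """
    start = threading.Barrier(count)  # releases all workers together
    lock = threading.Lock()
    established = 0

    def worker():
        nonlocal established
        try:
            start.wait(timeout=10)
            with socket.create_connection((host, port), timeout=5):
                with lock:
                    established += 1
                time.sleep(hold_seconds)  # keep the slot occupied
        except (OSError, threading.BrokenBarrierError):
            pass  # refused, timed out, or the barrier broke

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return established
```

A call such as `open_connection_wave("director.example.com", 9101, 40, hold_seconds=5.0)` would target the director on its default port 9101 with more connections than the configured limit of 29.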


After the initial process setup (connecting to the catalog and the storage daemon), the wave of connections hits the director and it crashes with:

06-Apr-2020 14:08:33.557199 director.example.com (900): lib/btimers.cc:178-0 Start bsock timer 7f3ac801ce30 tid=0x5800007f3ad77fe7 for 600 secs at 1586178513
06-Apr-2020 14:08:33.557227 director.example.com (50): lib/cram_md5.cc:82-0 send: auth cram-md5 <54030378.1586178513@director.example.com> ssl=1
06-Apr-2020 14:08:33.560315 director.example.com (50): lib/bnet.cc:142-0 TLS server negotiation established.
06-Apr-2020 14:08:33.560349 director.example.com (900): lib/btimers.cc:178-0 Start bsock timer 7f3ad001c5f0 tid=0x5800007f3ad7fff7 for 600 secs at 1586178513
06-Apr-2020 14:08:33.560377 director.example.com (50): lib/cram_md5.cc:82-0 send: auth cram-md5 <1064229369.1586178513@director.example.com> ssl=1
06-Apr-2020 14:08:33.560453 director.example.com (800): lib/thread_list.cc:243-0 Run WorkerThread successfully.
06-Apr-2020 14:08:33.560493 director.example.com (100): lib/bsock.cc:84-0 Construct BareosSocket
06-Apr-2020 14:08:33.560505 director.example.com (800): lib/thread_list.cc:216-0 Number of maximum threads exceeded: 29
06-Apr-2020 14:08:33.560529 director.example.com (850): lib/message.cc:1309-0 Enter Jmsg type=1
06-Apr-2020 14:08:33.560547 director.example.com (850): lib/message.cc:641-0 Enter DispatchMessage type=1 msg=director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
06-Apr-2020 14:08:33.561357 director.example.com (850): lib/message.cc:873-0 APPEND for following msg: director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
06-Apr-2020 14:08:33.561754 director.example.com (850): lib/message.cc:754-0 CONSOLE for following msg: director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
06-Apr-2020 14:08:33.561926 director.example.com (850): lib/message.cc:847-0 MAIL for following msg: director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
06-Apr-2020 14:08:33.561941 director.example.com (850): lib/message.cc:299-0 mailname=/var/lib/bareos/director.example.com.director.example.com.17910416.mail
BAREOS forced SEG FAULT to obtain traceback.
06-Apr-2020 14:08:33.562130 director.example.com (900): lib/signal.cc:119-0 sig=11 Segmentation violation
BAREOS interrupted by signal 11: Segmentation violation
bareos-dir, director.example.com got signal 11 - Segmentation violation. Attempting traceback.
exepath=/etc/bareos/bareos-dir.d
06-Apr-2020 14:08:33.562206 director.example.com (300): lib/signal.cc:183-0 Working=/var/lib/bareos
06-Apr-2020 14:08:33.562218 director.example.com (300): lib/signal.cc:184-0 btpath=/etc/bareos/bareos-dir.d/btraceback
06-Apr-2020 14:08:33.562228 director.example.com (300): lib/signal.cc:185-0 exepath=/etc/bareos/bareos-dir.d/bareos-dir
06-Apr-2020 14:08:33.563057 director.example.com (500): lib/signal.cc:216-0 Doing waitpid
Calling: /etc/bareos/bareos-dir.d/btraceback /etc/bareos/bareos-dir.d/bareos-dir 17393 /var/lib/bareos
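
The relevant lines are easy to miss in debug level 999 output. A small filter along these lines could pull out the thread limit, the abort, and the signal number. This is a sketch that relies only on the message texts visible in the log above; the function name `summarize_debug_log` is hypothetical.

```python
import re

# Patterns taken from the message texts in the debug output above.
LIMIT_RE = re.compile(r"Number of maximum threads exceeded: (\d+)")
ABORT_RE = re.compile(r"ABORTING due to ERROR")
SIGNAL_RE = re.compile(r"got signal (\d+)")


def summarize_debug_log(lines):
    """Return the thread limit, abort flag, and signal number found in the log."""
    summary = {"thread_limit": None, "aborted": False, "signal": None}
    for line in lines:
        match = LIMIT_RE.search(line)
        if match:
            summary["thread_limit"] = int(match.group(1))
        if ABORT_RE.search(line):
            summary["aborted"] = True
        match = SIGNAL_RE.search(line)
        if match:
            summary["signal"] = int(match.group(1))
    return summary
```

It can be fed a saved log, for example `summarize_debug_log(open("bareos-dir.debug"))`, where the file name is a placeholder for wherever the output of the command above was captured.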


The traceback file is the most unhelpful debug output ever in this case:
# cat /var/lib/bareos/director.example.com.17393.bactrace
Attempt to dump current JCRs. njcrs=2
threadid=0x3900007f3b3b6108 JobId=0 JobStatus=R jcr=0x1147550 name=*JobMonitor*.2020-04-06_14.08.03_01
threadid=0x3000007f3b3b6108 killable=0 JobId=0 JobStatus=R jcr=0x1147550 name=*JobMonitor*.2020-04-06_14.08.03_01
        UseCount=1
        JobType=I JobLevel=
        sched_time=06-Apr-2020 14:08 start_time=06-Apr-2020 14:08
        end_time=01-Jan-1970 01:00 wait_time=01-Jan-1970 01:00
        db=(nil) db_batch=(nil) batch_started=0
threadid=0x6000007f3b2ec3e7 JobId=0 JobStatus=R jcr=0x7f3b28000900 name=*StatisticsCollector*.2020-04-06_14.08.03_02
threadid=0x6500007f3b2ec3e7 killable=0 JobId=0 JobStatus=R jcr=0x7f3b28000900 name=*StatisticsCollector*.2020-04-06_14.08.03_02
        UseCount=1
        JobType=I JobLevel=
        sched_time=06-Apr-2020 14:08 start_time=06-Apr-2020 14:08
        end_time=01-Jan-1970 01:00 wait_time=01-Jan-1970 01:00
        db=0x7f3b28001ff0 db_batch=(nil) batch_started=0
BareosDb=0x7f3b28001ff0 db_name=bareos db_user=bareos connected=true
        cmd="db_init_database first time
" changes=0
        RWLOCK=0x7f3b28001ff8 w_active=0 w_wait=0

Additional Information: The default maximum number of connections allowed by the director is 30. When this limit is reached or exceeded, the director process crashes.
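
To make the observed behaviour concrete, here is a toy model of a connection limit. It is not Bareos source code: the class `BoundedThreadList`, the function `accept_wave`, and the exact comparison used for the limit are assumptions for illustration. It contrasts aborting on the first connection over the limit, as the log suggests happens, with simply rejecting that connection.

```python
import threading


class BoundedThreadList:
    """Toy worker registry with a hard upper limit (hypothetical)."""

    def __init__(self, max_threads):
        self.max_threads = max_threads
        self._active = 0
        self._lock = threading.Lock()

    def try_add(self):
        """Reserve a slot; return False when the limit is already reached."""
        with self._lock:
            if self._active >= self.max_threads:
                return False
            self._active += 1
            return True

    def remove(self):
        """Release a slot when a worker finishes."""
        with self._lock:
            if self._active > 0:
                self._active -= 1


def accept_wave(thread_list, incoming, abort_on_full):
    """Handle `incoming` connections and return (accepted, rejected).

    With abort_on_full=True the first connection over the limit raises,
    mirroring the abort seen in the log. Otherwise it is only counted
    as rejected and the loop keeps serving.
    """
    accepted = rejected = 0
    for _ in range(incoming):
        if thread_list.try_add():
            accepted += 1
        elif abort_on_full:
            raise RuntimeError("Could not add thread to list.")
        else:
            rejected += 1
    return accepted, rejected
```

With the numbers from this report, `accept_wave(BoundedThreadList(29), 95, abort_on_full=False)` returns `(29, 66)`, while `abort_on_full=True` raises on the 30th connection.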

Tags: No tags attached.

Activities

jurgengoedbloed

2021-02-19 10:52

reporter   ~0004090

I upgraded to Bareos 20.0.0 and have the same issue.

I already described it in report 1082 (https://bugs.bareos.org/view.php?id=1082). Back then it happened under 18.2.5, but I see the same issue on 19.2 and 20.0.
bruno-at-bareos

2023-07-18 15:52

manager   ~0005209

This does not seem to be reproducible with Bareos 22.1.0; please upgrade and retry.

Issue History

Date Modified      Username          Field                Change
2020-04-06 15:15   a.key             New Issue
2021-02-19 10:52   jurgengoedbloed   Note Added: 0004090
2023-07-18 15:52   bruno-at-bareos   Assigned To          => bruno-at-bareos
2023-07-18 15:52   bruno-at-bareos   Status               new => closed
2023-07-18 15:52   bruno-at-bareos   Resolution           open => unable to reproduce
2023-07-18 15:52   bruno-at-bareos   Note Added: 0005209