View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001222 | bareos-core | director | public | 2020-04-06 15:15 | 2023-07-18 15:52 |
Reporter | a.key | Assigned To | bruno-at-bareos | ||
Priority | high | Severity | crash | Reproducibility | always |
Status | closed | Resolution | unable to reproduce | ||
Platform | Linux | OS | CentOS | OS Version | 7 |
Product Version | 19.2.6 | ||||
Summary | 0001222: Director crashes with SIGSEGV when the maximum number of threads (connections) is reached | ||||
Description | The director process crashes with a segmentation fault (SIGSEGV) when a large number of clients connect simultaneously. The director logs only: 06-Apr 13:49 director.example.com ABORTING due to ERROR lib/bnet_server_tcp.cc:410 Could not add thread to list. | ||||
Steps To Reproduce | Configure a large number of clients (above 50) with "Client initiated connections". The connection "wave" may arrive with a random delay after the director starts, so reduce the maximum number of connections below the default of 30. In my case, with 95 clients, reducing the maximum connections to 29 crashes the director every time the wave of connections comes in.
    Director { # define myself
      Name = director.example.com
      QueryFile = "/usr/lib/bareos/scripts/query.sql"
      Password = "<PASSWORD>" # Console password
      Messages = Daemon
      Auditing = yes
      Maximum Connections = 29
      #TLS Enable = yes
      #TLS Require = yes
      #TLS Verify Peer = yes
      # TLS DH File = /etc/bareos/tls/dh2048.pem
      TLS CA Certificate File = /etc/bareos/tls/ca.pem
      TLS Key = /etc/bareos/tls/dir.key
      TLS Certificate = /etc/bareos/tls/dir.crt
    }
Start the director daemon in debug mode:
    bareos-dir -u bareos -v -f -d 999 -dt
After the initial process setup and the connections to the catalog and storage daemon, the wave of connections hits the director and it crashes with:
    06-Apr-2020 14:08:33.557199 director.example.com (900): lib/btimers.cc:178-0 Start bsock timer 7f3ac801ce30 tid=0x5800007f3ad77fe7 for 600 secs at 1586178513
    06-Apr-2020 14:08:33.557227 director.example.com (50): lib/cram_md5.cc:82-0 send: auth cram-md5 <54030378.1586178513@director.example.com> ssl=1
    06-Apr-2020 14:08:33.560315 director.example.com (50): lib/bnet.cc:142-0 TLS server negotiation established.
    06-Apr-2020 14:08:33.560349 director.example.com (900): lib/btimers.cc:178-0 Start bsock timer 7f3ad001c5f0 tid=0x5800007f3ad7fff7 for 600 secs at 1586178513
    06-Apr-2020 14:08:33.560377 director.example.com (50): lib/cram_md5.cc:82-0 send: auth cram-md5 <1064229369.1586178513@director.example.com> ssl=1
    06-Apr-2020 14:08:33.560453 director.example.com (800): lib/thread_list.cc:243-0 Run WorkerThread successfully.
    06-Apr-2020 14:08:33.560493 director.example.com (100): lib/bsock.cc:84-0 Construct BareosSocket
    06-Apr-2020 14:08:33.560505 director.example.com (800): lib/thread_list.cc:216-0 Number of maximum threads exceeded: 29
    06-Apr-2020 14:08:33.560529 director.example.com (850): lib/message.cc:1309-0 Enter Jmsg type=1
    06-Apr-2020 14:08:33.560547 director.example.com (850): lib/message.cc:641-0 Enter DispatchMessage type=1 msg=director.example.com ABORTING due to ERROR lib/bnet_server_tcp.cc:410 Could not add thread to list.
    director.example.com ABORTING due to ERROR lib/bnet_server_tcp.cc:410 Could not add thread to list.
    06-Apr-2020 14:08:33.561357 director.example.com (850): lib/message.cc:873-0 APPEND for following msg: director.example.com ABORTING due to ERROR lib/bnet_server_tcp.cc:410 Could not add thread to list.
    06-Apr-2020 14:08:33.561754 director.example.com (850): lib/message.cc:754-0 CONSOLE for following msg: director.example.com ABORTING due to ERROR lib/bnet_server_tcp.cc:410 Could not add thread to list.
    06-Apr-2020 14:08:33.561926 director.example.com (850): lib/message.cc:847-0 MAIL for following msg: director.example.com ABORTING due to ERROR lib/bnet_server_tcp.cc:410 Could not add thread to list.
    06-Apr-2020 14:08:33.561941 director.example.com (850): lib/message.cc:299-0 mailname=/var/lib/bareos/director.example.com.director.example.com.17910416.mail
    BAREOS forced SEG FAULT to obtain traceback.
    06-Apr-2020 14:08:33.562130 director.example.com (900): lib/signal.cc:119-0 sig=11 Segmentation violation
    BAREOS interrupted by signal 11: Segmentation violation
    bareos-dir, director.example.com got signal 11 - Segmentation violation. Attempting traceback.
    exepath=/etc/bareos/bareos-dir.d
    06-Apr-2020 14:08:33.562206 director.example.com (300): lib/signal.cc:183-0 Working=/var/lib/bareos
    06-Apr-2020 14:08:33.562218 director.example.com (300): lib/signal.cc:184-0 btpath=/etc/bareos/bareos-dir.d/btraceback
    06-Apr-2020 14:08:33.562228 director.example.com (300): lib/signal.cc:185-0 exepath=/etc/bareos/bareos-dir.d/bareos-dir
    06-Apr-2020 14:08:33.563057 director.example.com (500): lib/signal.cc:216-0 Doing waitpid
    Calling: /etc/bareos/bareos-dir.d/btraceback /etc/bareos/bareos-dir.d/bareos-dir 17393 /var/lib/bareos
The traceback file is the most unhelpful debug output ever in this case:
    # cat /var/lib/bareos/director.example.com.17393.bactrace
    Attempt to dump current JCRs. njcrs=2
    threadid=0x3900007f3b3b6108 JobId=0 JobStatus=R jcr=0x1147550 name=*JobMonitor*.2020-04-06_14.08.03_01
    threadid=0x3000007f3b3b6108 killable=0 JobId=0 JobStatus=R jcr=0x1147550 name=*JobMonitor*.2020-04-06_14.08.03_01
    UseCount=1
    JobType=I JobLevel=
    sched_time=06-Apr-2020 14:08 start_time=06-Apr-2020 14:08
    end_time=01-Jan-1970 01:00 wait_time=01-Jan-1970 01:00
    db=(nil) db_batch=(nil) batch_started=0
    threadid=0x6000007f3b2ec3e7 JobId=0 JobStatus=R jcr=0x7f3b28000900 name=*StatisticsCollector*.2020-04-06_14.08.03_02
    threadid=0x6500007f3b2ec3e7 killable=0 JobId=0 JobStatus=R jcr=0x7f3b28000900 name=*StatisticsCollector*.2020-04-06_14.08.03_02
    UseCount=1
    JobType=I JobLevel=
    sched_time=06-Apr-2020 14:08 start_time=06-Apr-2020 14:08
    end_time=01-Jan-1970 01:00 wait_time=01-Jan-1970 01:00
    db=0x7f3b28001ff0 db_batch=(nil) batch_started=0
    BareosDb=0x7f3b28001ff0 db_name=bareos db_user=bareos connected=true
    cmd="db_init_database first time " changes=0
    RWLOCK=0x7f3b28001ff8 w_active=0 w_wait=0 | ||||
Additional Information | The default maximum number of connections allowed by the director is 30. When this limit is reached or exceeded, the director process crashes. | ||||
Tags | No tags attached. | ||||
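The failure mode reported above (a wave of connections larger than "Maximum Connections", followed by "Could not add thread to list" and an abort) can be pictured with a small sketch. This is not Bareos code: `BoundedThreadList`, `accept_wave`, and `ThreadListFull` are hypothetical names chosen for illustration, and the sketch only contrasts two possible reactions to a full worker list, a fatal abort versus rejecting the extra connection. It assumes all connections arrive before any worker finishes.

```python
import threading


class ThreadListFull(Exception):
    """Hypothetical stand-in for a fatal 'Could not add thread to list' abort."""


class BoundedThreadList:
    """Hypothetical list of per-connection workers with a fixed upper limit."""

    def __init__(self, max_threads):
        self.max_threads = max_threads
        self._lock = threading.Lock()
        self._active = 0

    def try_add(self):
        """Reserve one worker slot; return False when the limit is reached."""
        with self._lock:
            if self._active >= self.max_threads:
                return False
            self._active += 1
            return True

    def release(self):
        """Free one worker slot when a connection ends."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    @property
    def active(self):
        with self._lock:
            return self._active


def accept_wave(thread_list, n_connections, abort_on_full):
    """Simulate n_connections arriving before any worker has finished.

    Returns (accepted, rejected). With abort_on_full=True the first
    connection over the limit raises ThreadListFull instead.
    """
    accepted = 0
    rejected = 0
    for _ in range(n_connections):
        if thread_list.try_add():
            accepted += 1
        elif abort_on_full:
            raise ThreadListFull("could not add thread to list")
        else:
            rejected += 1  # drop this connection, keep the server running
    return accepted, rejected


if __name__ == "__main__":
    # Numbers taken from the report: 95 clients, limit of 29.
    print(accept_wave(BoundedThreadList(29), 95, abort_on_full=False))
```

With the numbers from this report (95 clients, a limit of 29), the non-aborting variant accepts 29 connections and rejects 66, while the aborting variant stops at the 30th connection. Whether later Bareos versions behave like either variant is not stated in this issue.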
I upgraded to Bareos 20.0.0 and have the same issue. I already described it in report 1082 (https://bugs.bareos.org/view.php?id=1082). At that time it happened under 18.2.5, but I see the same issue on 19.2 and 20.0. |
|
Doesn't seem to be reproducible with Bareos 22.1.0; please upgrade and retry. | |
Date Modified | Username | Field | Change |
---|---|---|---|
2020-04-06 15:15 | a.key | New Issue | |
2021-02-19 10:52 | jurgengoedbloed | Note Added: 0004090 | |
2023-07-18 15:52 | bruno-at-bareos | Assigned To | => bruno-at-bareos |
2023-07-18 15:52 | bruno-at-bareos | Status | new => closed |
2023-07-18 15:52 | bruno-at-bareos | Resolution | open => unable to reproduce |
2023-07-18 15:52 | bruno-at-bareos | Note Added: 0005209 |