View Issue Details

ID: 0001222
Project: bareos-core
Category: [All Projects] director
View Status: public
Last Update: 2020-04-14 13:49
Reporter: a.key
Assigned To:
Priority: high
Severity: crash
Reproducibility: always
Status: new
Resolution: open
Platform: Linux
OS: CentOS
OS Version: 7
Product Version: 19.2.6
Fixed in Version:
Summary: 0001222: Director crashes with a segmentation fault (SIGSEGV) when the maximum number of threads (connections) is reached
Description: The director process crashes with a segmentation fault (signal 11) when a large number of clients connect simultaneously. The director log shows only:

06-Apr 13:49 director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.

 
Steps To Reproduce: Configure a large number of clients (more than 50) with "Client initiated connections".
Because the connection "wave" may arrive with a random delay after the director starts, reduce the maximum number of connections from the default of 30 to a lower value. In my case, with 95 clients, setting Maximum Connections to 29 crashes the director every time the wave of connections comes in.

Director { # define myself
  Name = director.example.com
  QueryFile = "/usr/lib/bareos/scripts/query.sql"
  Password = "<PASSWORD>" # Console password
  Messages = Daemon
  Auditing = yes

  Maximum Connections = 29

  #TLS Enable = yes
  #TLS Require = yes
  #TLS Verify Peer = yes
  #
  TLS DH File = /etc/bareos/tls/dh2048.pem
  TLS CA Certificate File = /etc/bareos/tls/ca.pem
  TLS Key = /etc/bareos/tls/dir.key
  TLS Certificate = /etc/bareos/tls/dir.crt
}

Start the director daemon in the foreground with debugging enabled:
bareos-dir -u bareos -v -f -d 999 -dt
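
If 95 real clients are not available, a rough stand-in for the connection wave might be a script that opens many TCP connections to the director and holds them open. The Python sketch below is only an assumption about what could trigger the limit: it has not been verified against a director, it opens the connections one after another rather than truly simultaneously, and `open_wave` and `close_wave` are hypothetical helper names.

```python
import socket


def open_wave(host, port, count, timeout=5.0):
    """Open up to `count` TCP connections and keep them open.

    Returns the list of sockets that were connected successfully.
    Stops early if a connection attempt fails.
    """
    conns = []
    try:
        for _ in range(count):
            conns.append(socket.create_connection((host, port), timeout=timeout))
    except OSError:
        # Connection refused, reset, or timed out, e.g. because the
        # server stopped accepting connections.
        pass
    return conns


def close_wave(conns):
    """Close every socket returned by open_wave()."""
    for sock in conns:
        sock.close()
```

A call such as `open_wave("director.example.com", 9101, 40)` would target the director from the configuration above, assuming it listens on the default director port 9101.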


After the initial process setup (connecting to the catalog and the storage daemon), the director crashes as soon as the wave of connections hits it:

06-Apr-2020 14:08:33.557199 director.example.com (900): lib/btimers.cc:178-0 Start bsock timer 7f3ac801ce30 tid=0x5800007f3ad77fe7 for 600 secs at 1586178513
06-Apr-2020 14:08:33.557227 director.example.com (50): lib/cram_md5.cc:82-0 send: auth cram-md5 <54030378.1586178513@director.example.com> ssl=1
06-Apr-2020 14:08:33.560315 director.example.com (50): lib/bnet.cc:142-0 TLS server negotiation established.
06-Apr-2020 14:08:33.560349 director.example.com (900): lib/btimers.cc:178-0 Start bsock timer 7f3ad001c5f0 tid=0x5800007f3ad7fff7 for 600 secs at 1586178513
06-Apr-2020 14:08:33.560377 director.example.com (50): lib/cram_md5.cc:82-0 send: auth cram-md5 <1064229369.1586178513@director.example.com> ssl=1
06-Apr-2020 14:08:33.560453 director.example.com (800): lib/thread_list.cc:243-0 Run WorkerThread successfully.
06-Apr-2020 14:08:33.560493 director.example.com (100): lib/bsock.cc:84-0 Construct BareosSocket
06-Apr-2020 14:08:33.560505 director.example.com (800): lib/thread_list.cc:216-0 Number of maximum threads exceeded: 29
06-Apr-2020 14:08:33.560529 director.example.com (850): lib/message.cc:1309-0 Enter Jmsg type=1
06-Apr-2020 14:08:33.560547 director.example.com (850): lib/message.cc:641-0 Enter DispatchMessage type=1 msg=director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
06-Apr-2020 14:08:33.561357 director.example.com (850): lib/message.cc:873-0 APPEND for following msg: director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
06-Apr-2020 14:08:33.561754 director.example.com (850): lib/message.cc:754-0 CONSOLE for following msg: director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
06-Apr-2020 14:08:33.561926 director.example.com (850): lib/message.cc:847-0 MAIL for following msg: director.example.com ABORTING due to ERROR
lib/bnet_server_tcp.cc:410 Could not add thread to list.
06-Apr-2020 14:08:33.561941 director.example.com (850): lib/message.cc:299-0 mailname=/var/lib/bareos/director.example.com.director.example.com.17910416.mail
BAREOS forced SEG FAULT to obtain traceback.
06-Apr-2020 14:08:33.562130 director.example.com (900): lib/signal.cc:119-0 sig=11 Segmentation violation
BAREOS interrupted by signal 11: Segmentation violation
bareos-dir, director.example.com got signal 11 - Segmentation violation. Attempting traceback.
exepath=/etc/bareos/bareos-dir.d
06-Apr-2020 14:08:33.562206 director.example.com (300): lib/signal.cc:183-0 Working=/var/lib/bareos
06-Apr-2020 14:08:33.562218 director.example.com (300): lib/signal.cc:184-0 btpath=/etc/bareos/bareos-dir.d/btraceback
06-Apr-2020 14:08:33.562228 director.example.com (300): lib/signal.cc:185-0 exepath=/etc/bareos/bareos-dir.d/bareos-dir
06-Apr-2020 14:08:33.563057 director.example.com (500): lib/signal.cc:216-0 Doing waitpid
Calling: /etc/bareos/bareos-dir.d/btraceback /etc/bareos/bareos-dir.d/bareos-dir 17393 /var/lib/bareos
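
At debug level 999 the log grows quickly, so it can help to search it for the thread-limit message with a script. The following Python sketch assumes the line layout seen in the excerpt above (date, time, host, debug level in parentheses, source file and line, then the message). The name `find_thread_limit_hits` is hypothetical, and the reading of the trailing `-0` as a job id is an assumption.

```python
import re

# One debug-log line, as it appears in the excerpt above:
# "<date> <time> <host> (<level>): <file>:<line>-<n> <message>"
LINE_RE = re.compile(
    r"^(?P<date>\d{2}-\w{3}-\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<host>\S+) \((?P<level>\d+)\): "
    r"(?P<source>\S+?):(?P<line>\d+)-(?P<jobid>\d+) (?P<message>.*)$"
)  # "jobid" is an assumed meaning for the number after the dash

LIMIT_RE = re.compile(r"Number of maximum threads exceeded: (\d+)")


def find_thread_limit_hits(lines):
    """Return (time, source file, limit) for each thread-limit message."""
    hits = []
    for raw in lines:
        match = LINE_RE.match(raw.rstrip("\n"))
        if not match:
            continue  # continuation lines and non-log output are skipped
        limit = LIMIT_RE.search(match.group("message"))
        if limit:
            hits.append(
                (match.group("time"), match.group("source"), int(limit.group(1)))
            )
    return hits
```

Because the function accepts any iterable of lines, an open file object for the captured debug output can be passed to it directly.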


The traceback file is the most unhelpful debug output ever in this case:
# cat /var/lib/bareos/director.example.com.17393.bactrace
Attempt to dump current JCRs. njcrs=2
threadid=0x3900007f3b3b6108 JobId=0 JobStatus=R jcr=0x1147550 name=*JobMonitor*.2020-04-06_14.08.03_01
threadid=0x3000007f3b3b6108 killable=0 JobId=0 JobStatus=R jcr=0x1147550 name=*JobMonitor*.2020-04-06_14.08.03_01
        UseCount=1
        JobType=I JobLevel=
        sched_time=06-Apr-2020 14:08 start_time=06-Apr-2020 14:08
        end_time=01-Jan-1970 01:00 wait_time=01-Jan-1970 01:00
        db=(nil) db_batch=(nil) batch_started=0
threadid=0x6000007f3b2ec3e7 JobId=0 JobStatus=R jcr=0x7f3b28000900 name=*StatisticsCollector*.2020-04-06_14.08.03_02
threadid=0x6500007f3b2ec3e7 killable=0 JobId=0 JobStatus=R jcr=0x7f3b28000900 name=*StatisticsCollector*.2020-04-06_14.08.03_02
        UseCount=1
        JobType=I JobLevel=
        sched_time=06-Apr-2020 14:08 start_time=06-Apr-2020 14:08
        end_time=01-Jan-1970 01:00 wait_time=01-Jan-1970 01:00
        db=0x7f3b28001ff0 db_batch=(nil) batch_started=0
BareosDb=0x7f3b28001ff0 db_name=bareos db_user=bareos connected=true
        cmd="db_init_database first time
" changes=0
        RWLOCK=0x7f3b28001ff8 w_active=0 w_wait=0

Additional Information: The default maximum number of connections allowed by the director is 30. When this limit is reached or exceeded, the director process crashes.
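
The debug log suggests that the thread list refuses the new connection once the limit is hit and that the TCP server then treats the refusal as a fatal error. A less drastic behaviour might be to reject only the excess connection and keep serving the others. The Python sketch below models that idea. It is not the Bareos implementation (which is C++), and `BoundedThreadList` and `accept_wave` are hypothetical names used only for illustration.

```python
import threading


class BoundedThreadList:
    """Hypothetical stand-in for a connection thread list with a hard cap."""

    def __init__(self, max_threads):
        self.max_threads = max_threads
        self._active = 0
        self._lock = threading.Lock()

    def try_add(self):
        """Reserve a slot; return False (instead of aborting) when full."""
        with self._lock:
            if self._active >= self.max_threads:
                return False
            self._active += 1
            return True

    def release(self):
        """Free a slot when a connection handler finishes."""
        with self._lock:
            if self._active > 0:
                self._active -= 1


def accept_wave(thread_list, incoming):
    """Admit connections until the cap is reached; reject the rest."""
    accepted, rejected = [], []
    for conn_id in incoming:
        if thread_list.try_add():
            accepted.append(conn_id)
        else:
            # A real server would close this socket and continue listening.
            rejected.append(conn_id)
    return accepted, rejected
```

With a limit of 29 and a wave of 95 connections, as in this report, such a model would admit 29 connections and reject the remaining 66, and rejected clients could simply retry later.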

Tags: No tags attached.

Activities

There are no notes attached to this issue.

Issue History

Date Modified Username Field Change
2020-04-06 15:15 a.key New Issue