View Issue Details

ID: 0001082
Project: bareos-core
Category: [All Projects] director
View Status: public
Last Update: 2019-05-24 15:03
Reporter: jurgengoedbloed
Assigned To: arogge
Priority: normal
Severity: crash
Reproducibility: sometimes
Status: acknowledged
Resolution: open
Platform: Linux
OS: CentOS
OS Version: 7
Product Version: 18.2.5
Target Version:
Fixed in Version:
Summary: 0001082: Bareos director crashes with segfault when restarting or reload from console
Description:
After a config change, I reloaded the bareos director and it crashed with a segfault.

After I restart the Bareos director, it seems to run for about a minute and then crashes again. Sometimes it crashes almost immediately, sometimes after a couple of minutes.

At startup, Bareos doesn't complain about a config error. After a while it just stops.

I already had the same issue in the past, but then managed to get the system up and running after waiting a considerable amount of time (> 1 day) and then restarting the director. It had then been running and backing up for over two weeks.

The director and storage daemon are running 18.2.5. All clients run 17.2.4 or 18.2.5, all running on CentOS 7. The director and storage daemon (both on the same machine) run on a fully patched CentOS 7 machine.

I've had the same issue with the director on version 17.2.4 and a self-compiled 17.2.5

I suspect that it has to do with the fact that all clients use the 'client initiated connection' and something goes wrong as soon as clients reconnect after restart of the director. A race condition, lack of resources..?
Steps To Reproduce:
When the crash occurs:
- Start the bareos director
- Within a minute, the director will crash again.
Additional Information: As requested by Andreas, created this bug and attached the traceback file.
Tags: No tags attached.

Activities

jurgengoedbloed

2019-04-30 14:23

reporter  

bareos-dir.1640.bactrace (1,079 bytes)
arogge

2019-04-30 15:06

developer   ~0003349

Does the problem persist if you disable statistics collection?
jurgengoedbloed

2019-05-02 17:12

reporter   ~0003352

Yes. Statistics collection was already turned off.
The database tables 'devicestats' and 'jobstats' are also empty.
jurgengoedbloed

2019-05-02 17:25

reporter   ~0003353

To add to this...
The director had no statistics enabled.
The storage daemon has.
I have disabled it and restarted the storage daemon.
Then I restarted the director and it crashed again.

What I did then was the following:
- Stop the storage daemon.
- Start the director and monitor whether it keeps running. It keeps on running.
- Then start the storage daemon.
- The director now keeps running.
- Did a small test backup job: runs fine.
Tonight a batch of backup jobs will run, tomorrow I will let you know the outcome.
jurgengoedbloed

2019-05-03 09:13

reporter   ~0003354

All backups ran fine this night.
Is there anything I can test or try?
arogge

2019-05-03 09:58

developer   ~0003355

You can check if you have a meaningful 'traceback' file next to the bactrace you attached.
If you have gdb and the debug packages installed (no performance penalties), then a crash will produce a traceback file where we can see exactly in which function and on which line the crash happened. This helps us a lot in tracking down the crash.
jurgengoedbloed

2019-05-03 10:12

reporter  

bareos.1640.traceback (1,657 bytes)
jurgengoedbloed

2019-05-03 10:12

reporter   ~0003356

Yes, I have. Here is the corresponding traceback file.
arogge

2019-05-03 10:16

developer   ~0003357

From the traceback file (it is a simple text file) it looks like your debug packages don't match the binary packages you've got installed. Could you check this?
jurgengoedbloed

2019-05-03 10:55

reporter   ~0003358

I installed from the bareos repository.
Noticed that the package bareos-debuginfo was still 18.2.4rc. Updated to 18.2.5 and restarted director and storage (they are on the same machine).
If that is what you meant, then the versions should be the same now.
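The mismatch found here (bareos-debuginfo at 18.2.4rc next to 18.2.5 binaries) can be checked mechanically. The following is a minimal sketch, not part of Bareos: it assumes the package names and versions were collected beforehand as "name version" lines, for example with `rpm -q --queryformat '%{NAME} %{VERSION}\n'`, and the helper name `find_version_mismatches` and the choice of `bareos-director` as the reference package are illustrative assumptions.

```python
def find_version_mismatches(rpm_lines, reference="bareos-director"):
    """Return {package: version} for packages whose version differs
    from the version of the reference package.

    rpm_lines: iterable of "name version" strings (assumed input format).
    """
    versions = {}
    for line in rpm_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, version = parts
        versions[name] = version

    expected = versions.get(reference)
    if expected is None:
        raise ValueError(f"reference package {reference!r} not found")

    # Keep only the packages that disagree with the reference version.
    return {n: v for n, v in versions.items() if v != expected}


# Sample input modelled on the situation described in this note.
sample = [
    "bareos-director 18.2.5",
    "bareos-storage 18.2.5",
    "bareos-debuginfo 18.2.4rc",
]
print(find_version_mismatches(sample))
```

An empty result means every listed package matches the reference, which is the state a usable traceback needs.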
jurgengoedbloed

2019-05-03 11:29

reporter   ~0003359

Here is a new traceback file

bareos.63319.traceback (12,707 bytes)
arogge

2019-05-03 11:58

developer   ~0003360

From what I see you're right: you're using client initiated connection and something is wrong with the connection-pool.
When the client 'tst-civ-nominatim' connects to the director the director then should add that connection to the connection pool. However, it looks like the connection pool had already been destroyed at this point.
Is this reproducible with just one client?
I will have to reproduce it myself so we can write a test for this, and I would be glad if there was a simple way to reproduce it.
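To illustrate the kind of hazard described in this note, here is a small Python model of a pool that is replaced during a reload while clients keep connecting. It is only a sketch of the general pattern and says nothing about how the Bareos director (which is written in C++) is actually implemented; the classes `ConnectionPool` and `Director` and all method names are hypothetical. The point it shows is that the replacement pool is built first and swapped in under a lock, so a connecting client never sees a pool that is already gone.

```python
import threading


class ConnectionPool:
    """Hypothetical pool mapping client names to open connections."""

    def __init__(self):
        self._connections = {}

    def add(self, client_name, connection):
        self._connections[client_name] = connection

    def names(self):
        return sorted(self._connections)


class Director:
    """Hypothetical model: reload swaps the pool, clients add to it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pool = ConnectionPool()

    def reload(self):
        # Build the replacement before taking the lock, then swap it in.
        # At no point is self._pool left destroyed or unset.
        new_pool = ConnectionPool()
        with self._lock:
            self._pool = new_pool

    def client_connected(self, client_name, connection):
        # The same lock guards the add, so it cannot interleave with a swap.
        with self._lock:
            self._pool.add(client_name, connection)

    def connected_clients(self):
        with self._lock:
            return self._pool.names()


# Many clients connect while a reload happens concurrently.
director = Director()
threads = [
    threading.Thread(target=director.client_connected,
                     args=(f"client-{i}", object()))
    for i in range(50)
]
threads.append(threading.Thread(target=director.reload))
for t in threads:
    t.start()
for t in threads:
    t.join()

# A client that connects after the reload ends up in the current pool.
director.client_connected("tst-civ-nominatim", object())
print("tst-civ-nominatim" in director.connected_clients())
```

In this model, connections registered before the swap are dropped with the old pool; a real implementation would have to decide whether to migrate them.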
jurgengoedbloed

2019-05-03 17:10

reporter   ~0003361

I will test this after the weekend and let you know.
jurgengoedbloed

2019-05-09 15:34

reporter   ~0003363

I did some tests, but at the moment the director is running fine and I cannot reproduce the crash.
I'll let you know once the director crashes again.
jurgengoedbloed

2019-05-15 08:29

reporter   ~0003369

I got another crash.
Disabled access from all filedaemons by blocking them with iptables, except for one host.
In this situation, the director keeps on running.
jurgengoedbloed

2019-05-15 08:42

reporter   ~0003370

After a minute or so, I removed the iptables block rule. All clients are now connected, the director now seems to run fine.
jurgengoedbloed

2019-05-24 15:03

reporter   ~0003381

It seems that we have crossed a threshold in the number of clients.

I block access to the director except for a small number of clients (<30).
Start the director -> runs fine and shows client initiated connection clients.
As soon as I remove the blockage, the director crashes.

To rule out 'bad clients', I have tried blocking different parts of the network, but that did not help.

The only success I'm having now is this:
- Block all clients
- Stop storage daemon (runs on the same machine)
- Start director
- Allow clients subnet by subnet
- Remove last blockage
- Start storage.

Anything I can do to test?
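The staged start listed above can be written down as an ordered command plan. This is a minimal sketch that only builds the command strings and does not execute anything. Several details are assumptions that are not stated in this report: the service unit names `bareos-dir` and `bareos-sd`, the director listening on TCP port 9101, the use of `INPUT` chain DROP rules as the blocking mechanism, and the example subnets.

```python
def staged_start_plan(subnets, director_port=9101):
    """Return the shell commands for a staged director start, in order.

    subnets: client subnets in CIDR notation (hypothetical values).
    director_port: assumed director port; adjust to the local setup.
    """
    def rule(action, net):
        # action is "-I" to insert the block rule, "-D" to delete it again.
        return (f"iptables {action} INPUT -p tcp "
                f"--dport {director_port} -s {net} -j DROP")

    plan = [rule("-I", net) for net in subnets]   # block all clients
    plan.append("systemctl stop bareos-sd")       # stop storage daemon
    plan.append("systemctl start bareos-dir")     # start director
    plan += [rule("-D", net) for net in subnets]  # allow subnet by subnet
    plan.append("systemctl start bareos-sd")      # start storage daemon
    return plan


for command in staged_start_plan(["10.0.1.0/24", "10.0.2.0/24"]):
    print(command)
```

Printing the plan instead of running it makes it easy to review the order and to pause between the unblock steps while watching whether the director stays up.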

Issue History

Date Modified Username Field Change
2019-04-30 14:23 jurgengoedbloed New Issue
2019-04-30 14:23 jurgengoedbloed File Added: bareos-dir.1640.bactrace
2019-04-30 15:06 arogge Assigned To => arogge
2019-04-30 15:06 arogge Status new => feedback
2019-04-30 15:06 arogge Note Added: 0003349
2019-05-02 17:12 jurgengoedbloed Note Added: 0003352
2019-05-02 17:12 jurgengoedbloed Status feedback => assigned
2019-05-02 17:25 jurgengoedbloed Note Added: 0003353
2019-05-03 09:13 jurgengoedbloed Note Added: 0003354
2019-05-03 09:58 arogge Note Added: 0003355
2019-05-03 10:01 arogge Status assigned => feedback
2019-05-03 10:12 jurgengoedbloed File Added: bareos.1640.traceback
2019-05-03 10:12 jurgengoedbloed Note Added: 0003356
2019-05-03 10:12 jurgengoedbloed Status feedback => assigned
2019-05-03 10:16 arogge Status assigned => feedback
2019-05-03 10:16 arogge Note Added: 0003357
2019-05-03 10:55 jurgengoedbloed Note Added: 0003358
2019-05-03 10:55 jurgengoedbloed Status feedback => assigned
2019-05-03 11:29 jurgengoedbloed File Added: bareos.63319.traceback
2019-05-03 11:29 jurgengoedbloed Note Added: 0003359
2019-05-03 11:58 arogge Status assigned => acknowledged
2019-05-03 11:58 arogge Note Added: 0003360
2019-05-03 17:10 jurgengoedbloed Note Added: 0003361
2019-05-09 15:34 jurgengoedbloed Note Added: 0003363
2019-05-15 08:29 jurgengoedbloed Note Added: 0003369
2019-05-15 08:42 jurgengoedbloed Note Added: 0003370
2019-05-24 15:03 jurgengoedbloed Note Added: 0003381