View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0001082||bareos-core||[All Projects] director||public||2019-04-30 14:23||2019-07-10 17:45|
|Fixed in Version|
|Summary||0001082: Bareos director crashes with segfault when restarting or reload from console|
|Description||After a config change, I reloaded the bareos director and it crashed with a segfault.|
After I restart the Bareos director, it seems to run for about a minute, but then it crashes again: sometimes almost immediately, sometimes after a couple of minutes.
At startup, Bareos doesn't complain about any config error. After a while, it just stops.
I already had the same issue in the past, but then managed to get the system up and running after waiting a considerable amount of time (> 1 day) and then restarting the director. It had then been running and backing up for over two weeks.
The director and storage daemon run 18.2.5; all clients run 17.2.4 or 18.2.5, all on CentOS 7. The director and storage daemon (both on the same machine) run on a fully patched CentOS 7 system.
I've had the same issue with the director on version 17.2.4 and a self-compiled 17.2.5.
I suspect that it has to do with the fact that all clients use 'client initiated connection' and something goes wrong as soon as clients reconnect after a restart of the director. A race condition, or a lack of resources?
|Steps To Reproduce||When the crash occurs:|
- Start the bareos director
- Within a minute, the director will crash again.
|Additional Information||As requested by Andreas, I created this bug and attached the traceback file.|
|Tags||No tags attached.|
bareos-dir.1640.bactrace (1,079 bytes)
|Does the problem persist if you disable statistics collection?|
Yes. Statistics collection was already turned off.
The database tables 'devicestats' and 'jobstats' are also empty.
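For reference, statistics collection is controlled per daemon in the configuration. A sketch of the storage-daemon side, with directive names as I understand them from the Bareos 18.2 documentation (verify against your release; the resource name is an example):

```
# bareos-sd.d/storage/bareos-sd.conf -- Storage resource of the storage daemon
# (directive names assumed from Bareos 18.2 docs; check your version)
Storage {
  Name = bareos-sd
  # Disable statistics collection in the storage daemon:
  Collect Device Statistics = no
  Collect Job Statistics = no
}
```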
To add to this...
The director had no statistics enabled, but the storage daemon did.
I disabled statistics there and restarted the storage daemon.
Then I restarted the director, and it crashed again.
What I did then was the following:
- Stop the storage daemon.
- Start the director and monitor whether it keeps running: it keeps on running.
- Then start the storage daemon.
- The director now keeps running.
I then did a small test backup job: it runs fine.
Tonight a batch of backup jobs will run, tomorrow I will let you know the outcome.
All backups ran fine last night.
Is there anything I can test or try?
You can check if you have a meaningful 'traceback' file next to the bactrace you attached.
If you have gdb and the debug packages installed (no performance penalty), then a crash will produce a traceback file in which we can see exactly in which function and on which line the crash happened. This helps us a lot in tracking down the crash.
bareos.1640.traceback (1,657 bytes)
|Yes, I have. Here is the corresponding traceback file.|
|From the traceback file (it is a simple text file) it looks like your debug packages don't match the binary packages you've got installed. Could you check this?|
I installed from the bareos repository.
Noticed that the package bareos-debuginfo was still 18.2.4rc. Updated to 18.2.5 and restarted director and storage (they are on the same machine).
If that is what you meant, then the versions should be the same now.
Here is a new traceback file
bareos.63319.traceback (12,707 bytes)
From what I see, you're right: you're using client initiated connections and something is wrong with the connection pool.
When the client 'tst-civ-nominatim' connects to the director, the director should add that connection to the connection pool. However, it looks like the connection pool had already been destroyed at that point.
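The failure mode described here, registering a connection in a pool that has already been torn down, can be sketched in a toy model. This is an illustrative Python sketch of the general pattern, not Bareos's actual C++ implementation; the class and method names are hypothetical:

```python
import threading

class ConnectionPool:
    """Toy connection pool: a lock plus a 'destroyed' flag.

    Illustrates the guard needed when client threads may still try to
    register connections while the pool is being torn down (e.g. during
    a director reload/restart).
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._destroyed = False
        self._connections = []

    def add(self, conn):
        # Without checking _destroyed here, a late-arriving client
        # would touch already-freed state -- the suspected cause of
        # the crash described above.
        with self._lock:
            if self._destroyed:
                return False  # reject: the pool is already gone
            self._connections.append(conn)
            return True

    def destroy(self):
        with self._lock:
            self._destroyed = True
            self._connections.clear()

pool = ConnectionPool()
assert pool.add("tst-civ-nominatim") is True   # normal registration
pool.destroy()                                  # reload tears the pool down
assert pool.add("late-client") is False         # late connect is rejected, not a crash
```

In the real director the pool lives in C++, so touching destroyed state is a use-after-free and a segfault rather than a Python exception; the point is only that additions after teardown must be detected and rejected under the same lock.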
Is this reproducible with just one client?
I will have to reproduce it myself so we can write a test for this, and I would be glad if there were a simple way to reproduce it.
|I will test this after the weekend and let you know.|
I did some tests, but at the moment the director is running fine and I cannot reproduce the crash.
I'll let you know once the director crashes again.
I got another crash.
Disabled access from all filedaemons by blocking them with iptables, except for one host.
In this situation, the director keeps on running.
|After a minute or so, I removed the iptables block rule. All clients are now connected, and the director seems to run fine.|
It seems that we have crossed a threshold in the number of clients.
I block access to the director except for a small number of clients (<30).
Start the director -> runs fine and shows client initiated connection clients.
As soon as I remove the block, the director crashes.
To rule out 'bad clients', I have tried blocking different parts of the network, but that did not solve it.
The only success I'm having now is with the following procedure:
- Block all clients
- Stop storage daemon (runs on the same machine)
- Start director
- Allow clients subnet by subnet
- Remove last blockage
- Start the storage daemon.
Anything I can do to test?
|2019-04-30 14:23||jurgengoedbloed||New Issue|
|2019-04-30 14:23||jurgengoedbloed||File Added: bareos-dir.1640.bactrace|
|2019-04-30 15:06||arogge||Assigned To||=> arogge|
|2019-04-30 15:06||arogge||Status||new => feedback|
|2019-04-30 15:06||arogge||Note Added: 0003349|
|2019-05-02 17:12||jurgengoedbloed||Note Added: 0003352|
|2019-05-02 17:12||jurgengoedbloed||Status||feedback => assigned|
|2019-05-02 17:25||jurgengoedbloed||Note Added: 0003353|
|2019-05-03 09:13||jurgengoedbloed||Note Added: 0003354|
|2019-05-03 09:58||arogge||Note Added: 0003355|
|2019-05-03 10:01||arogge||Status||assigned => feedback|
|2019-05-03 10:12||jurgengoedbloed||File Added: bareos.1640.traceback|
|2019-05-03 10:12||jurgengoedbloed||Note Added: 0003356|
|2019-05-03 10:12||jurgengoedbloed||Status||feedback => assigned|
|2019-05-03 10:16||arogge||Status||assigned => feedback|
|2019-05-03 10:16||arogge||Note Added: 0003357|
|2019-05-03 10:55||jurgengoedbloed||Note Added: 0003358|
|2019-05-03 10:55||jurgengoedbloed||Status||feedback => assigned|
|2019-05-03 11:29||jurgengoedbloed||File Added: bareos.63319.traceback|
|2019-05-03 11:29||jurgengoedbloed||Note Added: 0003359|
|2019-05-03 11:58||arogge||Status||assigned => acknowledged|
|2019-05-03 11:58||arogge||Note Added: 0003360|
|2019-05-03 17:10||jurgengoedbloed||Note Added: 0003361|
|2019-05-09 15:34||jurgengoedbloed||Note Added: 0003363|
|2019-05-15 08:29||jurgengoedbloed||Note Added: 0003369|
|2019-05-15 08:42||jurgengoedbloed||Note Added: 0003370|
|2019-05-24 15:03||jurgengoedbloed||Note Added: 0003381|
|2019-07-10 17:45||arogge||Assigned To||arogge =>|