View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0001035||bareos-core||[All Projects] storage daemon||public||2019-01-15 10:38||2019-02-11 09:43|
|Target Version||Fixed in Version|
|Summary||0001035: storage daemon crash 18.2.4-rc2|
|Description||The storage daemon crashes almost every night.|
I'm evaluating Bareos since a few months. The crashes occured
on 17.2.7 too but only since December 2018.
Maybe it's related to the CentOS 7.6 upgrade?
|Additional Information||Linux 3.10.0-957.1.3.el7.x86_64 0000001 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux|
Name : bareos-storage
Arch : x86_64
Version : 18.2.4rc2
Release : 86.1.el7
Size : 205 k
Repo : installed
From repo : bareos_bareos-18.2
Summary : Bareos Storage daemon
URL : http://www.bareos.org/
Catalog Database Version: 2171
|Tags||No tags attached.|
bareos.619590.traceback (13,232 bytes)
Can you provide a backtrace for 17.2, too?
Does the SD also crash if you disable statistics?
bareos.254990.traceback (6,160 bytes)
With disabled statistics, the SD's didn't crash last night.
But the Director does (bareos.254990.traceback).
I'll disable the storage statistics for the next schedule on the director too.
After that I'll try 17.2.7.
No crashes last night but some errors:
bareos-dir JobId 1161: Fatal error: Unable to authenticate with File daemon at ...
The FD runs on the storage server to backup itself.
Next I'll try 17.2.7
17.2.7 crashes too with activated statistics
See bareos.1491023.traceback and bareos.754636.traceback
bareos.1491023.traceback (7,154 bytes)
bareos.754636.traceback (20,752 bytes)
Now 17.2.7 crashes with statistics off.
bareos.2834141.traceback (7,946 bytes)
bareos.3909678.traceback (8,340 bytes)
We would love to look at your configuration. Can you share your bareos-sd and director configuration? The latter one can be generated with "status all verbose".
Please note that there would be passwords inside the configuration that need to be obfuscated.
If you cannot upload the information here, you can also send it directly to me.
Can you check (e.g. from bareos.log) wether the storage-daemon crashed everytime it labeled a volume in one of the pools with maximum block size set to a custom value?
Indeed, it does. But it didn't crashed many times before.
Here are some bareos.log lines around the crash:
27-Jan 03:43 elbcg1-sd JobId 2362: User defined maximum volume capacity 107,374,182,400 exceeded on device "FileConDemo7" (/mnt/cephfs/bareos/storage/consolidated/demo/).
27-Jan 03:43 elbcg1-sd JobId 2362: End of medium on Volume "ConDemo-0600" Bytes=107,369,988,277 Blocks=25,599 at 27-Jan-2019 03:43.
27-Jan 03:43 bareos-dir JobId 2362: Created new Volume "ConDemo-0604" in catalog.
27-Jan 03:43 elbcg1-sd: ABORTING due to ERROR in smartall.c:229
Overrun buffer: len=145 addr=7f07dc4369f8 allocated: block.c:123 called from mem_pool.c:257
27-Jan 03:43 bareos-dir JobId 2362: Fatal error: Director's comm line to SD dropped.
27-Jan 04:01 elbcg1-sd JobId 2369: End of Volume at file 24 on device "FileIncDemo2" (/mnt/cephfs/bareos/storage/incremental/demo/), Volume "IncDemo-0125"
27-Jan 04:01 elbcg1-sd JobId 2369: Ready to read from volume "IncDemo-0119" on device "FileIncDemo2" (/mnt/cephfs/bareos/storage/incremental/demo/).
27-Jan 04:01 elbcg1-sd JobId 2369: Forward spacing Volume "IncDemo-0119" to file:block 0:207710437.
27-Jan 04:17 elbcg1-sd JobId 2369: User defined maximum volume capacity 107,374,182,400 exceeded on device "FileConDemo1" (/mnt/cephfs/bareos/storage/consolidated/demo/).
27-Jan 04:17 elbcg1-sd JobId 2369: End of medium on Volume "ConDemo-0605" Bytes=107,369,988,300 Blocks=25,599 at 27-Jan-2019 04:17.
27-Jan 04:17 bareos-dir JobId 2369: Created new Volume "ConDemo-0607" in catalog.
27-Jan 04:17 elbcg1-sd: ABORTING due to ERROR in smartall.c:229
Overrun buffer: len=145 addr=7fe0500a46b8 allocated: block.c:123 called from mem_pool.c:257
27-Jan 04:17 bareos-dir JobId 2369: Fatal error: Director's comm line to SD dropped.
Seems like it crashes when labeling a new filesystem volume where Maximum Block Size set.
The setting should only be used on tape drives, so using it on filesystem volumes seems to be not tested thoroughly.
I'll try to reproduce the issue.
Sorry, I cannot make it crash on my machine with the most recent 17.2 branch or 17.2.7 right now.
Is it correct you are running director and sd in version 17.2.7 on a CentOS 7.6? Are you using our RPM packages?
Yes, it's CentOS 7.6 but with self compiled packages.
I made it with instructions from this site https://www.spheron1.uk/2018/03/14/building-bareos-rpms-on-centos-6-7/
The 18.2.4-rc2 version is from your download area.
The Maximum Block Size was set to speedup the I/O to/from our CephFS to tape (4MB object size and hints from https://www.bareos.org/en/Whitepapers/articles/Speed_Tuning_of_Tape_Drives.html).
I'm not sure if it's still necessary. I have a tape server now with lot's of SSDs for spooling to prevent shoe-shining.
What do you think about running the SD with a reasonable debug level and what would that be?
Hmm... can you make this happen with our 18.2.4rc2 packages and statistics disabled?
I had another look at the tracebacks and the type of crash changed when you disabled statistics: With stats enabled you were running into a known issue with statistics. With stats disabled, you're now running into this new issue with labeling.
Apart from your crashes:
We recommend setting Maximum Block Size for tape pools only. Even if you set it for your tapes you don't need to set it for your filesystem volumes. Also, if you're running Ceph, there is a RADOS backend in bareos-sd bypassing the need for cephfs interacting with Ceph directly. Did you already try that?
Have you run into read/write-performance issues with your cephfs setup or why did you set Maximum Block Size on your filesystem volumes at all?
With 18.2.4-rc2 it didn't crash with statistics disabled. But I had it only for one day disabled.
I'll install 18.2.4-rc2 and let it run for a while. Is the statistics issue fixed in that version?
I tried the Droplet backend but had no luck.
Is it worth now to try Droplet or RADOS with 18.2.4-rc2?
Running the SD at another debug level won't help.
What happens is that the buffer for dev->block overruns (i.e. there is a write behind the end of the buffer) and when it is free()d during write_new_volume_label_to_dev()'s call to set_blocksized() the smart allocator kills the SD because it detects the overrun.
I'll need a way to reproduce this and look at it with a debugger. If you can make it crash with our 18.2.4rc2 RPMs, maybe I can make it crash with these, too.
Can you btw. make it crash by issuing a label command in bconsole to label a new volume in one of the Pools with Maximum Block Size set?
|No, it doesn't crash while labeling. Unfortunately it crashes sporadically.|
Hmm.. can't see any differences with or without statistics.
See bareos.1491023.traceback with statistics and bareos.2834141.traceback without.
Both crashes at set_blocksizes -> free_block etc.
The very first crash you posted was different. It crashed during sha1_block_data_order_avx() in openssl.
That's why I suggested to test again with 18.2.4rc2 and disabled statistics.
|just installed 18.2.4-rc2 and have statistics disabled. :-)|
Sorry, but the 18.2.4-rc2 SD crashes too.
Thread 8 (Thread 0x7f3138ff9700 (LWP 2058295)):
#0 0x00007f315d199179 in waitpid () from /lib64/libpthread.so.0
0000001 0x00007f315dace1e4 in SignalHandler (sig=11) at /usr/src/debug/bareos-18.2.4rc2/src/lib/signal.cc:241
0000002 <signal handler called>
0000003 SmartAllocMsg (file=0x7f315daf5a88 "/home/abuild/rpmbuild/BUILD/bareos-18.2.4rc2/src/lib/smartall.cc", line=line@entry=229, fmt=0x7f315daf5b58 "Overrun buffer: len=%d addr=%p allocated: %s:%d called from %s:%d\n", file=0x7f315daf5a88 "/home/abuild/rpmbuild/BUILD/bareos-18.2.4rc2/src/lib/smartall.cc") at /usr/src/debug/bareos-18.2.4rc2/src/lib/smartall.cc:113
0000004 0x00007f315dacef3c in sm_free (file=0x7f315daf2a38 "/home/abuild/rpmbuild/BUILD/bareos-18.2.4rc2/src/lib/mem_pool.cc", line=258, fp=0x7f3114042a48) at /usr/src/debug/bareos-18.2.4rc2/src/lib/smartall.cc:230
0000005 0x00007f315dabee8c in SmFreePoolMemory (fname=fname@entry=0x7f315e1ef280 "/home/abuild/rpmbuild/BUILD/bareos-18.2.4rc2/src/stored/block.cc", lineno=lineno@entry=176, obuf=0x7f3114042a60 "") at /usr/src/debug/bareos-18.2.4rc2/src/lib/mem_pool.cc:258
0000006 0x00007f315e1ca6da in storagedaemon::FreeBlock (block=<optimized out>) at /usr/src/debug/bareos-18.2.4rc2/src/stored/block.cc:176
0000007 0x00007f315e1d0389 in storagedaemon::Device::SetBlocksizes (this=0x7f314c038178, dcr=dcr@entry=0x7f3114052c98) at /usr/src/debug/bareos-18.2.4rc2/src/stored/dev.cc:555
0000008 0x00007f315e1d4e9c in storagedaemon::WriteNewVolumeLabelToDev (dcr=dcr@entry=0x7f3114052c98, VolName=VolName@entry=0x7f3114052da8 "ConDemo-0700", PoolName=PoolName@entry=0x7f3114052e28 "ConDemo", relabel=relabel@entry=false) at /usr/src/debug/bareos-18.2.4rc2/src/stored/label.cc:471
This is actually "good" news. If our 18.2 packages crash then it is not your build that is broken, but the software itself.
I'll see whether or not the 18.2 packages you've installed are compiled with support for heap corruption detection and come back to you.
Can you try with the packages from http://download.bareos.org/bareos/people/arogge/smcheck/CentOS_7/x86_64/ ?
If you run "bareos-sd -f -d 200 > /some/logfile" it will run in debug mode with memory corruption detection enabled. The resulting logfile might help me to understand the issue.
With the debug version the SD doesn't crash anymore.
But on the other hand I've changed the config a little because the consolidation jobs failed.
Last night the SD stops responding to the director.
It is still running if you need some debug information of the process.
|2019-01-15 10:38||fbrendel||New Issue|
|2019-01-15 10:38||fbrendel||File Added: bareos.619590.traceback|
|2019-01-15 15:39||arogge||Assigned To||=> arogge|
|2019-01-15 15:39||arogge||Status||new => assigned|
|2019-01-15 15:40||arogge||Note Added: 0003177|
|2019-01-15 15:40||arogge||Status||assigned => feedback|
|2019-01-16 09:08||fbrendel||File Added: bareos.254990.traceback|
|2019-01-16 09:09||fbrendel||Note Added: 0003178|
|2019-01-16 09:09||fbrendel||Status||feedback => assigned|
|2019-01-17 10:51||fbrendel||Note Added: 0003203|
|2019-01-21 10:12||fbrendel||File Added: bareos.1491023.traceback|
|2019-01-21 10:12||fbrendel||File Added: bareos.754636.traceback|
|2019-01-21 10:12||fbrendel||Note Added: 0003204|
|2019-01-28 10:35||fbrendel||File Added: bareos.2834141.traceback|
|2019-01-28 10:35||fbrendel||File Added: bareos.3909678.traceback|
|2019-01-28 10:35||fbrendel||Note Added: 0003211|
|2019-01-28 12:19||arogge||Note Added: 0003212|
|2019-01-29 13:22||arogge||Status||assigned => feedback|
|2019-01-29 13:22||arogge||Note Added: 0003213|
|2019-01-29 13:22||arogge||Note Edited: 0003213||View Revisions|
|2019-01-29 13:51||fbrendel||Note Added: 0003214|
|2019-01-29 13:51||fbrendel||Status||feedback => assigned|
|2019-01-29 14:15||arogge||Note Added: 0003215|
|2019-01-29 15:00||arogge||Status||assigned => feedback|
|2019-01-29 15:00||arogge||Note Added: 0003216|
|2019-01-29 15:00||arogge||Note Edited: 0003216||View Revisions|
|2019-01-29 15:26||fbrendel||Note Added: 0003217|
|2019-01-29 15:26||fbrendel||Status||feedback => assigned|
|2019-01-29 15:45||arogge||Note Added: 0003218|
|2019-01-29 16:03||fbrendel||Note Added: 0003219|
|2019-01-29 16:07||arogge||Note Added: 0003220|
|2019-01-29 16:23||fbrendel||Note Added: 0003221|
|2019-01-29 16:41||fbrendel||Note Added: 0003222|
|2019-01-29 17:31||arogge||Note Added: 0003223|
|2019-01-29 17:39||fbrendel||Note Added: 0003224|
|2019-01-30 09:57||fbrendel||Note Added: 0003225|
|2019-01-30 11:41||arogge||Note Added: 0003226|
|2019-01-31 15:50||arogge||Note Added: 0003229|
|2019-01-31 15:50||arogge||Status||assigned => feedback|
|2019-02-11 09:43||fbrendel||Note Added: 0003256|
|2019-02-11 09:43||fbrendel||Status||feedback => assigned|