View Issue Details

ID: 0001035
Project: bareos-core
Category: [All Projects] storage daemon
View Status: public
Last Update: 2019-03-14 09:45
Reporter: fbrendel
Assigned To: arogge
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: unable to reproduce
Platform: Linux
OS: CentOS
OS Version: 7
Product Version:
Target Version:
Fixed in Version:
Summary: 0001035: storage daemon crash 18.2.4-rc2
Description: The storage daemon crashes almost every night.
I have been evaluating Bareos for a few months. The crashes occurred
on 17.2.7 too, but only since December 2018.
Maybe it's related to the CentOS 7.6 upgrade?
Additional Information: Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Installed Packages
Name : bareos-storage
Arch : x86_64
Version : 18.2.4rc2
Release : 86.1.el7
Size : 205 k
Repo : installed
From repo : bareos_bareos-18.2
Summary : Bareos Storage daemon
URL : http://www.bareos.org/

Catalog Database Version: 2171
Tags: No tags attached.

Activities

fbrendel (reporter) 2019-01-15 10:38

bareos.619590.traceback (13,232 bytes)
arogge (developer) 2019-01-15 15:40 ~0003177

Can you provide a backtrace for 17.2, too?
Does the SD also crash if you disable statistics?

Thank you!
fbrendel (reporter) 2019-01-16 09:08

bareos.254990.traceback (6,160 bytes)
fbrendel (reporter) 2019-01-16 09:09 ~0003178

With statistics disabled, the SDs didn't crash last night.
But the Director did (bareos.254990.traceback).
I'll disable the storage statistics for the next schedule on the Director too.

After that I'll try 17.2.7.

Thanks!
fbrendel (reporter) 2019-01-17 10:51 ~0003203

No crashes last night, but some errors:
bareos-dir JobId 1161: Fatal error: Unable to authenticate with File daemon at ...
The FD runs on the storage server to back itself up.

Next I'll try 17.2.7.
fbrendel (reporter) 2019-01-21 10:12 ~0003204

17.2.7 crashes too with statistics activated.
See bareos.1491023.traceback and bareos.754636.traceback.

bareos.1491023.traceback (7,154 bytes)
bareos.754636.traceback (20,752 bytes)
fbrendel (reporter) 2019-01-28 10:35 ~0003211

Now 17.2.7 crashes with statistics off.

bareos.2834141.traceback (7,946 bytes)
bareos.3909678.traceback (8,340 bytes)
arogge (developer) 2019-01-28 12:19 ~0003212

We would love to look at your configuration. Can you share your bareos-sd and Director configuration? The latter can be generated with "status all verbose".
Please note that there will be passwords inside the configuration that need to be obfuscated.

If you cannot upload the information here, you can also send it directly to me.
arogge (developer) 2019-01-29 13:22 ~0003213

Can you check (e.g. from bareos.log) whether the storage daemon crashed every time it labeled a volume in one of the pools with Maximum Block Size set to a custom value?

fbrendel (reporter) 2019-01-29 13:51 ~0003214

Indeed, it does. But it didn't crash many times before.
Here are some bareos.log lines around the crashes:

27-Jan 03:43 elbcg1-sd JobId 2362: User defined maximum volume capacity 107,374,182,400 exceeded on device "FileConDemo7" (/mnt/cephfs/bareos/storage/consolidated/demo/).
27-Jan 03:43 elbcg1-sd JobId 2362: End of medium on Volume "ConDemo-0600" Bytes=107,369,988,277 Blocks=25,599 at 27-Jan-2019 03:43.
27-Jan 03:43 bareos-dir JobId 2362: Created new Volume "ConDemo-0604" in catalog.
27-Jan 03:43 elbcg1-sd: ABORTING due to ERROR in smartall.c:229
Overrun buffer: len=145 addr=7f07dc4369f8 allocated: block.c:123 called from mem_pool.c:257
27-Jan 03:43 bareos-dir JobId 2362: Fatal error: Director's comm line to SD dropped.

27-Jan 04:01 elbcg1-sd JobId 2369: End of Volume at file 24 on device "FileIncDemo2" (/mnt/cephfs/bareos/storage/incremental/demo/), Volume "IncDemo-0125"
27-Jan 04:01 elbcg1-sd JobId 2369: Ready to read from volume "IncDemo-0119" on device "FileIncDemo2" (/mnt/cephfs/bareos/storage/incremental/demo/).
27-Jan 04:01 elbcg1-sd JobId 2369: Forward spacing Volume "IncDemo-0119" to file:block 0:207710437.
27-Jan 04:17 elbcg1-sd JobId 2369: User defined maximum volume capacity 107,374,182,400 exceeded on device "FileConDemo1" (/mnt/cephfs/bareos/storage/consolidated/demo/).
27-Jan 04:17 elbcg1-sd JobId 2369: End of medium on Volume "ConDemo-0605" Bytes=107,369,988,300 Blocks=25,599 at 27-Jan-2019 04:17.
27-Jan 04:17 bareos-dir JobId 2369: Created new Volume "ConDemo-0607" in catalog.
27-Jan 04:17 elbcg1-sd: ABORTING due to ERROR in smartall.c:229
Overrun buffer: len=145 addr=7fe0500a46b8 allocated: block.c:123 called from mem_pool.c:257
27-Jan 04:17 bareos-dir JobId 2369: Fatal error: Director's comm line to SD dropped.
arogge (developer) 2019-01-29 14:15 ~0003215

Seems like it crashes when labeling a new filesystem volume where Maximum Block Size is set.
The setting should only be used on tape drives, so using it on filesystem volumes seems not to be thoroughly tested.

I'll try to reproduce the issue.
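
For illustration, here is a minimal, purely hypothetical sketch of the class of bug being described: a heap buffer sized from one block-size value, then written using a larger one. This is not the actual Bareos block handling; only the len=145 figure is taken from the SD's abort message, the second size is invented.

#include <cstdlib>
#include <cstring>

// Hypothetical sketch of the suspected class of bug, not Bareos code.
int main() {
  const size_t allocated_len = 145;  // size the buffer was allocated with
  const size_t assumed_len = 160;    // larger size a later write assumes (hypothetical)

  char* buf = static_cast<char*>(malloc(allocated_len));

  // A write that trusts assumed_len instead of allocated_len runs past the
  // end of the buffer. With plain malloc this often goes unnoticed at the
  // time of the write; a guard-byte allocator reports it later, on free.
  memset(buf, 0, assumed_len);  // heap overrun by 15 bytes

  free(buf);  // this is where an allocator like SmartAlloc would abort
}
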
arogge (developer) 2019-01-29 15:00 ~0003216

Sorry, I cannot make it crash on my machine with the most recent 17.2 branch or 17.2.7 right now.
Is it correct that you are running the Director and SD in version 17.2.7 on CentOS 7.6? Are you using our RPM packages?

fbrendel (reporter) 2019-01-29 15:26 ~0003217

Yes, it's CentOS 7.6, but with self-compiled packages.
I built them following the instructions from this site: https://www.spheron1.uk/2018/03/14/building-bareos-rpms-on-centos-6-7/
The 18.2.4-rc2 version is from your download area.

The Maximum Block Size was set to speed up the I/O to/from our CephFS to tape (4 MB object size and hints from https://www.bareos.org/en/Whitepapers/articles/Speed_Tuning_of_Tape_Drives.html).
I'm not sure if it's still necessary. I now have a tape server with lots of SSDs for spooling to prevent shoe-shining.

What do you think about running the SD at a reasonable debug level, and what would that be?
arogge (developer) 2019-01-29 15:45 ~0003218

Hmm... can you make this happen with our 18.2.4rc2 packages and statistics disabled?
I had another look at the tracebacks, and the type of crash changed when you disabled statistics: with stats enabled you were running into a known issue with statistics; with stats disabled, you're now running into this new issue with labeling.

Apart from your crashes:
We recommend setting Maximum Block Size for tape pools only. Even if you set it for your tapes, you don't need to set it for your filesystem volumes. Also, if you're running Ceph, there is a RADOS backend in bareos-sd that interacts with Ceph directly, bypassing the need for CephFS. Did you already try that?
Have you run into read/write performance issues with your CephFS setup, or why did you set Maximum Block Size on your filesystem volumes at all?
fbrendel (reporter) 2019-01-29 16:03 ~0003219

With 18.2.4-rc2 it didn't crash with statistics disabled. But I had it disabled for only one day.
I'll install 18.2.4-rc2 and let it run for a while. Is the statistics issue fixed in that version?

I tried the Droplet backend but had no luck.
Is it worth trying Droplet or RADOS with 18.2.4-rc2 now?
arogge (developer) 2019-01-29 16:07 ~0003220

Running the SD at another debug level won't help.
What happens is that the buffer for dev->block overruns (i.e. there is a write past the end of the buffer), and when it is free()d during write_new_volume_label_to_dev()'s call to set_blocksizes(), the smart allocator kills the SD because it detects the overrun.
I'll need a way to reproduce this and look at it with a debugger. If you can make it crash with our 18.2.4rc2 RPMs, maybe I can make it crash with these, too.

By the way, can you make it crash by issuing a label command in bconsole to label a new volume in one of the pools with Maximum Block Size set?
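
To illustrate the detection mechanism just described, here is a minimal sketch of a guard-byte allocator in the style of SmartAlloc. The code is hypothetical and heavily simplified, not the actual smartall.cc implementation: it plants one sentinel byte past the user area on allocation and verifies it on free, which is why the abort happens at free time rather than at the bad write.

#include <cstdio>
#include <cstdlib>
#include <cstring>

// Minimal, hypothetical guard-byte allocator sketch (not smartall.cc).

static const unsigned char kGuard = 0x55;  // sentinel value

struct Header {
  size_t len;  // requested size, needed to locate the guard byte on free
};

void* guarded_alloc(size_t len) {
  // Layout: [Header][user area of len bytes][1 guard byte]
  Header* h = static_cast<Header*>(malloc(sizeof(Header) + len + 1));
  h->len = len;
  unsigned char* user = reinterpret_cast<unsigned char*>(h + 1);
  user[len] = kGuard;  // plant the sentinel just past the buffer end
  return user;
}

void guarded_free(void* ptr, const char* file, int line) {
  unsigned char* user = static_cast<unsigned char*>(ptr);
  Header* h = reinterpret_cast<Header*>(user) - 1;
  if (user[h->len] != kGuard) {
    // Some write between allocation and free ran past the buffer end;
    // this is where the SD aborts, not where the bad write happened.
    fprintf(stderr, "Overrun buffer: len=%zu addr=%p freed at %s:%d\n",
            h->len, ptr, file, line);
    abort();
  }
  free(h);
}

int main() {
  char* buf = static_cast<char*>(guarded_alloc(16));
  memset(buf, 'x', 17);  // off-by-one write clobbers the guard byte
  guarded_free(buf, __FILE__, __LINE__);  // overrun detected here
}
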
fbrendel (reporter) 2019-01-29 16:23 ~0003221

No, it doesn't crash while labeling. Unfortunately, it crashes only sporadically.
fbrendel (reporter) 2019-01-29 16:41 ~0003222

Hmm, I can't see any difference with or without statistics.
See bareos.1491023.traceback with statistics and bareos.2834141.traceback without.
Both crash at set_blocksizes -> free_block etc.
arogge (developer) 2019-01-29 17:31 ~0003223

The very first crash you posted was different. It crashed during sha1_block_data_order_avx() in OpenSSL.
That's why I suggested testing again with 18.2.4rc2 and statistics disabled.
fbrendel (reporter) 2019-01-29 17:39 ~0003224

Just installed 18.2.4-rc2 and have statistics disabled. :-)
fbrendel (reporter) 2019-01-30 09:57 ~0003225

Sorry, but the 18.2.4-rc2 SD crashes too.

Thread 8 (Thread 0x7f3138ff9700 (LWP 2058295)):
#0 0x00007f315d199179 in waitpid () from /lib64/libpthread.so.0
#1 0x00007f315dace1e4 in SignalHandler (sig=11) at /usr/src/debug/bareos-18.2.4rc2/src/lib/signal.cc:241
#2 <signal handler called>
#3 SmartAllocMsg (file=0x7f315daf5a88 "/home/abuild/rpmbuild/BUILD/bareos-18.2.4rc2/src/lib/smartall.cc", line=line@entry=229, fmt=0x7f315daf5b58 "Overrun buffer: len=%d addr=%p allocated: %s:%d called from %s:%d\n", file=0x7f315daf5a88 "/home/abuild/rpmbuild/BUILD/bareos-18.2.4rc2/src/lib/smartall.cc") at /usr/src/debug/bareos-18.2.4rc2/src/lib/smartall.cc:113
#4 0x00007f315dacef3c in sm_free (file=0x7f315daf2a38 "/home/abuild/rpmbuild/BUILD/bareos-18.2.4rc2/src/lib/mem_pool.cc", line=258, fp=0x7f3114042a48) at /usr/src/debug/bareos-18.2.4rc2/src/lib/smartall.cc:230
#5 0x00007f315dabee8c in SmFreePoolMemory (fname=fname@entry=0x7f315e1ef280 "/home/abuild/rpmbuild/BUILD/bareos-18.2.4rc2/src/stored/block.cc", lineno=lineno@entry=176, obuf=0x7f3114042a60 "") at /usr/src/debug/bareos-18.2.4rc2/src/lib/mem_pool.cc:258
#6 0x00007f315e1ca6da in storagedaemon::FreeBlock (block=<optimized out>) at /usr/src/debug/bareos-18.2.4rc2/src/stored/block.cc:176
#7 0x00007f315e1d0389 in storagedaemon::Device::SetBlocksizes (this=0x7f314c038178, dcr=dcr@entry=0x7f3114052c98) at /usr/src/debug/bareos-18.2.4rc2/src/stored/dev.cc:555
#8 0x00007f315e1d4e9c in storagedaemon::WriteNewVolumeLabelToDev (dcr=dcr@entry=0x7f3114052c98, VolName=VolName@entry=0x7f3114052da8 "ConDemo-0700", PoolName=PoolName@entry=0x7f3114052e28 "ConDemo", relabel=relabel@entry=false) at /usr/src/debug/bareos-18.2.4rc2/src/stored/label.cc:471
arogge (developer) 2019-01-30 11:41 ~0003226

This is actually "good" news. If our 18.2 packages crash, then it is not your build that is broken, but the software itself.
I'll check whether the 18.2 packages you've installed are compiled with support for heap corruption detection and get back to you.
arogge (developer) 2019-01-31 15:50 ~0003229

Can you try with the packages from http://download.bareos.org/bareos/people/arogge/smcheck/CentOS_7/x86_64/ ?
If you run "bareos-sd -f -d 200 > /some/logfile", it will run in debug mode with memory corruption detection enabled. The resulting logfile might help me understand the issue.
fbrendel (reporter) 2019-02-11 09:43 ~0003256

With the debug version, the SD doesn't crash anymore.
On the other hand, I've changed the config a little because the consolidation jobs failed.

Last night the SD stopped responding to the Director.
It is still running if you need some debug information from the process.
arogge (developer) 2019-03-13 18:34 ~0003283

If you cannot reproduce the problem anymore, can we close the bug?
fbrendel (reporter) 2019-03-14 09:11 ~0003290

I've upgraded to 18.2.5. No crashes since then.
So yes, we can close it. :-)

Issue History

Date Modified Username Field Change
2019-01-15 10:38 fbrendel New Issue
2019-01-15 10:38 fbrendel File Added: bareos.619590.traceback
2019-01-15 15:39 arogge Assigned To => arogge
2019-01-15 15:39 arogge Status new => assigned
2019-01-15 15:40 arogge Note Added: 0003177
2019-01-15 15:40 arogge Status assigned => feedback
2019-01-16 09:08 fbrendel File Added: bareos.254990.traceback
2019-01-16 09:09 fbrendel Note Added: 0003178
2019-01-16 09:09 fbrendel Status feedback => assigned
2019-01-17 10:51 fbrendel Note Added: 0003203
2019-01-21 10:12 fbrendel File Added: bareos.1491023.traceback
2019-01-21 10:12 fbrendel File Added: bareos.754636.traceback
2019-01-21 10:12 fbrendel Note Added: 0003204
2019-01-28 10:35 fbrendel File Added: bareos.2834141.traceback
2019-01-28 10:35 fbrendel File Added: bareos.3909678.traceback
2019-01-28 10:35 fbrendel Note Added: 0003211
2019-01-28 12:19 arogge Note Added: 0003212
2019-01-29 13:22 arogge Status assigned => feedback
2019-01-29 13:22 arogge Note Added: 0003213
2019-01-29 13:22 arogge Note Edited: 0003213 View Revisions
2019-01-29 13:51 fbrendel Note Added: 0003214
2019-01-29 13:51 fbrendel Status feedback => assigned
2019-01-29 14:15 arogge Note Added: 0003215
2019-01-29 15:00 arogge Status assigned => feedback
2019-01-29 15:00 arogge Note Added: 0003216
2019-01-29 15:00 arogge Note Edited: 0003216 View Revisions
2019-01-29 15:26 fbrendel Note Added: 0003217
2019-01-29 15:26 fbrendel Status feedback => assigned
2019-01-29 15:45 arogge Note Added: 0003218
2019-01-29 16:03 fbrendel Note Added: 0003219
2019-01-29 16:07 arogge Note Added: 0003220
2019-01-29 16:23 fbrendel Note Added: 0003221
2019-01-29 16:41 fbrendel Note Added: 0003222
2019-01-29 17:31 arogge Note Added: 0003223
2019-01-29 17:39 fbrendel Note Added: 0003224
2019-01-30 09:57 fbrendel Note Added: 0003225
2019-01-30 11:41 arogge Note Added: 0003226
2019-01-31 15:50 arogge Note Added: 0003229
2019-01-31 15:50 arogge Status assigned => feedback
2019-02-11 09:43 fbrendel Note Added: 0003256
2019-02-11 09:43 fbrendel Status feedback => assigned
2019-03-13 18:34 arogge Status assigned => feedback
2019-03-13 18:34 arogge Note Added: 0003283
2019-03-14 09:11 fbrendel Note Added: 0003290
2019-03-14 09:11 fbrendel Status feedback => assigned
2019-03-14 09:45 arogge Status assigned => closed
2019-03-14 09:45 arogge Resolution open => unable to reproduce