View Issue Details

IDProjectCategoryView StatusLast Update
0000522bareos-corestorage daemonpublic2019-12-18 15:25
Reporterzacha Assigned Tostephand  
PriorityhighSeveritycrashReproducibilityrandom
Status closedResolutionfixed 
PlatformLinuxOSDebianOS Version8
Product Version14.2.5 
Summary0000522: storage daemon crashes ocassionally when starting a new job
Descriptionwe have recently upgraded out dir/sd from bacula to bareos 14 and now on the sd keeps crashing. it seems to happen ocassionally when a new job starts. it runs some time with out a problem and then ocassionally crashes.

we tried upgrading to 15.x but this did not fix the problem.

it looks like following:

-either the sd crashes and a traceback is reported
-or it hangs in a deadlock not accepting any commands from the dir nor doing anything

I don't know if these are both the same error but at least here is a traceback:

Created /var/lib/bareos/bareos-sd.core.8572 for doing postmortem debugging [New LWP 8573] [New LWP 8575] [New LWP 13600] [New LWP 8572] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/bareos-sd'.
#0 0x00007f300f8f318d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
$1 = 0x626700 <my_name> "valerian-sd"
$2 = 0x11ad058 "bareos-sd"
$3 = 0x11ad098 "/usr/sbin/bareos-sd"
$4 = 0x0
$5 = 0x7f3010a055a0 "15.3.0 (24 August 2015)"
$6 = 0x7f3010a0558c "x86_64-pc-linux-gnu"
$7 = 0x7f3010a05585 "debian"
$8 = 0x7f3010a05567 "Debian GNU/Linux 8.0 (jessie)"
$9 = "valerian", '\000' <repeats 247 times>
$10 = 0x7f3010a05a70 "debian Debian GNU/Linux 8.0 (jessie)"
Environment variable "TestName" not defined.
#0 0x00007f300f8f318d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
0000001 0x00007f30109d7724 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:173
0000002 0x00007f30109e748c in check_deadlock () at lockmgr.c:566
0000003 0x00007f300f8ec0a4 in start_thread (arg=0x7f300dfef700) at pthread_create.c:309
0000004 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7f301150f740 (LWP 8572)):
#0 0x00007f300e7e250d in poll () at ../sysdeps/unix/syscall-template.S:81
0000001 0x00007f30109ccce3 in bnet_thread_server_tcp (addr_list=<optimized out>, max_clients=<optimized out>, sockfds=<optimized out>, client_wq=<optimized out>, nokeepalive=<optimized out>, handle_client_request=<optimized out>) at bnet_server_tcp.c:298
0000002 0x0000000000408c0b in main (argc=18546824, argv=0x1) at stored.c:325

Thread 3 (Thread 0x7f300d7ee700 (LWP 13600)):
#0 0x00007f300f8f34c9 in __libc_waitpid (pid=13601, stat_loc=0x7f300d7ececc, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:40
0000001 0x00007f30109f5d94 in signal_handler (sig=11) at signal.c:240
0000002 <signal handler called>
0000003 __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:30
0000004 0x00007f30109e061d in dlist::binary_search (this=0x11ad2b8, item=0x7f300d7ed930, compare=0x7f3010e62780 <compare_by_volumename(void*, void*)>) at dlist.c:247
0000005 0x00007f3010e646ab in find_volume (VolumeName=<optimized out>) at vol_mgr.c:678
0000006 DCR::can_i_use_volume (this=0x7f3000000e48) at vol_mgr.c:875
0000007 0x000000000040a507 in SD_DCR::dir_find_next_appendable_volume (this=0x7f3000098c58) at askdir.c:303
0000008 0x00007f3010e593c9 in reserve_device (rctx=...) at reserve.c:748
0000009 0x00007f3010e59e82 in search_res_for_device (rctx=...) at reserve.c:576
0000010 0x00007f3010e5a30f in find_suitable_device_for_job (jcr=jcr@entry=0x7f300008f168, rctx=...) at reserve.c:525
0000011 0x00007f3010e5b040 in use_device_cmd (jcr=0x7f300008f168) at reserve.c:301
0000012 use_cmd (jcr=0x7f300008f168) at reserve.c:67
0000013 0x00000000004106ca in handle_director_connection (dir=0x11d0998) at dir_cmd.c:286
0000014 0x0000000000418a67 in handle_connection_request (arg=0x11d0998) at socket_server.c:99
0000015 0x00007f30109ff725 in workq_server (arg=arg@entry=0x626b40 <socket_workq>) at workq.c:335
0000016 0x00007f30109e751f in lmgr_thread_launcher (x=0x11d46f8) at lockmgr.c:926
0000017 0x00007f300f8ec0a4 in start_thread (arg=0x7f300d7ee700) at pthread_create.c:309
#18 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f300cfed700 (LWP 8575)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
0000001 0x00007f30109e79bc in bthread_cond_timedwait_p (cond=cond@entry=0x7f3010c1ae00 <_ZL5timer>, m=m@entry=0x7f3010c1ae40 <_ZL11timer_mutex>, abstime=abstime@entry=0x7f300cfece20, file=file@entry=0x7f3010a09b4a "watchdog.c", line=line@entry=313) at lockmgr.c:811
0000002 0x00007f30109fedd9 in watchdog_thread (arg=arg@entry=0x0) at watchdog.c:313
0000003 0x00007f30109e751f in lmgr_thread_launcher (x=0x11d1b78) at lockmgr.c:926
0000004 0x00007f300f8ec0a4 in start_thread (arg=0x7f300cfed700) at pthread_create.c:309
0000005 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f300dfef700 (LWP 8573)):
#0 0x00007f300f8f318d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
0000001 0x00007f30109d7724 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:173
0000002 0x00007f30109e748c in check_deadlock () at lockmgr.c:566
0000003 0x00007f300f8ec0a4 in start_thread (arg=0x7f300dfef700) at pthread_create.c:309
0000004 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
#0 0x00007f300f8f318d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
81 in ../sysdeps/unix/syscall-template.S
No locals.
0000001 0x00007f30109d7724 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:173
173 bsys.c: No such file or directory.
timeout = {tv_sec = 30, tv_nsec = 0}
tv = {tv_sec = 0, tv_usec = 0}
tz = {tz_minuteswest = 234813184, tz_dsttime = 32560} status = <optimized out>
0000002 0x00007f30109e748c in check_deadlock () at lockmgr.c:566
566 lockmgr.c: No such file or directory.
__cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {0, 8751603639505271763, 0, 139844425756768, 0, 139844369970944, -8713362633105196077, -8713392050317935661}, __mask_was_saved = 0}}, __pad = {0x7f300dfeef30, 0x0, 0x7f300dfef700, 0x7f300dfef700}} __cancel_routine = 0x7f30109e75a0 <cln_hdl(void*)> __not_first_call = <optimized out> old = 0
0000003 0x00007f300f8ec0a4 in start_thread (arg=0x7f300dfef700) at pthread_create.c:309
309 pthread_create.c: No such file or directory.
__res = <optimized out>
pd = 0x7f300dfef700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139844369970944, 8751603639505271763, 0, 139844425756768, 0, 139844369970944, -8713362633115681837, -8713358921636496429}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
0000004 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
111 ../sysdeps/unix/sysv/linux/x86_64/clone.S: No such file or directory.
No locals.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
Steps To Reproducestart a new backup or migration job. no matter if one is already running, we have seen it crashing even if it was totally idle before.

start the same job again after a restart of the sd and it works perfectly. cannot reproduce it with a specific job or command.
Additional Informationcore file attached
TagsNo tags attached.

Relationships

related to 0000414 closed Bareos storage daemon crashes during backups 
child of 0000554 closed Release bareos-15.2.2 
child of 0000474 closed Release bareos-14.2.6 

Activities

zacha

zacha

2015-09-19 21:06

reporter  

bareos-sd.core.rar (447,423 bytes)
zacha

zacha

2015-09-23 10:09

reporter   ~0001849

Have to add that this seems to happen in the middle of a job too, not just upon starting one. I have several core files if they might be interesting. This is really annoying as it not only leads nearly no large backup completing, furthermore in some conditions all the schedule stops when the sd stops responding (hanging in a deadlock).
stephand

stephand

2015-10-26 09:39

developer   ~0001890

I've seen this traceback on RHEL 6 with 14.2.5, it was also on large full backup run that has been running several hours before the crash:

Created /var/lib/bareos/bareos-sd.core.16911 for doing postmortem debugging
Missing separate debuginfo for
Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/e7/52c57e2bd5883e5ce1211b21fc5859b4520d90
[New Thread 16912]
[New Thread 16914]
[New Thread 16008]
[New Thread 27562]
[New Thread 18120]
[New Thread 16911]
[Thread debugging using libthread_db enabled]
Core was generated by `/usr/sbin/bareos-sd -g bareos -c /etc/bareos/bareos-sd.conf'.
#0 0x0000003b1440ef3d in nanosleep () from /lib64/libpthread.so.0
$1 = '\000' <repeats 127 times>
$2 = 0x1df1078 "bareos-sd"
$3 = 0x1df10b8 "/usr/sbin/bareos-sd"
$4 = 0x0
$5 = 0x3f4a45578b "14.2.5 (27 May 2015)"
$6 = 0x3f4a4557a0 "x86_64-redhat-linux-gnu"
$7 = 0x3f4a4557b8 "redhat"
$8 = 0x3f4a455d48 "Red Hat Enterprise Linux Server release 6.5 (Santiago)"
...
Thread 4 (Thread 0x7ff05abfd700 (LWP 27562)):
#0 0x0000003b1440f2ad in waitpid () from /lib64/libpthread.so.0
0000001 0x0000003f4a44536f in signal_handler (sig=11) at signal.c:240
0000002 <signal handler called>
0000003 0x0000003b1412835a in __strcmp_sse42 () from /lib64/libc.so.6
0000004 0x0000003f4a42f784 in dlist::binary_search (this=0x1df1308, item=0x7ff05abfc220, compare=0x3f4ac34020 <compare_by_volumename(void*, void*)>) at dlist.c:247
0000005 0x0000003f4ac350c3 in find_volume (this=0x7ff050001f28) at vol_mgr.c:650
0000006 DCR::can_i_use_volume (this=0x7ff050001f28) at vol_mgr.c:836
0000007 0x0000003f4ac355d8 in DCR::can_i_write_volume (this=0x7ff050001f28) at vol_mgr.c:821
0000008 0x000000000040af3a in SD_DCR::dir_find_next_appendable_volume (this=0x7ff050001f28) at askdir.c:298
0000009 0x0000003f4ac25289 in DCR::find_a_volume (this=0x7ff050001f28) at mount.c:347
0000010 0x0000003f4ac2551e in DCR::mount_next_write_volume (this=0x7ff050001f28) at mount.c:115
0000011 0x0000003f4ac1ee56 in fixup_device_block_write_error (dcr=0x7ff050001f28, retries=4) at device.c:118
0000012 0x0000003f4ac191b9 in DCR::write_block_to_device (this=0x7ff050001f28) at block.c:408
0000013 0x0000003f4ac285a1 in DCR::write_record (this=0x7ff050001f28) at record.c:470
0000014 0x0000000000408e4e in do_append_data (jcr=0x7ff050001078, bs=0x1e1c7d8, what=0x42022e "FD") at append.c:222
0000015 0x00000000004114f3 in append_data_cmd (jcr=0x7ff050001078) at fd_cmds.c:269
0000016 0x0000000000410cc9 in do_fd_commands (jcr=0x7ff050001078) at fd_cmds.c:225
0000017 0x0000000000411670 in run_job (jcr=0x7ff050001078) at fd_cmds.c:181
#18 0x0000000000412787 in do_job_run (jcr=0x7ff050001078) at job.c:237
#19 0x00000000004108e7 in handle_director_connection (arg=0x1e19418) at dir_cmd.c:287
0000020 handle_connection_request (arg=0x1e19418) at dir_cmd.c:368
0000021 0x0000003f4a44f62d in workq_server (arg=0x6278c0) at workq.c:335
0000022 0x0000003f4a4375ed in lmgr_thread_launcher (x=0x1e18c98) at lockmgr.c:932
#23 0x0000003b144079d1 in start_thread () from /lib64/libpthread.so.0
0000024 0x0000003b140e88fd in clone () from /lib64/libc.so.6
maik

maik

2015-11-06 17:04

administrator   ~0001931

Stephan could reproduce on master.
mvwieringen

mvwieringen

2015-11-13 17:27

developer   ~0001961

Fix committed to bareos bareos-15.2 branch with changesetid 5766.
mvwieringen

mvwieringen

2015-11-17 12:01

developer   ~0001976

Fix committed to bareos bareos-14.2 branch with changesetid 5818.

Related Changesets

bareos: bareos-15.2 0b6435d7

2015-11-12 18:39

pstorz


Committer: mvwieringen

Ported: N/A

Details Diff
Fix random crashes on sd

The block variable was set to the dcr->block, but that can be altered in
the call to dev->set_label_blocksize(dcr).

When that happens, the code goes on with the wrong block.
We removed the whole local variable as it makes no sense and is only
referenced 3 times when calling empty_block()

Fixes 0000414: Bareos storage daemon crashes during backups
Fixse 0000483: bareos-sd crash during backup
Fixes 0000522: storage daemon crashes ocassionally when starting a new job
Fixes 0000552: SD crashes in -current

Signed-off-by: Marco van Wieringen <marco.van.wieringen@bareos.com>
Affected Issues
0000414, 0000483, 0000522, 0000552, 0000564
mod - src/stored/label.c Diff File

bareos: bareos-14.2 3a09212c

2015-11-12 18:39

pstorz


Committer: mvwieringen

Ported: N/A

Details Diff
Fix random crashes on sd

The block variable was set to the dcr->block, but that can be altered in
the call to dev->set_label_blocksize(dcr).

When that happens, the code goes on with the wrong block.
We removed the whole local variable as it makes no sense and is only
referenced 3 times when calling empty_block()

Fixes 0000414: Bareos storage daemon crashes during backups
Fixse 0000483: bareos-sd crash during backup
Fixes 0000522: storage daemon crashes ocassionally when starting a new job
Fixes 0000552: SD crashes in -current

Signed-off-by: Marco van Wieringen <marco.van.wieringen@bareos.com>
Affected Issues
0000414, 0000522, 0000552
mod - src/stored/label.c Diff File

Issue History

Date Modified Username Field Change
2015-09-19 21:06 zacha New Issue
2015-09-19 21:06 zacha File Added: bareos-sd.core.rar
2015-09-23 10:09 zacha Note Added: 0001849
2015-10-26 09:39 stephand Note Added: 0001890
2015-11-06 17:04 maik Note Added: 0001931
2015-11-06 17:04 maik Assigned To => stephand
2015-11-06 17:04 maik Status new => confirmed
2015-11-06 18:30 maik Relationship added child of 0000554
2015-11-13 10:19 stephand Relationship added related to 0000414
2015-11-13 17:27 mvwieringen Changeset attached => bareos bareos-15.2 0b6435d7
2015-11-13 17:27 mvwieringen Note Added: 0001961
2015-11-13 17:27 mvwieringen Status confirmed => resolved
2015-11-13 17:27 mvwieringen Resolution open => fixed
2015-11-17 12:01 mvwieringen Changeset attached => bareos bareos-14.2 3a09212c
2015-11-17 12:01 mvwieringen Note Added: 0001976
2015-11-30 18:45 joergs Relationship added child of 0000474
2019-12-18 15:25 arogge Status resolved => closed