View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000522 | bareos-core | storage daemon | public | 2015-09-19 21:06 | 2019-12-18 15:25 |
Reporter | zacha | Assigned To | stephand | ||
Priority | high | Severity | crash | Reproducibility | random |
Status | closed | Resolution | fixed | ||
Platform | Linux | OS | Debian | OS Version | 8 |
Product Version | 14.2.5 | ||||
Summary | 0000522: storage daemon crashes ocassionally when starting a new job | ||||
Description | we have recently upgraded out dir/sd from bacula to bareos 14 and now on the sd keeps crashing. it seems to happen ocassionally when a new job starts. it runs some time with out a problem and then ocassionally crashes. we tried upgrading to 15.x but this did not fix the problem. it looks like following: -either the sd crashes and a traceback is reported -or it hangs in a deadlock not accepting any commands from the dir nor doing anything I don't know if these are both the same error but at least here is a traceback: Created /var/lib/bareos/bareos-sd.core.8572 for doing postmortem debugging [New LWP 8573] [New LWP 8575] [New LWP 13600] [New LWP 8572] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/sbin/bareos-sd'. #0 0x00007f300f8f318d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 81 ../sysdeps/unix/syscall-template.S: No such file or directory. $1 = 0x626700 <my_name> "valerian-sd" $2 = 0x11ad058 "bareos-sd" $3 = 0x11ad098 "/usr/sbin/bareos-sd" $4 = 0x0 $5 = 0x7f3010a055a0 "15.3.0 (24 August 2015)" $6 = 0x7f3010a0558c "x86_64-pc-linux-gnu" $7 = 0x7f3010a05585 "debian" $8 = 0x7f3010a05567 "Debian GNU/Linux 8.0 (jessie)" $9 = "valerian", '\000' <repeats 247 times> $10 = 0x7f3010a05a70 "debian Debian GNU/Linux 8.0 (jessie)" Environment variable "TestName" not defined. #0 0x00007f300f8f318d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 0000001 0x00007f30109d7724 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:173 0000002 0x00007f30109e748c in check_deadlock () at lockmgr.c:566 0000003 0x00007f300f8ec0a4 in start_thread (arg=0x7f300dfef700) at pthread_create.c:309 0000004 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 4 (Thread 0x7f301150f740 (LWP 8572)): #0 0x00007f300e7e250d in poll () at ../sysdeps/unix/syscall-template.S:81 0000001 0x00007f30109ccce3 in bnet_thread_server_tcp (addr_list=<optimized out>, max_clients=<optimized out>, sockfds=<optimized out>, client_wq=<optimized out>, nokeepalive=<optimized out>, handle_client_request=<optimized out>) at bnet_server_tcp.c:298 0000002 0x0000000000408c0b in main (argc=18546824, argv=0x1) at stored.c:325 Thread 3 (Thread 0x7f300d7ee700 (LWP 13600)): #0 0x00007f300f8f34c9 in __libc_waitpid (pid=13601, stat_loc=0x7f300d7ececc, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:40 0000001 0x00007f30109f5d94 in signal_handler (sig=11) at signal.c:240 0000002 <signal handler called> 0000003 __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:30 0000004 0x00007f30109e061d in dlist::binary_search (this=0x11ad2b8, item=0x7f300d7ed930, compare=0x7f3010e62780 <compare_by_volumename(void*, void*)>) at dlist.c:247 0000005 0x00007f3010e646ab in find_volume (VolumeName=<optimized out>) at vol_mgr.c:678 0000006 DCR::can_i_use_volume (this=0x7f3000000e48) at vol_mgr.c:875 0000007 0x000000000040a507 in SD_DCR::dir_find_next_appendable_volume (this=0x7f3000098c58) at askdir.c:303 0000008 0x00007f3010e593c9 in reserve_device (rctx=...) at reserve.c:748 0000009 0x00007f3010e59e82 in search_res_for_device (rctx=...) at reserve.c:576 0000010 0x00007f3010e5a30f in find_suitable_device_for_job (jcr=jcr@entry=0x7f300008f168, rctx=...) at reserve.c:525 0000011 0x00007f3010e5b040 in use_device_cmd (jcr=0x7f300008f168) at reserve.c:301 0000012 use_cmd (jcr=0x7f300008f168) at reserve.c:67 0000013 0x00000000004106ca in handle_director_connection (dir=0x11d0998) at dir_cmd.c:286 0000014 0x0000000000418a67 in handle_connection_request (arg=0x11d0998) at socket_server.c:99 0000015 0x00007f30109ff725 in workq_server (arg=arg@entry=0x626b40 <socket_workq>) at workq.c:335 0000016 0x00007f30109e751f in lmgr_thread_launcher (x=0x11d46f8) at lockmgr.c:926 0000017 0x00007f300f8ec0a4 in start_thread (arg=0x7f300d7ee700) at pthread_create.c:309 #18 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 2 (Thread 0x7f300cfed700 (LWP 8575)): #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238 0000001 0x00007f30109e79bc in bthread_cond_timedwait_p (cond=cond@entry=0x7f3010c1ae00 <_ZL5timer>, m=m@entry=0x7f3010c1ae40 <_ZL11timer_mutex>, abstime=abstime@entry=0x7f300cfece20, file=file@entry=0x7f3010a09b4a "watchdog.c", line=line@entry=313) at lockmgr.c:811 0000002 0x00007f30109fedd9 in watchdog_thread (arg=arg@entry=0x0) at watchdog.c:313 0000003 0x00007f30109e751f in lmgr_thread_launcher (x=0x11d1b78) at lockmgr.c:926 0000004 0x00007f300f8ec0a4 in start_thread (arg=0x7f300cfed700) at pthread_create.c:309 0000005 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 1 (Thread 0x7f300dfef700 (LWP 8573)): #0 0x00007f300f8f318d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 0000001 0x00007f30109d7724 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:173 0000002 0x00007f30109e748c in check_deadlock () at lockmgr.c:566 0000003 0x00007f300f8ec0a4 in start_thread (arg=0x7f300dfef700) at pthread_create.c:309 0000004 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 #0 0x00007f300f8f318d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 81 in ../sysdeps/unix/syscall-template.S No locals. 0000001 0x00007f30109d7724 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:173 173 bsys.c: No such file or directory. timeout = {tv_sec = 30, tv_nsec = 0} tv = {tv_sec = 0, tv_usec = 0} tz = {tz_minuteswest = 234813184, tz_dsttime = 32560} status = <optimized out> 0000002 0x00007f30109e748c in check_deadlock () at lockmgr.c:566 566 lockmgr.c: No such file or directory. __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {0, 8751603639505271763, 0, 139844425756768, 0, 139844369970944, -8713362633105196077, -8713392050317935661}, __mask_was_saved = 0}}, __pad = {0x7f300dfeef30, 0x0, 0x7f300dfef700, 0x7f300dfef700}} __cancel_routine = 0x7f30109e75a0 <cln_hdl(void*)> __not_first_call = <optimized out> old = 0 0000003 0x00007f300f8ec0a4 in start_thread (arg=0x7f300dfef700) at pthread_create.c:309 309 pthread_create.c: No such file or directory. __res = <optimized out> pd = 0x7f300dfef700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139844369970944, 8751603639505271763, 0, 139844425756768, 0, 139844369970944, -8713362633115681837, -8713358921636496429}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread" 0000004 0x00007f300e7eb04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 111 ../sysdeps/unix/sysv/linux/x86_64/clone.S: No such file or directory. No locals. #0 0x0000000000000000 in ?? () No symbol table info available. #0 0x0000000000000000 in ?? () No symbol table info available. #0 0x0000000000000000 in ?? () No symbol table info available. | ||||
Steps To Reproduce | start a new backup or migration job. no matter if one is already running, we have seen it crashing even if it was totally idle before. start the same job again after a restart of the sd and it works perfectly. cannot reproduce it with a specific job or command. | ||||
Additional Information | core file attached | ||||
Tags | No tags attached. | ||||
Have to add that this seems to happen in the middle of a job too, not just upon starting one. I have several core files if they might be interesting. This is really annoying as it not only leads nearly no large backup completing, furthermore in some conditions all the schedule stops when the sd stops responding (hanging in a deadlock). | |
I've seen this traceback on RHEL 6 with 14.2.5, it was also on large full backup run that has been running several hours before the crash: Created /var/lib/bareos/bareos-sd.core.16911 for doing postmortem debugging Missing separate debuginfo for Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/e7/52c57e2bd5883e5ce1211b21fc5859b4520d90 [New Thread 16912] [New Thread 16914] [New Thread 16008] [New Thread 27562] [New Thread 18120] [New Thread 16911] [Thread debugging using libthread_db enabled] Core was generated by `/usr/sbin/bareos-sd -g bareos -c /etc/bareos/bareos-sd.conf'. #0 0x0000003b1440ef3d in nanosleep () from /lib64/libpthread.so.0 $1 = '\000' <repeats 127 times> $2 = 0x1df1078 "bareos-sd" $3 = 0x1df10b8 "/usr/sbin/bareos-sd" $4 = 0x0 $5 = 0x3f4a45578b "14.2.5 (27 May 2015)" $6 = 0x3f4a4557a0 "x86_64-redhat-linux-gnu" $7 = 0x3f4a4557b8 "redhat" $8 = 0x3f4a455d48 "Red Hat Enterprise Linux Server release 6.5 (Santiago)" ... Thread 4 (Thread 0x7ff05abfd700 (LWP 27562)): #0 0x0000003b1440f2ad in waitpid () from /lib64/libpthread.so.0 0000001 0x0000003f4a44536f in signal_handler (sig=11) at signal.c:240 0000002 <signal handler called> 0000003 0x0000003b1412835a in __strcmp_sse42 () from /lib64/libc.so.6 0000004 0x0000003f4a42f784 in dlist::binary_search (this=0x1df1308, item=0x7ff05abfc220, compare=0x3f4ac34020 <compare_by_volumename(void*, void*)>) at dlist.c:247 0000005 0x0000003f4ac350c3 in find_volume (this=0x7ff050001f28) at vol_mgr.c:650 0000006 DCR::can_i_use_volume (this=0x7ff050001f28) at vol_mgr.c:836 0000007 0x0000003f4ac355d8 in DCR::can_i_write_volume (this=0x7ff050001f28) at vol_mgr.c:821 0000008 0x000000000040af3a in SD_DCR::dir_find_next_appendable_volume (this=0x7ff050001f28) at askdir.c:298 0000009 0x0000003f4ac25289 in DCR::find_a_volume (this=0x7ff050001f28) at mount.c:347 0000010 0x0000003f4ac2551e in DCR::mount_next_write_volume (this=0x7ff050001f28) at mount.c:115 0000011 0x0000003f4ac1ee56 in fixup_device_block_write_error (dcr=0x7ff050001f28, retries=4) at device.c:118 0000012 0x0000003f4ac191b9 in DCR::write_block_to_device (this=0x7ff050001f28) at block.c:408 0000013 0x0000003f4ac285a1 in DCR::write_record (this=0x7ff050001f28) at record.c:470 0000014 0x0000000000408e4e in do_append_data (jcr=0x7ff050001078, bs=0x1e1c7d8, what=0x42022e "FD") at append.c:222 0000015 0x00000000004114f3 in append_data_cmd (jcr=0x7ff050001078) at fd_cmds.c:269 0000016 0x0000000000410cc9 in do_fd_commands (jcr=0x7ff050001078) at fd_cmds.c:225 0000017 0x0000000000411670 in run_job (jcr=0x7ff050001078) at fd_cmds.c:181 #18 0x0000000000412787 in do_job_run (jcr=0x7ff050001078) at job.c:237 #19 0x00000000004108e7 in handle_director_connection (arg=0x1e19418) at dir_cmd.c:287 0000020 handle_connection_request (arg=0x1e19418) at dir_cmd.c:368 0000021 0x0000003f4a44f62d in workq_server (arg=0x6278c0) at workq.c:335 0000022 0x0000003f4a4375ed in lmgr_thread_launcher (x=0x1e18c98) at lockmgr.c:932 #23 0x0000003b144079d1 in start_thread () from /lib64/libpthread.so.0 0000024 0x0000003b140e88fd in clone () from /lib64/libc.so.6 |
|
Stephan could reproduce on master. | |
Fix committed to bareos bareos-15.2 branch with changesetid 5766. | |
Fix committed to bareos bareos-14.2 branch with changesetid 5818. | |
bareos: bareos-15.2 0b6435d7 2015-11-12 18:39 Committer: mvwieringen Ported: N/A Details Diff |
Fix random crashes on sd The block variable was set to the dcr->block, but that can be altered in the call to dev->set_label_blocksize(dcr). When that happens, the code goes on with the wrong block. We removed the whole local variable as it makes no sense and is only referenced 3 times when calling empty_block() Fixes 0000414: Bareos storage daemon crashes during backups Fixse 0000483: bareos-sd crash during backup Fixes 0000522: storage daemon crashes ocassionally when starting a new job Fixes 0000552: SD crashes in -current Signed-off-by: Marco van Wieringen <marco.van.wieringen@bareos.com> |
Affected Issues 0000414, 0000483, 0000522, 0000552, 0000564 |
|
mod - src/stored/label.c | Diff File | ||
bareos: bareos-14.2 3a09212c 2015-11-12 18:39 Committer: mvwieringen Ported: N/A Details Diff |
Fix random crashes on sd The block variable was set to the dcr->block, but that can be altered in the call to dev->set_label_blocksize(dcr). When that happens, the code goes on with the wrong block. We removed the whole local variable as it makes no sense and is only referenced 3 times when calling empty_block() Fixes 0000414: Bareos storage daemon crashes during backups Fixse 0000483: bareos-sd crash during backup Fixes 0000522: storage daemon crashes ocassionally when starting a new job Fixes 0000552: SD crashes in -current Signed-off-by: Marco van Wieringen <marco.van.wieringen@bareos.com> |
Affected Issues 0000414, 0000522, 0000552 |
|
mod - src/stored/label.c | Diff File |
Date Modified | Username | Field | Change |
---|---|---|---|
2015-09-19 21:06 | zacha | New Issue | |
2015-09-19 21:06 | zacha | File Added: bareos-sd.core.rar | |
2015-09-23 10:09 | zacha | Note Added: 0001849 | |
2015-10-26 09:39 | stephand | Note Added: 0001890 | |
2015-11-06 17:04 | maik | Note Added: 0001931 | |
2015-11-06 17:04 | maik | Assigned To | => stephand |
2015-11-06 17:04 | maik | Status | new => confirmed |
2015-11-06 18:30 | maik | Relationship added | child of 0000554 |
2015-11-13 10:19 | stephand | Relationship added | related to 0000414 |
2015-11-13 17:27 | mvwieringen | Changeset attached | => bareos bareos-15.2 0b6435d7 |
2015-11-13 17:27 | mvwieringen | Note Added: 0001961 | |
2015-11-13 17:27 | mvwieringen | Status | confirmed => resolved |
2015-11-13 17:27 | mvwieringen | Resolution | open => fixed |
2015-11-17 12:01 | mvwieringen | Changeset attached | => bareos bareos-14.2 3a09212c |
2015-11-17 12:01 | mvwieringen | Note Added: 0001976 | |
2015-11-30 18:45 | joergs | Relationship added | child of 0000474 |
2019-12-18 15:25 | arogge | Status | resolved => closed |