View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000276 | bareos-core | storage daemon | public | 2014-02-11 10:52 | 2015-03-25 19:19 |
Reporter | mvwieringen | Assigned To | |
Priority | normal | Severity | major | Reproducibility | random |
Status | closed | Resolution | fixed |
Platform | Linux | OS | any | OS Version | 3 |
Product Version | 13.2.2 |
Summary | 0000276: SD to SD replication makes SD crash |
Description | We are having issues with Copy jobs from one SD (primary) to a backup-site SD (backup). Each night, after the primary backups are done, the jobs are copied with a Copy job to another SD running off site. Randomly, the SD segfaults without notice during the process; it can happen on the first jobs or the last, and it can also run for some days with no crash. I compiled the latest Git version (January 30) and the problem is still there. I ran the SD under GDB and it crashed this weekend; the output is attached below. I'm not very good at debugging, but it seems jcr->store_bsock is not initialized (sd = address 0x0) and a method of the class is called on it (sd->set_jcr). This is causing us a lot of problems, since I have to restart the daemon each time and transfer the jobs again by hand. |
Additional Information |

```
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff32d0700 (LWP 31752)]
0x0000000000429fea in BSOCK::set_jcr (this=0x0, jcr=0x6a2728) at ../lib/bsock.h:133
133        void set_jcr(JCR *jcr) { m_jcr = jcr; };
(gdb) bt
#0  0x0000000000429fea in BSOCK::set_jcr (this=0x0, jcr=0x6a2728) at ../lib/bsock.h:133
#1  0x0000000000441d86 in do_sd_commands (jcr=0x6a2728) at sd_cmds.c:156
#2  0x00000000004424d5 in do_listen_run (jcr=0x6a2728) at sd_cmds.c:271
#3  0x000000000042951d in listen_cmd (jcr=0x6a2728) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x69d578, job_name=0x7ffff32cfad0 "BackupFreeswitchMysql.2014-02-06_10.42.35_01") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x69d578) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x698e88) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) thread apply all bt

Thread 711 (Thread 0x7ffff12cc700 (LWP 31759)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x71da78, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff12cb980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x71d748) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x71d748) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x71bce8, job_name=0x7ffff12cbad0 "") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x71bce8) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x70bba8) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 710 (Thread 0x7ffff42d2700 (LWP 31756)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x6ab0b8, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff42d1980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x6aad88) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x6aad88) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x6a9328, job_name=0x7ffff42d1ad0 "") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x6a9328) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x6a8008) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 709 (Thread 0x7ffff2acf700 (LWP 31755)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x6a7928, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff2ace980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x6a75f8) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x6a75f8) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x6a5cc8, job_name=0x7ffff2acead0 "BackupCanisMinorMysql.2014-02-06_10.42.19_09") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x6a5cc8) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x6a36e8) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 708 (Thread 0x7ffff1acd700 (LWP 31754)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x6a4f48, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff1acc980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x6a4c18) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x6a4c18) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x69f008, job_name=0x7ffff1accad0 "BackupBaseFilesGalileo.2014-02-06_10.42.20_15") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x69f008) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x6a35f8) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 707 (Thread 0x7ffff22ce700 (LWP 31753)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x69c368, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff22cd980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x69c038) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x69c038) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x6a1368, job_name=0x7ffff22cdad0 "BackupCatalog.2014-02-07_08.00.00_26") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x6a1368) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x69ae78) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 706 (Thread 0x7ffff32d0700 (LWP 31752)):
#0  0x0000000000429fea in BSOCK::set_jcr (this=0x0, jcr=0x6a2728) at ../lib/bsock.h:133
#1  0x0000000000441d86 in do_sd_commands (jcr=0x6a2728) at sd_cmds.c:156
#2  0x00000000004424d5 in do_listen_run (jcr=0x6a2728) at sd_cmds.c:271
#3  0x000000000042951d in listen_cmd (jcr=0x6a2728) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x69d578, job_name=0x7ffff32cfad0 "BackupFreeswitchMysql.2014-02-06_10.42.35_01") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x69d578) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x698e88) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7ffff3ad1700 (LWP 31760)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x7ffff7926800 <_ZL5timer>, m=0x7ffff79267c0 <_ZL11timer_mutex>, abstime=0x7ffff3ad0d80, file=0x7ffff7717bea "watchdog.c", line=313) at lockmgr.c:817
#2  0x00007ffff770d524 in watchdog_thread (arg=0x0) at watchdog.c:313
#3  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x698e88) at lockmgr.c:932
#4  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#5  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#6  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7ffff4ad3700 (LWP 31758)):
#0  0x00007ffff687811d in nanosleep () from /lib/libpthread.so.0
#1  0x00007ffff76d93cc in bmicrosleep (sec=30, usec=0) at bsys.c:100
#2  0x00007ffff76ec9b1 in check_deadlock () at lockmgr.c:567
#3  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#4  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7fe8720 (LWP 31757)):
#0  0x00007ffff57fda33 in poll () from /lib/libc.so.6
#1  0x00007ffff76ca41c in bnet_thread_server_tcp (addr_list=0x6810f8, max_clients=41, sockfds=0x69aba8, client_wq=0x671320 <dird_workq>, nokeepalive=false, handle_client_request=0x4253bc <handle_connection_request(void*)>) at bnet_server_tcp.c:298
#2  0x000000000044c80d in main (argc=0, argv=0x7fffffffe7c8) at stored.c:306
```
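To make the failure mode concrete: frame #0 shows BSOCK::set_jcr being entered with this=0x0, i.e. jcr->store_bsock was still a null pointer when do_sd_commands() dereferenced it. Below is a minimal, self-contained mock of that situation; BSOCK, JCR and do_sd_commands here are simplified stand-ins for illustration, not the real definitions from src/lib/bsock.h and src/stored/sd_cmds.c.

```cpp
// Minimal standalone mock of frames #0/#1 above. BSOCK, JCR and
// do_sd_commands are simplified stand-ins, not the real Bareos sources.
#include <cstdio>

struct JCR;

struct BSOCK {
   JCR *m_jcr = nullptr;
   // Counterpart of bsock.h:133. A plain non-virtual member function, so
   // the call itself goes through even when 'this' is null; the store to
   // m_jcr is what faults, matching frame #0 (this=0x0).
   void set_jcr(JCR *jcr) { m_jcr = jcr; }
};

struct JCR {
   BSOCK *store_bsock = nullptr;  // stays null when no remote SD connected
};

static bool do_sd_commands(JCR *jcr)
{
   BSOCK *sd = jcr->store_bsock;
   if (!sd) {                     // the missing guard: sd == 0x0 in the trace
      std::fprintf(stderr, "no remote SD connected, refusing to continue\n");
      return false;
   }
   sd->set_jcr(jcr);              // pre-fix code reached this with sd == 0x0
   return true;
}

int main()
{
   JCR jcr;                       // store_bsock never set: models a copy job
   return do_sd_commands(&jcr) ? 0 : 1;  // whose remote SD never connected
}
```

Because set_jcr is non-virtual, nothing faults until the write through the null object pointer, which is why the segfault lands exactly on bsock.h:133.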
Tags | No tags attached. |
A first analysis shows that it fails in do_listen_run(). The problem is that we wait up to 30 minutes for the SD-SD replication job to start, but the total migration/copy run may take longer than 30 minutes to finish all jobs. The current code doesn't take this into consideration and doesn't check whether a remote SD actually connected to do the replication. A first quick stopgap is to add a check that there really is an authenticated storage daemon connection (see the sketch and the attached patch below). |
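For context, the blocked threads in the backtrace are all sitting in a wait of this general shape: a condition wait against an absolute deadline roughly 30 minutes out. The sketch below shows only the pattern; wait_for_remote_sd and the flag names are made up for illustration and are not the actual code behind sd_cmds.c:243.

```cpp
// Illustrative sketch only -- not the Bareos sources. This is the shape of
// the wait seen at sd_cmds.c:243 in the trace: a condition wait against an
// absolute deadline about 30 minutes out.
#include <pthread.h>
#include <ctime>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool authenticated = false;   // set by the connect/auth path

// Wait up to 30 minutes for the replicating SD to connect and
// authenticate; report whether it actually did.
static bool wait_for_remote_sd()
{
   struct timespec deadline;
   clock_gettime(CLOCK_REALTIME, &deadline);
   deadline.tv_sec += 30 * 60;       // the 30-minute limit in question

   pthread_mutex_lock(&mutex);
   while (!authenticated) {
      // Nonzero return means ETIMEDOUT: nobody connected in time.
      if (pthread_cond_timedwait(&cond, &mutex, &deadline) != 0) {
         break;
      }
   }
   bool ok = authenticated;
   pthread_mutex_unlock(&mutex);
   return ok;                        // caller must bail out when false
}
```

When the whole copy run outlives that deadline, the wait for a still-queued job times out with authenticated still false and store_bsock never set; before the patch below, execution nevertheless fell through towards do_sd_commands().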
0001-Don-t-continue-when-sd-replication-socket-is-not-con.patch (1,031 bytes)
```diff
From c90ddfd3cfe7afe959f94d22d476d4a8b3d44504 Mon Sep 17 00:00:00 2001
From: Marco van Wieringen <marco.van.wieringen@bareos.com>
Date: Tue, 11 Feb 2014 09:49:56 +0100
Subject: [PATCH] Don't continue when sd replication socket is not connected.

---
 src/stored/sd_cmds.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/src/stored/sd_cmds.c b/src/stored/sd_cmds.c
index 26a0fe6..7995ff4 100644
--- a/src/stored/sd_cmds.c
+++ b/src/stored/sd_cmds.c
@@ -249,7 +249,15 @@ bool do_listen_run(JCR *jcr)
    Dmsg3(50, "Auth=%d canceled=%d errstat=%d\n",
          jcr->authenticated, job_canceled(jcr), errstat);
    V(mutex);
-   Dmsg2(800, "Auth fail or cancel for jid=%d %p\n", jcr->JobId, jcr);
+
+   if (!jcr->authenticated || !jcr->store_bsock) {
+      Dmsg2(800, "Auth fail or cancel for jid=%d %p\n", jcr->JobId, jcr);
+
+      /*
+       * After a listen cmd we are done e.g. return false.
+       */
+      return false;
+   }
 
    Dmsg1(120, "Start run Job=%s\n", jcr->Job);

-- 
1.7.3.2
```
Fix committed to bareos master branch with changesetid 1489. | |
The problem seems to be solved, but now the SD jobs hang while the DIR aborts the long-running jobs. |
Fix committed to bareos bareos-13.2 branch with changesetid 1536. | |
Never got feedback; assuming that it works now. No more problems reported. |
Fix committed to bareos2015 bareos-14.2 branch with changesetid 4915. | |
Due to the reimport of the GitHub repository to bugs.bareos.org, the status of some tickets has been changed. These tickets will be closed again. Sorry for the noise. |
bareos: master cf07f93d 2014-02-11 09:49 Ported: N/A |
SD to SD replication makes SD crash

Problem is that we wait up to 30 minutes for the SD-SD replication job to start, but the total migration/copy run may take longer than 30 minutes to finish all jobs. The current code doesn't take this into consideration and doesn't check whether a remote SD actually connected to do the replication.

So the fix is to not continue when the SD replication socket is not connected, and also to remove the timeout on starting an SD-SD replication session. The normal FD-SD connection is protected with a timeout so we don't hang when an FD never connects. As we support canceling a storage job from the director, and the director should cleanly cancel the storage job anyway when the copy or migration job fails, this should be no problem. In a normal backup/restore three daemons are involved (director, storage daemon and file daemon), but with migration the director controls everything, i.e. at most two storage daemons.

Fixes 0000276: SD to SD replication makes SD crash
Affected Issues 0000276 |
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c
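The commit message above describes the longer-term design: drop the hard deadline and instead rely on the director to cancel a copy/migration job that fails. A hedged sketch of such a cancellation-aware wait follows; JCRStub, wait_for_remote_sd and the per-minute tick are placeholders for illustration, not the real Bareos API.

```cpp
// Sketch of the post-fix behaviour: no hard deadline. Wake periodically
// and let a director-driven cancel end the wait instead. All identifiers
// are placeholders, not the real Bareos code.
#include <pthread.h>
#include <ctime>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

struct JCRStub {                 // stand-in for the real JCR
   bool authenticated = false;   // set once the remote SD authenticated
   bool canceled = false;        // set when the director cancels the job
   void *store_bsock = nullptr;  // the replication socket, if any
};

// Block until the remote SD is authenticated or the job is canceled;
// poll the cancel flag once a minute instead of ever giving up.
static bool wait_for_remote_sd(JCRStub *jcr)
{
   pthread_mutex_lock(&mutex);
   while (!jcr->authenticated && !jcr->canceled) {
      struct timespec tick;
      clock_gettime(CLOCK_REALTIME, &tick);
      tick.tv_sec += 60;         // short tick, not a job deadline
      pthread_cond_timedwait(&cond, &mutex, &tick);
   }
   // Only proceed with a usable, authenticated replication socket.
   bool ok = jcr->authenticated && jcr->store_bsock != nullptr;
   pthread_mutex_unlock(&mutex);
   return ok;
}
```

Compared with the timed wait sketched earlier, the only terminating conditions here are a successful authentication or a director-driven cancel, so a copy run that takes hours no longer strands the listening job.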
bareos: bareos-13.2 99c7b1ed 2014-02-11 09:49 Ported: N/A |
SD to SD replication makes SD crash (same commit message as the master changeset above)
Affected Issues 0000276 |
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c
bareos2015: bareos-13.2 af148891 2014-02-11 10:49 Ported: N/A |
SD to SD replication makes SD crash (same commit message as the master changeset above)
Affected Issues 0000276 |
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c
bareos2015: bareos-14.2 f1a80357 2014-02-11 10:49 Ported: N/A |
SD to SD replication makes SD crash (same commit message as the master changeset above)
Affected Issues 0000276 |
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c
Date Modified | Username | Field | Change |
---|---|---|---|
2014-02-11 10:52 | mvwieringen | New Issue | |
2014-02-11 10:52 | mvwieringen | Status | new => assigned |
2014-02-11 10:52 | mvwieringen | Assigned To | => mvwieringen |
2014-02-11 11:01 | mvwieringen | Note Added: 0000809 | |
2014-02-11 11:01 | mvwieringen | File Added: 0001-Don-t-continue-when-sd-replication-socket-is-not-con.patch | |
2014-02-11 16:59 | mvwieringen | Changeset attached | => bareos master cf07f93d |
2014-02-11 16:59 | mvwieringen | Note Added: 0000812 | |
2014-02-11 16:59 | mvwieringen | Status | assigned => resolved |
2014-02-11 16:59 | mvwieringen | Resolution | open => fixed |
2014-02-12 19:48 | mvwieringen | Note Added: 0000815 | |
2014-02-12 19:48 | mvwieringen | Status | resolved => feedback |
2014-02-12 19:48 | mvwieringen | Resolution | fixed => reopened |
2014-02-25 17:21 | mvwieringen | Changeset attached | => bareos bareos-13.2 99c7b1ed |
2014-02-25 17:21 | mvwieringen | Note Added: 0000823 | |
2014-02-25 17:21 | mvwieringen | Status | feedback => assigned |
2014-02-25 17:21 | mvwieringen | Status | assigned => resolved |
2014-02-28 18:49 | mvwieringen | Status | resolved => feedback |
2014-05-16 17:30 | | Assigned To | mvwieringen => |
2014-09-05 16:49 | maik | Note Added: 0000970 | |
2014-09-05 16:49 | maik | Status | feedback => closed |
2014-09-05 16:49 | maik | Assigned To | => maik |
2014-09-05 16:49 | maik | Resolution | reopened => no change required |
2015-01-26 16:23 | | Assigned To | maik => |
2015-03-25 16:51 | mvwieringen | Changeset attached | => bareos2015 bareos-13.2 af148891 |
2015-03-25 16:51 | mvwieringen | Changeset attached | => bareos2015 bareos-14.2 f1a80357 |
2015-03-25 16:51 | mvwieringen | Note Added: 0001412 | |
2015-03-25 16:51 | mvwieringen | Status | closed => resolved |
2015-03-25 16:51 | mvwieringen | Resolution | no change required => fixed |
2015-03-25 19:19 | joergs | Note Added: 0001565 | |
2015-03-25 19:19 | joergs | Status | resolved => closed |