View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000276 | bareos-core | storage daemon | public | 2014-02-11 10:52 | 2015-03-25 19:19 |
Reporter | mvwieringen | Assigned To | |
Priority | normal | Severity | major | Reproducibility | random |
Status | closed | Resolution | fixed |
Platform | Linux | OS | any | OS Version | 3 |
Product Version | 13.2.2 |
Summary | 0000276: SD to SD replication makes SD crash |
Description | We are having issues with Copy jobs from one SD (primary) to a backup-site SD (backup). Each night, after the primary backups are done, the jobs are copied with a Copy job to another SD running off site. Randomly, the SD segfaults without notice during the process; it can happen on the first jobs or the last, and it can also run for some days with no crash. I compiled the latest Git version (January 30) and the problem is still there. I ran the SD under GDB and it crashed this weekend; the output is attached below. I'm not very good at debugging, but it seems jcr->store_bsock is not initialized (sd = address 0x0) and a method of the class is called on it (sd->set_jcr). This is causing us a lot of problems, since I have to restart the daemon each time and transfer the jobs again by hand. |
Additional Information |

```
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff32d0700 (LWP 31752)]
0x0000000000429fea in BSOCK::set_jcr (this=0x0, jcr=0x6a2728) at ../lib/bsock.h:133
133        void set_jcr(JCR *jcr) { m_jcr = jcr; };
(gdb) bt
#0  0x0000000000429fea in BSOCK::set_jcr (this=0x0, jcr=0x6a2728) at ../lib/bsock.h:133
#1  0x0000000000441d86 in do_sd_commands (jcr=0x6a2728) at sd_cmds.c:156
#2  0x00000000004424d5 in do_listen_run (jcr=0x6a2728) at sd_cmds.c:271
#3  0x000000000042951d in listen_cmd (jcr=0x6a2728) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x69d578, job_name=0x7ffff32cfad0 "BackupFreeswitchMysql.2014-02-06_10.42.35_01") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x69d578) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x698e88) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) thread apply all bt

Thread 711 (Thread 0x7ffff12cc700 (LWP 31759)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x71da78, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff12cb980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x71d748) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x71d748) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x71bce8, job_name=0x7ffff12cbad0 "") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x71bce8) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x70bba8) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 710 (Thread 0x7ffff42d2700 (LWP 31756)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x6ab0b8, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff42d1980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x6aad88) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x6aad88) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x6a9328, job_name=0x7ffff42d1ad0 "") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x6a9328) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x6a8008) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 709 (Thread 0x7ffff2acf700 (LWP 31755)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x6a7928, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff2ace980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x6a75f8) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x6a75f8) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x6a5cc8, job_name=0x7ffff2acead0 "BackupCanisMinorMysql.2014-02-06_10.42.19_09") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x6a5cc8) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x6a36e8) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 708 (Thread 0x7ffff1acd700 (LWP 31754)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x6a4f48, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff1acc980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x6a4c18) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x6a4c18) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x69f008, job_name=0x7ffff1accad0 "BackupBaseFilesGalileo.2014-02-06_10.42.20_15") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x69f008) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x6a35f8) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 707 (Thread 0x7ffff22ce700 (LWP 31753)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x69c368, m=0x670fa0 <_ZL5mutex>, abstime=0x7ffff22cd980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2  0x0000000000442210 in do_listen_run (jcr=0x69c038) at sd_cmds.c:243
#3  0x000000000042951d in listen_cmd (jcr=0x69c038) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x6a1368, job_name=0x7ffff22cdad0 "BackupCatalog.2014-02-07_08.00.00_26") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x6a1368) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x69ae78) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 706 (Thread 0x7ffff32d0700 (LWP 31752)):
#0  0x0000000000429fea in BSOCK::set_jcr (this=0x0, jcr=0x6a2728) at ../lib/bsock.h:133
#1  0x0000000000441d86 in do_sd_commands (jcr=0x6a2728) at sd_cmds.c:156
#2  0x00000000004424d5 in do_listen_run (jcr=0x6a2728) at sd_cmds.c:271
#3  0x000000000042951d in listen_cmd (jcr=0x6a2728) at dir_cmd.c:1501
#4  0x0000000000425251 in handle_director_connection (dir=0x69d578, job_name=0x7ffff32cfad0 "BackupFreeswitchMysql.2014-02-06_10.42.35_01") at dir_cmd.c:273
#5  0x000000000042572c in handle_connection_request (arg=0x69d578) at dir_cmd.c:356
#6  0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x698e88) at lockmgr.c:932
#8  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7ffff3ad1700 (LWP 31760)):
#0  0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x7ffff7926800 <_ZL5timer>, m=0x7ffff79267c0 <_ZL11timer_mutex>, abstime=0x7ffff3ad0d80, file=0x7ffff7717bea "watchdog.c", line=313) at lockmgr.c:817
#2  0x00007ffff770d524 in watchdog_thread (arg=0x0) at watchdog.c:313
#3  0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x698e88) at lockmgr.c:932
#4  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#5  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#6  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7ffff4ad3700 (LWP 31758)):
#0  0x00007ffff687811d in nanosleep () from /lib/libpthread.so.0
#1  0x00007ffff76d93cc in bmicrosleep (sec=30, usec=0) at bsys.c:100
#2  0x00007ffff76ec9b1 in check_deadlock () at lockmgr.c:567
#3  0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#4  0x00007ffff580a1cd in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7fe8720 (LWP 31757)):
#0  0x00007ffff57fda33 in poll () from /lib/libc.so.6
#1  0x00007ffff76ca41c in bnet_thread_server_tcp (addr_list=0x6810f8, max_clients=41, sockfds=0x69aba8, client_wq=0x671320 <dird_workq>, nokeepalive=false, handle_client_request=0x4253bc <handle_connection_request(void*)>) at bnet_server_tcp.c:298
#2  0x000000000044c80d in main (argc=0, argv=0x7fffffffe7c8) at stored.c:306
```
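To make the failure mode concrete: frame #0 shows BSOCK::set_jcr being entered with this=0x0, i.e. jcr->store_bsock was still a null pointer when do_sd_commands() dereferenced it. Below is a minimal, self-contained mock of that situation; BSOCK, JCR and do_sd_commands here are simplified stand-ins for illustration, not the real definitions from src/lib/bsock.h and src/stored/sd_cmds.c.

```cpp
// Minimal standalone mock of frames #0/#1 above. BSOCK, JCR and
// do_sd_commands are simplified stand-ins, not the real Bareos sources.
#include <cstdio>

struct JCR;

struct BSOCK {
   JCR *m_jcr = nullptr;
   // Counterpart of bsock.h:133. A plain non-virtual member function, so
   // the call itself goes through even when 'this' is null; the store to
   // m_jcr is what faults, matching frame #0 (this=0x0).
   void set_jcr(JCR *jcr) { m_jcr = jcr; }
};

struct JCR {
   BSOCK *store_bsock = nullptr;  // stays null when no remote SD connected
};

static bool do_sd_commands(JCR *jcr)
{
   BSOCK *sd = jcr->store_bsock;
   if (!sd) {                     // the missing guard: sd == 0x0 in the trace
      std::fprintf(stderr, "no remote SD connected, refusing to continue\n");
      return false;
   }
   sd->set_jcr(jcr);              // pre-fix code reached this with sd == 0x0
   return true;
}

int main()
{
   JCR jcr;                       // store_bsock never set: models a copy job
   return do_sd_commands(&jcr) ? 0 : 1;  // whose remote SD never connected
}
```

Because set_jcr is non-virtual, nothing faults until the write through the null object pointer, which is why the segfault lands exactly on bsock.h:133.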
Tags | No tags attached. |
A first analysis shows that it fails in do_listen_run(). The problem is that we wait up to 30 minutes for the SD-SD replication job to start, but the total migration/copy run may take longer than 30 minutes to finish all jobs. The current code doesn't take this into consideration and doesn't check whether a remote SD actually connected to do the replication. A first quick stopgap is to add a check that there really is an authenticated storage daemon connection (see the sketch and the attached patch below). |
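For context, the blocked threads in the backtrace are all sitting in a wait of this general shape: a condition wait against an absolute deadline roughly 30 minutes out. The sketch below shows only the pattern; wait_for_remote_sd and the flag names are made up for illustration and are not the actual code behind sd_cmds.c:243.

```cpp
// Illustrative sketch only -- not the Bareos sources. This is the shape of
// the wait seen at sd_cmds.c:243 in the trace: a condition wait against an
// absolute deadline about 30 minutes out.
#include <pthread.h>
#include <ctime>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool authenticated = false;   // set by the connect/auth path

// Wait up to 30 minutes for the replicating SD to connect and
// authenticate; report whether it actually did.
static bool wait_for_remote_sd()
{
   struct timespec deadline;
   clock_gettime(CLOCK_REALTIME, &deadline);
   deadline.tv_sec += 30 * 60;       // the 30-minute limit in question

   pthread_mutex_lock(&mutex);
   while (!authenticated) {
      // Nonzero return means ETIMEDOUT: nobody connected in time.
      if (pthread_cond_timedwait(&cond, &mutex, &deadline) != 0) {
         break;
      }
   }
   bool ok = authenticated;
   pthread_mutex_unlock(&mutex);
   return ok;                        // caller must bail out when false
}
```

When the whole copy run outlives that deadline, the wait for a still-queued job times out with authenticated still false and store_bsock never set; before the patch below, execution nevertheless fell through towards do_sd_commands().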
0001-Don-t-continue-when-sd-replication-socket-is-not-con.patch (1,031 bytes)
```diff
From c90ddfd3cfe7afe959f94d22d476d4a8b3d44504 Mon Sep 17 00:00:00 2001
From: Marco van Wieringen <marco.van.wieringen@bareos.com>
Date: Tue, 11 Feb 2014 09:49:56 +0100
Subject: [PATCH] Don't continue when sd replication socket is not connected.

---
 src/stored/sd_cmds.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/src/stored/sd_cmds.c b/src/stored/sd_cmds.c
index 26a0fe6..7995ff4 100644
--- a/src/stored/sd_cmds.c
+++ b/src/stored/sd_cmds.c
@@ -249,7 +249,15 @@ bool do_listen_run(JCR *jcr)
    Dmsg3(50, "Auth=%d canceled=%d errstat=%d\n",
          jcr->authenticated, job_canceled(jcr), errstat);
    V(mutex);
-   Dmsg2(800, "Auth fail or cancel for jid=%d %p\n", jcr->JobId, jcr);
+
+   if (!jcr->authenticated || !jcr->store_bsock) {
+      Dmsg2(800, "Auth fail or cancel for jid=%d %p\n", jcr->JobId, jcr);
+
+      /*
+       * After a listen cmd we are done e.g. return false.
+       */
+      return false;
+   }
 
    Dmsg1(120, "Start run Job=%s\n", jcr->Job);

-- 
1.7.3.2
```
Fix committed to bareos master branch with changesetid 1489. | |
The problem seems to be solved, but now the SD jobs hang while the DIR aborts the long-running jobs. |
Fix committed to bareos bareos-13.2 branch with changesetid 1536. | |
Never got feedback; assuming that it works now. No more problems reported. |
Fix committed to bareos2015 bareos-14.2 branch with changesetid 4915. | |
Due to the reimport of the GitHub repository to bugs.bareos.org, the status of some tickets has been changed. These tickets will be closed again. Sorry for the noise. |
bareos: master cf07f93d 2014-02-11 09:49 Ported: N/A |
SD to SD replication makes SD crash

Problem is that we wait up to 30 minutes for the SD-SD replication job to start, but the total migration/copy run may take longer than 30 minutes to finish all jobs. The current code doesn't take this into consideration and doesn't check whether a remote SD actually connected to do the replication.

So the fix is to not continue when the SD replication socket is not connected, and also to remove the timeout on starting an SD-SD replication session. The normal FD-SD connection is protected with a timeout so we don't hang when an FD never connects. As we support canceling a storage job from the director, and the director should cleanly cancel the storage job anyway when the copy or migration job fails, this should be no problem. In a normal backup/restore three daemons are involved (director, storage daemon and file daemon), but with migration the director controls everything, i.e. at most two storage daemons.

Fixes 0000276: SD to SD replication makes SD crash
Affected Issues 0000276 |
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c
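The commit message above describes the longer-term design: drop the hard deadline and instead rely on the director to cancel a copy/migration job that fails. A hedged sketch of such a cancellation-aware wait follows; JCRStub, wait_for_remote_sd and the per-minute tick are placeholders for illustration, not the real Bareos API.

```cpp
// Sketch of the post-fix behaviour: no hard deadline. Wake periodically
// and let a director-driven cancel end the wait instead. All identifiers
// are placeholders, not the real Bareos code.
#include <pthread.h>
#include <ctime>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

struct JCRStub {                 // stand-in for the real JCR
   bool authenticated = false;   // set once the remote SD authenticated
   bool canceled = false;        // set when the director cancels the job
   void *store_bsock = nullptr;  // the replication socket, if any
};

// Block until the remote SD is authenticated or the job is canceled;
// poll the cancel flag once a minute instead of ever giving up.
static bool wait_for_remote_sd(JCRStub *jcr)
{
   pthread_mutex_lock(&mutex);
   while (!jcr->authenticated && !jcr->canceled) {
      struct timespec tick;
      clock_gettime(CLOCK_REALTIME, &tick);
      tick.tv_sec += 60;         // short tick, not a job deadline
      pthread_cond_timedwait(&cond, &mutex, &tick);
   }
   // Only proceed with a usable, authenticated replication socket.
   bool ok = jcr->authenticated && jcr->store_bsock != nullptr;
   pthread_mutex_unlock(&mutex);
   return ok;
}
```

Compared with the timed wait sketched earlier, the only terminating conditions here are a successful authentication or a director-driven cancel, so a copy run that takes hours no longer strands the listening job.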
bareos: bareos-13.2 99c7b1ed 2014-02-11 09:49 Ported: N/A |
SD to SD replication makes SD crash (same commit message as the master changeset above)
Affected Issues 0000276 |
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c
bareos2015: bareos-13.2 af148891 2014-02-11 10:49 Ported: N/A |
SD to SD replication makes SD crash (same commit message as the master changeset above)
Affected Issues 0000276 |
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c
bareos2015: bareos-14.2 f1a80357 2014-02-11 10:49 Ported: N/A |
SD to SD replication makes SD crash (same commit message as the master changeset above)
Affected Issues 0000276 |
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c
Date Modified | Username | Field | Change |
---|---|---|---|
2014-02-11 10:52 | mvwieringen | New Issue | |
2014-02-11 10:52 | mvwieringen | Status | new => assigned |
2014-02-11 10:52 | mvwieringen | Assigned To | => mvwieringen |
2014-02-11 11:01 | mvwieringen | Note Added: 0000809 | |
2014-02-11 11:01 | mvwieringen | File Added: 0001-Don-t-continue-when-sd-replication-socket-is-not-con.patch | |
2014-02-11 16:59 | mvwieringen | Changeset attached | => bareos master cf07f93d |
2014-02-11 16:59 | mvwieringen | Note Added: 0000812 | |
2014-02-11 16:59 | mvwieringen | Status | assigned => resolved |
2014-02-11 16:59 | mvwieringen | Resolution | open => fixed |
2014-02-12 19:48 | mvwieringen | Note Added: 0000815 | |
2014-02-12 19:48 | mvwieringen | Status | resolved => feedback |
2014-02-12 19:48 | mvwieringen | Resolution | fixed => reopened |
2014-02-25 17:21 | mvwieringen | Changeset attached | => bareos bareos-13.2 99c7b1ed |
2014-02-25 17:21 | mvwieringen | Note Added: 0000823 | |
2014-02-25 17:21 | mvwieringen | Status | feedback => assigned |
2014-02-25 17:21 | mvwieringen | Status | assigned => resolved |
2014-02-28 18:49 | mvwieringen | Status | resolved => feedback |
2014-05-16 17:30 | | Assigned To | mvwieringen => |
2014-09-05 16:49 | maik | Note Added: 0000970 | |
2014-09-05 16:49 | maik | Status | feedback => closed |
2014-09-05 16:49 | maik | Assigned To | => maik |
2014-09-05 16:49 | maik | Resolution | reopened => no change required |
2015-01-26 16:23 | | Assigned To | maik => |
2015-03-25 16:51 | mvwieringen | Changeset attached | => bareos2015 bareos-13.2 af148891 |
2015-03-25 16:51 | mvwieringen | Changeset attached | => bareos2015 bareos-14.2 f1a80357 |
2015-03-25 16:51 | mvwieringen | Note Added: 0001412 | |
2015-03-25 16:51 | mvwieringen | Status | closed => resolved |
2015-03-25 16:51 | mvwieringen | Resolution | no change required => fixed |
2015-03-25 19:19 | joergs | Note Added: 0001565 | |
2015-03-25 19:19 | joergs | Status | resolved => closed |