View Issue Details

ID: 0000276    Project: bareos-core    Category: storage daemon    View Status: public    Last Update: 2015-03-25 19:19
Reporter: mvwieringen    Assigned To:
Priority: normal    Severity: major    Reproducibility: random
Status: closed    Resolution: fixed
Platform: Linux    OS: any    OS Version: 3
Product Version: 13.2.2
Summary: 0000276: SD to SD replication makes SD crash
Description:

We are having issues with Copy jobs from one SD (primary) to a backup site SD (backup).

Each night, after the primary backups are done, the jobs are copied with a Copy job to another SD running off
site. Randomly, the SD segfaults without notice during the process. It can happen in the first jobs or the
last ones, and it can also run for some days with no crash.

I compiled the latest Git version (January 30) and the problem is still there.
I ran the SD under GDB and it crashed this weekend. I have attached the output to this post.

I'm not very good at debugging, but it seems jcr->store_bsock is not initialized (sd = address 0x0) and the
code tries to call a method on it (sd->set_jcr).

It's causing us a lot of problems, since I have to restart the SD each time and transfer the jobs again by hand.
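
[Editor's note] The following is a minimal sketch (hypothetical names, not the Bareos source) of what the
backtrace below shows: do_sd_commands() picks up jcr->store_bsock, which is still NULL because no remote
SD ever connected, and then calls set_jcr() through that null pointer; the fault surfaces on the first
member access.

   struct JCR;                     // stand-in for the Bareos job control record

   class BSOCK {
   public:
      void set_jcr(JCR *jcr) { m_jcr = jcr; }  // faults when 'this' is null
   private:
      JCR *m_jcr = nullptr;
   };

   // Hypothetical guard illustrating the failure mode: without the null
   // check, sd->set_jcr(jcr) is a member call through a null pointer.
   void do_sd_commands_sketch(JCR *jcr, BSOCK *sd)
   {
      if (sd == nullptr) {
         return;                   // no authenticated SD-SD connection
      }
      sd->set_jcr(jcr);
   }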

Additional Information:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff32d0700 (LWP 31752)]
0x0000000000429fea in BSOCK::set_jcr (this=0x0, jcr=0x6a2728) at ../lib/bsock.h:133
133 void set_jcr(JCR *jcr) { m_jcr = jcr; };

(gdb) bt
#0 0x0000000000429fea in BSOCK::set_jcr (this=0x0, jcr=0x6a2728) at ../lib/bsock.h:133
#1 0x0000000000441d86 in do_sd_commands (jcr=0x6a2728) at sd_cmds.c:156
#2 0x00000000004424d5 in do_listen_run (jcr=0x6a2728) at sd_cmds.c:271
#3 0x000000000042951d in listen_cmd (jcr=0x6a2728) at dir_cmd.c:1501
#4 0x0000000000425251 in handle_director_connection (dir=0x69d578, job_name=0x7ffff32cfad0
"BackupFreeswitchMysql.2014-02-06_10.42.35_01") at dir_cmd.c:273
#5 0x000000000042572c in handle_connection_request (arg=0x69d578) at dir_cmd.c:356
#6 0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7 0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x698e88) at lockmgr.c:932
#8 0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9 0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

(gdb) thread apply all bt

Thread 711 (Thread 0x7ffff12cc700 (LWP 31759)):
#0 0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x71da78, m=0x670fa0 <_ZL5mutex>,
abstime=0x7ffff12cb980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2 0x0000000000442210 in do_listen_run (jcr=0x71d748) at sd_cmds.c:243
#3 0x000000000042951d in listen_cmd (jcr=0x71d748) at dir_cmd.c:1501
#4 0x0000000000425251 in handle_director_connection (dir=0x71bce8, job_name=0x7ffff12cbad0 "") at dir_cmd.c:273
#5 0x000000000042572c in handle_connection_request (arg=0x71bce8) at dir_cmd.c:356
#6 0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7 0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x70bba8) at lockmgr.c:932
#8 0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9 0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 710 (Thread 0x7ffff42d2700 (LWP 31756)):
#0 0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x6ab0b8, m=0x670fa0 <_ZL5mutex>,
abstime=0x7ffff42d1980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2 0x0000000000442210 in do_listen_run (jcr=0x6aad88) at sd_cmds.c:243
#3 0x000000000042951d in listen_cmd (jcr=0x6aad88) at dir_cmd.c:1501
#4 0x0000000000425251 in handle_director_connection (dir=0x6a9328, job_name=0x7ffff42d1ad0 "") at dir_cmd.c:273
#5 0x000000000042572c in handle_connection_request (arg=0x6a9328) at dir_cmd.c:356
#6 0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7 0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x6a8008) at lockmgr.c:932
#8 0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9 0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 709 (Thread 0x7ffff2acf700 (LWP 31755)):
#0 0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x6a7928, m=0x670fa0 <_ZL5mutex>,
abstime=0x7ffff2ace980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2 0x0000000000442210 in do_listen_run (jcr=0x6a75f8) at sd_cmds.c:243
#3 0x000000000042951d in listen_cmd (jcr=0x6a75f8) at dir_cmd.c:1501
#4 0x0000000000425251 in handle_director_connection (dir=0x6a5cc8, job_name=0x7ffff2acead0
"BackupCanisMinorMysql.2014-02-06_10.42.19_09") at dir_cmd.c:273
#5 0x000000000042572c in handle_connection_request (arg=0x6a5cc8) at dir_cmd.c:356
#6 0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7 0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x6a36e8) at lockmgr.c:932
#8 0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9 0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 708 (Thread 0x7ffff1acd700 (LWP 31754)):
#0 0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x6a4f48, m=0x670fa0 <_ZL5mutex>,
abstime=0x7ffff1acc980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2 0x0000000000442210 in do_listen_run (jcr=0x6a4c18) at sd_cmds.c:243
#3 0x000000000042951d in listen_cmd (jcr=0x6a4c18) at dir_cmd.c:1501
#4 0x0000000000425251 in handle_director_connection (dir=0x69f008, job_name=0x7ffff1accad0
"BackupBaseFilesGalileo.2014-02-06_10.42.20_15") at dir_cmd.c:273
#5 0x000000000042572c in handle_connection_request (arg=0x69f008) at dir_cmd.c:356
#6 0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7 0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x6a35f8) at lockmgr.c:932
#8 0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9 0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 707 (Thread 0x7ffff22ce700 (LWP 31753)):
#0 0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x69c368, m=0x670fa0 <_ZL5mutex>,
abstime=0x7ffff22cd980, file=0x460fcb "sd_cmds.c", line=243) at lockmgr.c:817
#2 0x0000000000442210 in do_listen_run (jcr=0x69c038) at sd_cmds.c:243
#3 0x000000000042951d in listen_cmd (jcr=0x69c038) at dir_cmd.c:1501
#4 0x0000000000425251 in handle_director_connection (dir=0x6a1368, job_name=0x7ffff22cdad0
"BackupCatalog.2014-02-07_08.00.00_26") at dir_cmd.c:273
#5 0x000000000042572c in handle_connection_request (arg=0x6a1368) at dir_cmd.c:356
#6 0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7 0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x69ae78) at lockmgr.c:932
#8 0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9 0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 706 (Thread 0x7ffff32d0700 (LWP 31752)):
#0 0x0000000000429fea in BSOCK::set_jcr (this=0x0, jcr=0x6a2728) at ../lib/bsock.h:133
#1 0x0000000000441d86 in do_sd_commands (jcr=0x6a2728) at sd_cmds.c:156
#2 0x00000000004424d5 in do_listen_run (jcr=0x6a2728) at sd_cmds.c:271
#3 0x000000000042951d in listen_cmd (jcr=0x6a2728) at dir_cmd.c:1501
#4 0x0000000000425251 in handle_director_connection (dir=0x69d578, job_name=0x7ffff32cfad0
"BackupFreeswitchMysql.2014-02-06_10.42.35_01") at dir_cmd.c:273
#5 0x000000000042572c in handle_connection_request (arg=0x69d578) at dir_cmd.c:356
#6 0x00007ffff770e29a in workq_server (arg=0x671320 <dird_workq>) at workq.c:335
#7 0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x698e88) at lockmgr.c:932
#8 0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#9 0x00007ffff580a1cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7ffff3ad1700 (LWP 31760)):
#0 0x00007ffff6874bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007ffff76ed2df in bthread_cond_timedwait_p (cond=0x7ffff7926800 <_ZL5timer>,
m=0x7ffff79267c0 <_ZL11timer_mutex>, abstime=0x7ffff3ad0d80, file=0x7ffff7717bea
"watchdog.c", line=313)
    at lockmgr.c:817
#2 0x00007ffff770d524 in watchdog_thread (arg=0x0) at watchdog.c:313
#3 0x00007ffff76ed6ea in lmgr_thread_launcher (x=0x698e88) at lockmgr.c:932
#4 0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#5 0x00007ffff580a1cd in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7ffff4ad3700 (LWP 31758)):
#0 0x00007ffff687811d in nanosleep () from /lib/libpthread.so.0
#1 0x00007ffff76d93cc in bmicrosleep (sec=30, usec=0) at bsys.c:100
#2 0x00007ffff76ec9b1 in check_deadlock () at lockmgr.c:567
#3 0x00007ffff686f9ca in start_thread () from /lib/libpthread.so.0
#4 0x00007ffff580a1cd in clone () from /lib/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7fe8720 (LWP 31757)):
#0 0x00007ffff57fda33 in poll () from /lib/libc.so.6
#1 0x00007ffff76ca41c in bnet_thread_server_tcp (addr_list=0x6810f8, max_clients=41,
sockfds=0x69aba8, client_wq=0x671320 <dird_workq>, nokeepalive=false,
    handle_client_request=0x4253bc <handle_connection_request(void*)>) at bnet_server_tcp.c:298
#2 0x000000000044c80d in main (argc=0, argv=0x7fffffffe7c8) at stored.c:306
Tags: No tags attached.

Activities

mvwieringen   developer   2014-02-11 11:01   ~0000809

A first analysis shows that it fails in do_listen_run().

The problem is that we wait up to 30 minutes for the SD-SD replication
job to start, but the total migration/copy run may take longer than
30 minutes to finish all jobs. The current code doesn't take this into
consideration and doesn't check whether a remote SD actually connected
to do the replication.

A first quick stopgap is to add a check that there really is an
authenticated storage daemon connection.
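
[Editor's note] For context, here is a minimal sketch of the wait pattern the backtraces show parked at
sd_cmds.c:243 (hypothetical names and a hand-rolled wait loop, not the Bareos lockmgr code): the listen
thread waits on a condition variable with a 30-minute deadline and, before the fix, fell through after
the timeout whether or not a remote SD had connected and authenticated.

   #include <pthread.h>
   #include <errno.h>
   #include <time.h>

   static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
   static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
   static bool authenticated = false;  // set when the remote SD authenticates

   bool wait_for_replication_peer(void)
   {
      struct timespec deadline;
      clock_gettime(CLOCK_REALTIME, &deadline);
      deadline.tv_sec += 30 * 60;      // the 30-minute window described above

      pthread_mutex_lock(&mutex);
      while (!authenticated) {
         if (pthread_cond_timedwait(&cond, &mutex, &deadline) == ETIMEDOUT) {
            break;                     // timed out: no peer ever connected
         }
      }
      bool ok = authenticated;         // the stopgap: only proceed when true
      pthread_mutex_unlock(&mutex);
      return ok;
   }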

mvwieringen   developer   2014-02-11 11:01

0001-Don-t-continue-when-sd-replication-socket-is-not-con.patch (1,031 bytes)
From c90ddfd3cfe7afe959f94d22d476d4a8b3d44504 Mon Sep 17 00:00:00 2001
From: Marco van Wieringen <marco.van.wieringen@bareos.com>
Date: Tue, 11 Feb 2014 09:49:56 +0100
Subject: [PATCH] Don't continue when sd replication socket is not connected.

---
 src/stored/sd_cmds.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/src/stored/sd_cmds.c b/src/stored/sd_cmds.c
index 26a0fe6..7995ff4 100644
--- a/src/stored/sd_cmds.c
+++ b/src/stored/sd_cmds.c
@@ -249,7 +249,15 @@ bool do_listen_run(JCR *jcr)
    Dmsg3(50, "Auth=%d canceled=%d errstat=%d\n", jcr->authenticated,
          job_canceled(jcr), errstat);
    V(mutex);
-   Dmsg2(800, "Auth fail or cancel for jid=%d %p\n", jcr->JobId, jcr);
+
+   if (!jcr->authenticated || !jcr->store_bsock) {
+      Dmsg2(800, "Auth fail or cancel for jid=%d %p\n", jcr->JobId, jcr);
+
+      /*
+       * After a listen cmd we are done e.g. return false.
+       */
+      return false;
+   }
 
    Dmsg1(120, "Start run Job=%s\n", jcr->Job);
 
-- 
1.7.3.2

mvwieringen   developer   2014-02-11 16:59   ~0000812

Fix committed to bareos master branch with changesetid 1489.

mvwieringen   developer   2014-02-12 19:48   ~0000815

The problem seems to be solved, but now the SD jobs hang while the DIR aborts the
long-running jobs.

mvwieringen   developer   2014-02-25 17:21   ~0000823

Fix committed to bareos bareos-13.2 branch with changesetid 1536.

maik   administrator   2014-09-05 16:49   ~0000970

Never got feedback; assuming that it works now. No more problems reported.

mvwieringen   developer   2015-03-25 16:51   ~0001412

Fix committed to bareos2015 bareos-14.2 branch with changesetid 4915.

joergs   developer   2015-03-25 19:19   ~0001565

Due to the reimport of the GitHub repository to bugs.bareos.org, the status of some tickets has been changed. These tickets will be closed again.
Sorry for the noise.

Related Changesets

bareos: master cf07f93d   2014-02-11 09:49   mvwieringen   Ported: N/A

SD to SD replication makes SD crash

The problem is that we wait up to 30 minutes for the SD-SD replication
job to start, but the total migration/copy run may take longer than
30 minutes to finish all jobs. The current code doesn't take this into
consideration and doesn't check whether a remote SD actually connected
to do the replication.

So the fix is to not continue when the SD replication socket is not
connected, and also to remove the timeout on starting an SD-SD
replication session. The normal FD-SD connection is protected with a
timeout so we don't hang when an FD never connects. As we support
canceling a storage job from the director, and the director should
cleanly cancel the storage job anyway when the copy or migration job
fails, this should be no problem.

In a normal backup/restore there are three daemons involved (director,
storage daemon, and file daemon), but with migration the director
controls everything and talks to at most two storage daemons.

Fixes 0000276: SD to SD replication makes SD crash
Affected Issues: 0000276
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c
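
[Editor's note] Read against the sketch in the first note, the described fix amounts to two changes
(shown here as a hedged sketch with hypothetical names, not the committed diff): drop the deadline
entirely and block until either the peer authenticates or the director cancels the job, then refuse
to start replication unless an authenticated socket is really there.

   #include <pthread.h>

   static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
   static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
   static bool authenticated = false;  // set when the remote SD authenticates
   static bool canceled = false;       // set when the director cancels the job

   // Elsewhere (sketch): the cancel handler sets 'canceled', and the accept
   // path sets 'authenticated', each under the mutex, then signals 'cond'.
   bool wait_for_replication_peer(void)
   {
      pthread_mutex_lock(&mutex);
      while (!authenticated && !canceled) {
         pthread_cond_wait(&cond, &mutex);  // no timeout: a cancel wakes us
      }
      bool ok = authenticated && !canceled;
      pthread_mutex_unlock(&mutex);
      return ok;                           // false => do not start replication
   }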

bareos: bareos-13.2 99c7b1ed   2014-02-11 09:49   mvwieringen   Ported: N/A

SD to SD replication makes SD crash

The problem is that we wait up to 30 minutes for the SD-SD replication
job to start, but the total migration/copy run may take longer than
30 minutes to finish all jobs. The current code doesn't take this into
consideration and doesn't check whether a remote SD actually connected
to do the replication.

So the fix is to not continue when the SD replication socket is not
connected, and also to remove the timeout on starting an SD-SD
replication session. The normal FD-SD connection is protected with a
timeout so we don't hang when an FD never connects. As we support
canceling a storage job from the director, and the director should
cleanly cancel the storage job anyway when the copy or migration job
fails, this should be no problem.

In a normal backup/restore there are three daemons involved (director,
storage daemon, and file daemon), but with migration the director
controls everything and talks to at most two storage daemons.

Fixes 0000276: SD to SD replication makes SD crash
Affected Issues: 0000276
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c

bareos2015: bareos-13.2 af148891   2014-02-11 10:49   mvwieringen   Ported: N/A

SD to SD replication makes SD crash

The problem is that we wait up to 30 minutes for the SD-SD replication
job to start, but the total migration/copy run may take longer than
30 minutes to finish all jobs. The current code doesn't take this into
consideration and doesn't check whether a remote SD actually connected
to do the replication.

So the fix is to not continue when the SD replication socket is not
connected, and also to remove the timeout on starting an SD-SD
replication session. The normal FD-SD connection is protected with a
timeout so we don't hang when an FD never connects. As we support
canceling a storage job from the director, and the director should
cleanly cancel the storage job anyway when the copy or migration job
fails, this should be no problem.

In a normal backup/restore there are three daemons involved (director,
storage daemon, and file daemon), but with migration the director
controls everything and talks to at most two storage daemons.

Fixes 0000276: SD to SD replication makes SD crash
Affected Issues: 0000276
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c

bareos2015: bareos-14.2 f1a80357   2014-02-11 10:49   mvwieringen   Ported: N/A

SD to SD replication makes SD crash

The problem is that we wait up to 30 minutes for the SD-SD replication
job to start, but the total migration/copy run may take longer than
30 minutes to finish all jobs. The current code doesn't take this into
consideration and doesn't check whether a remote SD actually connected
to do the replication.

So the fix is to not continue when the SD replication socket is not
connected, and also to remove the timeout on starting an SD-SD
replication session. The normal FD-SD connection is protected with a
timeout so we don't hang when an FD never connects. As we support
canceling a storage job from the director, and the director should
cleanly cancel the storage job anyway when the copy or migration job
fails, this should be no problem.

In a normal backup/restore there are three daemons involved (director,
storage daemon, and file daemon), but with migration the director
controls everything and talks to at most two storage daemons.

Fixes 0000276: SD to SD replication makes SD crash
Affected Issues: 0000276
mod - src/stored/dir_cmd.c
mod - src/stored/sd_cmds.c

Issue History

Date Modified Username Field Change
2014-02-11 10:52 mvwieringen New Issue
2014-02-11 10:52 mvwieringen Status new => assigned
2014-02-11 10:52 mvwieringen Assigned To => mvwieringen
2014-02-11 11:01 mvwieringen Note Added: 0000809
2014-02-11 11:01 mvwieringen File Added: 0001-Don-t-continue-when-sd-replication-socket-is-not-con.patch
2014-02-11 16:59 mvwieringen Changeset attached => bareos master cf07f93d
2014-02-11 16:59 mvwieringen Note Added: 0000812
2014-02-11 16:59 mvwieringen Status assigned => resolved
2014-02-11 16:59 mvwieringen Resolution open => fixed
2014-02-12 19:48 mvwieringen Note Added: 0000815
2014-02-12 19:48 mvwieringen Status resolved => feedback
2014-02-12 19:48 mvwieringen Resolution fixed => reopened
2014-02-25 17:21 mvwieringen Changeset attached => bareos bareos-13.2 99c7b1ed
2014-02-25 17:21 mvwieringen Note Added: 0000823
2014-02-25 17:21 mvwieringen Status feedback => assigned
2014-02-25 17:21 mvwieringen Status assigned => resolved
2014-02-28 18:49 mvwieringen Status resolved => feedback
2014-05-16 17:30 mvwieringen adm Assigned To mvwieringen =>
2014-09-05 16:49 maik Note Added: 0000970
2014-09-05 16:49 maik Status feedback => closed
2014-09-05 16:49 maik Assigned To => maik
2014-09-05 16:49 maik Resolution reopened => no change required
2015-01-26 16:23 mvwieringen adm Assigned To maik =>
2015-03-25 16:51 mvwieringen Changeset attached => bareos2015 bareos-13.2 af148891
2015-03-25 16:51 mvwieringen Changeset attached => bareos2015 bareos-14.2 f1a80357
2015-03-25 16:51 mvwieringen Note Added: 0001412
2015-03-25 16:51 mvwieringen Status closed => resolved
2015-03-25 16:51 mvwieringen Resolution no change required => fixed
2015-03-25 19:19 joergs Note Added: 0001565
2015-03-25 19:19 joergs Status resolved => closed