View Issue Details

IDProjectCategoryView StatusLast Update
0000763bareos-core[All Projects] directorpublic2017-06-08 15:23
ReporterrobhostAssigned Tojoergs 
PriorityhighSeveritymajorReproducibilityalways
Status resolvedResolutionfixed 
Platformx86_64OSCentOSOS Version7
Product Version16.2.4 
Target VersionFixed in Version 
Summary0000763: some Backups fail after 605 seconds in a schedule of 170 server
DescriptionSome Backups in a schedule of 170 server fail after 605 seconds (Fullbackup). It seems that the director can not open any more connection to the Clients or the Storage Daemon (separate server). In our setup are running 10 jobs parallel with a bandwidth of 800 mb/s (Maximum 3GB/s).

Log extract from a full backup:

09-Jan 11:46 backup JobId 26: Fatal error: Director unable to authenticate with Storage daemon at "storage.daemon.com:9103". Possible causes:
Passwords or names not the same or
TLS negotiation problem or
Maximum Concurrent Jobs exceeded on the SD or
SD networking messed up (restart daemon).
Please see http://doc.bareos.org/master/html/bareos-manual-main-reference.html#AuthorizationErrors for help.
09-Jan 11:46 backup director JobId 26: Error: Bareos backup director 16.2.4 (01Jul16):
  Build OS: x86_64-redhat-linux-gnu redhat Red Hat Enterprise Linux Server release 7.0 (Maipo)
  JobId: 26
  Job: SERVER.2017-01-09_11.00.00_29
  Backup Level: Full (upgraded from Differential)
  Client: "SERVER"
  FileSet: "SERVER" 2017-01-09 11:00:00
  Pool: "SERVER" (From Job resource)
  Catalog: "MyCatalog" (From Client resource)
  Storage: "SERVER" (From Job resource)
  Scheduled time: 09-Jan-2017 11:00:00
  Start time: 09-Jan-2017 11:36:38
  End time: 09-Jan-2017 11:46:43
  Elapsed time: 10 mins 5 secs
  Priority: 10
  FD Files Written: 0
  SD Files Written: 0
  FD Bytes Written: 0 (0 B)
  SD Bytes Written: 0 (0 B)
  Rate: 0.0 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Accurate: no
  Volume name(s):
  Volume Session Id: 0
  Volume Session Time: 0
  Last Volume Bytes: 0 (0 B)
  Non-fatal FD errors: 1
  SD Errors: 0
  FD termination status:
  SD termination status:
  FD Secure Erase Cmd: <NULL>
  SD Secure Erase Cmd: <NULL>
  Termination: *** Backup Error ***

But when we manually start a backup everything is fine:

06-Jan 12:47 backup director JobId 26: Bareos backup director 16.2.4 (01Jul16):
  Build OS: x86_64-redhat-linux-gnu redhat Red Hat Enterprise Linux Server release 7.0 (Maipo)
  JobId: 26
  Job: SERVER.2017-01-06_12.00.00_29
  Backup Level: Full (upgraded from Differential)
  Client: "SERVER" 16.2.4 (01Jul16) x86_64-redhat-linux-gnu,redhat,CentOS release 6.6 (Final),CentOS_6,x86_64
  FileSet: "SERVER" 2017-01-06 12:00:00
  Pool: "SERVER" (From Job resource)
  Catalog: "MyCatalog" (From Client resource)
  Storage: "SERVER" (From Job resource)
  Scheduled time: 06-Jan-2017 12:00:00
  Start time: 06-Jan-2017 12:28:28
  End time: 06-Jan-2017 12:47:14
  Elapsed time: 18 mins 46 secs
  Priority: 10
  FD Files Written: 227,394
  SD Files Written: 227,394
  FD Bytes Written: 8,556,018,039 (8.556 GB)
  SD Bytes Written: 8,598,189,206 (8.598 GB)
  Rate: 7598.6 KB/s
  Software Compression: 8.1 % (lzfast)
  VSS: no
  Encryption: no
  Accurate: no
  Volume name(s): SERVER-0030
  Volume Session Id: 26
  Volume Session Time: 1483698635
  Last Volume Bytes: 8,611,541,897 (8.611 GB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
Steps To ReproduceWe just have to start a new full backup that is scheduled, the same problem happens to differential backups.
TagsNo tags attached.
bareos-master: impactyes
bareos-master: actionwill care
bareos-18.2: impact
bareos-18.2: action
bareos-17.2: impact
bareos-17.2: action
bareos-16.2: impactyes
bareos-16.2: actionwill care
bareos-15.2: impactyes
bareos-15.2: action
bareos-14.2: impactno
bareos-14.2: action
bareos-13.2: impactno
bareos-13.2: action
bareos-12.4: impactno
bareos-12.4: action

Relationships

child of 0000794 closedstephand Release bareos-16.2.6 
child of 0000798 assignedmaik Release bareos-15.2.5 

Activities

robhost

robhost

2017-01-31 10:08

reporter   ~0002541

Last edited: 2017-01-31 11:19

View 5 revisions

We found the error. If we set the Maximum Concurrent Jobs settings like in the documentation(every number of Maximum Concurrent Jobs is the same FD,SD and Director), the bareos directors create this errors.
We set the Maximum Concurrent Jobs on the storage daemon to 10 and the Maximum Concurrent Jobs on the director also to 10.
If we now run a full backup of our 170 hosts and we have established 10 jobs to backup the server and 1 job is about to finish but has a delay and the director established already a new job(connection) to start the new backup, it will run in an hardcoded 600 second timeout. The connection is not closed and there are also no retries, managed by the scheduler.
So as a work around, we set the SD connections to (Director-Connections + 10) to avoid this behavior. If the connections are all used, it is also not possibly to check the status of a storage (we have multiple storage and devices to paralyze the backup process) via bconsole (status storage=<servername>).
This will also end in an timeout of 600 seconds.

dirkd-robhost

dirkd-robhost

2017-01-31 12:14

reporter  

0001-fix-director-connections-not-closing-on-job-slot-exhaustion.patch (401 bytes)
diff --git a/src/stored/dir_cmd.c b/src/stored/dir_cmd.c
index 3c25bea..16e36c0 100644
--- a/src/stored/dir_cmd.c
+++ b/src/stored/dir_cmd.c
@@ -235,6 +235,7 @@ void *handle_director_connection(BSOCK *dir)

    if (!count_running_jobs()) {
       Emsg0(M_ERROR, 0, _("Number of Jobs exhausted, please increase MaximumConcurrentJobs\n"));
+      dir->signal(BNET_TERMINATE);
       return NULL;
    }

dirkd-robhost

dirkd-robhost

2017-01-31 12:14

reporter   ~0002542

We further debugged the problem and discovered that - in addition to the configuration problem described in comment 0000763:0002541 - the extremely long no-op of 10 minutes we're seeing is related to a leaked socket in the StorageDaemon code.

The cause of this is https://github.com/bareos/bareos/blob/Release/16.2.4/src/stored/dir_cmd.c#L236-L239 where the socket handling function simply returns without closing the socket when there is no more room for another job due to the configured MaximumConcurrentJobs configuration.
This causes the handling thread on the StorageDaemon to exit and leak the socket. The connection is then closed from the initiating end (director) after hitting its hardcoded timeout auf 10 minutes.

Since this check was introduced in 15.2.3 every version since then should be affected by this behavior. We can only confirm this behavior for 16.2.4 though since that is what we used to test and debug it.

I've added a patch for review.
joergs

joergs

2017-03-21 11:43

administrator   ~0002615

Thank you for the patch. It is added to our intern integration system. It will be pushed to github soon.
joergs

joergs

2017-04-10 12:53

administrator   ~0002624

Fix committed to bareos master branch with changesetid 6944.
joergs

joergs

2017-06-08 15:23

administrator   ~0002652

Fix committed to bareos bareos-15.2 branch with changesetid 7023.

Related Changesets

bareos: master ef9eb60f

2017-03-20 12:03:24

joergs

Ported: N/A

Details Diff
bareos-sd: close network connection if job isn't started

If the Director tries to start a new job on the SD, the SD checks,
if the MaximumConcurrentJobs criteria meet. If not, no new job is
started.
This fix will close the network connection.

Fixes 0000763: some Backups fail after 605 seconds in a schedule of 170 server
Affected Issues
0000763
mod - src/stored/dir_cmd.c Diff File

bareos: bareos-15.2 adfecb42

2017-03-20 12:03:24

joergs

Ported: N/A

Details Diff
bareos-sd: close network connection if job isn't started

If the Director tries to start a new job on the SD, the SD checks,
if the MaximumConcurrentJobs criteria meet. If not, no new job is
started.
This fix will close the network connection.

Fixes 0000763: some Backups fail after 605 seconds in a schedule of 170 server
Affected Issues
0000763
mod - src/stored/dir_cmd.c Diff File

Issue History

Date Modified Username Field Change
2017-01-11 15:26 robhost New Issue
2017-01-31 10:08 robhost Note Added: 0002541
2017-01-31 10:15 robhost Note Edited: 0002541 View Revisions
2017-01-31 10:49 robhost Note Edited: 0002541 View Revisions
2017-01-31 10:52 robhost Note Edited: 0002541 View Revisions
2017-01-31 11:19 robhost Note Edited: 0002541 View Revisions
2017-01-31 12:14 dirkd-robhost File Added: 0001-fix-director-connections-not-closing-on-job-slot-exhaustion.patch
2017-01-31 12:14 dirkd-robhost Note Added: 0002542
2017-03-17 14:09 joergs bareos-master: impact => yes
2017-03-17 14:09 joergs bareos-master: action => will care
2017-03-17 14:09 joergs bareos-16.2: impact => yes
2017-03-17 14:09 joergs bareos-16.2: action => will care
2017-03-17 14:09 joergs bareos-15.2: impact => yes
2017-03-17 14:09 joergs bareos-14.2: impact => no
2017-03-17 14:09 joergs bareos-13.2: impact => no
2017-03-17 14:09 joergs bareos-12.4: impact => no
2017-03-17 14:12 joergs Assigned To => joergs
2017-03-17 14:12 joergs Status new => assigned
2017-03-20 11:59 joergs Relationship added child of 0000794
2017-03-20 13:08 joergs Relationship added child of 0000798
2017-03-21 11:43 joergs Note Added: 0002615
2017-04-10 12:53 joergs Changeset attached => bareos master ef9eb60f
2017-04-10 12:53 joergs Note Added: 0002624
2017-04-10 12:53 joergs Status assigned => resolved
2017-04-10 12:53 joergs Resolution open => fixed
2017-06-08 15:23 joergs Changeset attached => bareos bareos-15.2 adfecb42
2017-06-08 15:23 joergs Note Added: 0002652