Bareos Bug Tracker
Bareos Bug Tracker

View Issue Details Jump to Notes ] Related Changesets ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0000763bareos-core[All Projects] directorpublic2017-01-11 15:262017-06-08 15:23
Reporterrobhost 
Assigned Tojoergs 
PriorityhighSeveritymajorReproducibilityalways
StatusresolvedResolutionfixed 
Platformx86_64OSCentOSOS Version7
Product Version16.2.4 
Target VersionFixed in Version 
Summary0000763: some Backups fail after 605 seconds in a schedule of 170 server
DescriptionSome Backups in a schedule of 170 server fail after 605 seconds (Fullbackup). It seems that the director can not open any more connection to the Clients or the Storage Daemon (separate server). In our setup are running 10 jobs parallel with a bandwidth of 800 mb/s (Maximum 3GB/s).

Log extract from a full backup:

09-Jan 11:46 backup JobId 26: Fatal error: Director unable to authenticate with Storage daemon at "storage.daemon.com:9103". Possible causes:
Passwords or names not the same or
TLS negotiation problem or
Maximum Concurrent Jobs exceeded on the SD or
SD networking messed up (restart daemon).
Please see http://doc.bareos.org/master/html/bareos-manual-main-reference.html#AuthorizationErrors [^] for help.
09-Jan 11:46 backup director JobId 26: Error: Bareos backup director 16.2.4 (01Jul16):
  Build OS: x86_64-redhat-linux-gnu redhat Red Hat Enterprise Linux Server release 7.0 (Maipo)
  JobId: 26
  Job: SERVER.2017-01-09_11.00.00_29
  Backup Level: Full (upgraded from Differential)
  Client: "SERVER"
  FileSet: "SERVER" 2017-01-09 11:00:00
  Pool: "SERVER" (From Job resource)
  Catalog: "MyCatalog" (From Client resource)
  Storage: "SERVER" (From Job resource)
  Scheduled time: 09-Jan-2017 11:00:00
  Start time: 09-Jan-2017 11:36:38
  End time: 09-Jan-2017 11:46:43
  Elapsed time: 10 mins 5 secs
  Priority: 10
  FD Files Written: 0
  SD Files Written: 0
  FD Bytes Written: 0 (0 B)
  SD Bytes Written: 0 (0 B)
  Rate: 0.0 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Accurate: no
  Volume name(s):
  Volume Session Id: 0
  Volume Session Time: 0
  Last Volume Bytes: 0 (0 B)
  Non-fatal FD errors: 1
  SD Errors: 0
  FD termination status:
  SD termination status:
  FD Secure Erase Cmd: <NULL>
  SD Secure Erase Cmd: <NULL>
  Termination: *** Backup Error ***

But when we manually start a backup everything is fine:

06-Jan 12:47 backup director JobId 26: Bareos backup director 16.2.4 (01Jul16):
  Build OS: x86_64-redhat-linux-gnu redhat Red Hat Enterprise Linux Server release 7.0 (Maipo)
  JobId: 26
  Job: SERVER.2017-01-06_12.00.00_29
  Backup Level: Full (upgraded from Differential)
  Client: "SERVER" 16.2.4 (01Jul16) x86_64-redhat-linux-gnu,redhat,CentOS release 6.6 (Final),CentOS_6,x86_64
  FileSet: "SERVER" 2017-01-06 12:00:00
  Pool: "SERVER" (From Job resource)
  Catalog: "MyCatalog" (From Client resource)
  Storage: "SERVER" (From Job resource)
  Scheduled time: 06-Jan-2017 12:00:00
  Start time: 06-Jan-2017 12:28:28
  End time: 06-Jan-2017 12:47:14
  Elapsed time: 18 mins 46 secs
  Priority: 10
  FD Files Written: 227,394
  SD Files Written: 227,394
  FD Bytes Written: 8,556,018,039 (8.556 GB)
  SD Bytes Written: 8,598,189,206 (8.598 GB)
  Rate: 7598.6 KB/s
  Software Compression: 8.1 % (lzfast)
  VSS: no
  Encryption: no
  Accurate: no
  Volume name(s): SERVER-0030
  Volume Session Id: 26
  Volume Session Time: 1483698635
  Last Volume Bytes: 8,611,541,897 (8.611 GB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
Steps To ReproduceWe just have to start a new full backup that is scheduled, the same problem happens to differential backups.
TagsNo tags attached.
bareos-master: impactyes
bareos-master: actionwill care
bareos-16.2: impactyes
bareos-16.2: actionwill care
bareos-15.2: impactyes
bareos-15.2: action
bareos-14.2: impactno
bareos-14.2: action
bareos-13.2: impactno
bareos-13.2: action
bareos-12.4: impactno
bareos-12.4: action
Attached Filespatch file icon 0001-fix-director-connections-not-closing-on-job-slot-exhaustion.patch [^] (401 bytes) 2017-01-31 12:14 [Show Content]

- Relationships
child of 0000794assignedstephand Release bareos-16.2.6 
child of 0000798assignedmaik Release bareos-15.2.5 

-  Notes
(0002541)
robhost (reporter)
2017-01-31 10:08
edited on: 2017-01-31 11:19

We found the error. If we set the Maximum Concurrent Jobs settings like in the documentation(every number of Maximum Concurrent Jobs is the same FD,SD and Director), the bareos directors create this errors.
We set the Maximum Concurrent Jobs on the storage daemon to 10 and the Maximum Concurrent Jobs on the director also to 10.
If we now run a full backup of our 170 hosts and we have established 10 jobs to backup the server and 1 job is about to finish but has a delay and the director established already a new job(connection) to start the new backup, it will run in an hardcoded 600 second timeout. The connection is not closed and there are also no retries, managed by the scheduler.
So as a work around, we set the SD connections to (Director-Connections + 10) to avoid this behavior. If the connections are all used, it is also not possibly to check the status of a storage (we have multiple storage and devices to paralyze the backup process) via bconsole (status storage=<servername>).
This will also end in an timeout of 600 seconds.

(0002542)
dirkd-robhost (reporter)
2017-01-31 12:14

We further debugged the problem and discovered that - in addition to the configuration problem described in comment 0000763:0002541 - the extremely long no-op of 10 minutes we're seeing is related to a leaked socket in the StorageDaemon code.

The cause of this is https://github.com/bareos/bareos/blob/Release/16.2.4/src/stored/dir_cmd.c#L236-L239 [^] where the socket handling function simply returns without closing the socket when there is no more room for another job due to the configured MaximumConcurrentJobs configuration.
This causes the handling thread on the StorageDaemon to exit and leak the socket. The connection is then closed from the initiating end (director) after hitting its hardcoded timeout auf 10 minutes.

Since this check was introduced in 15.2.3 every version since then should be affected by this behavior. We can only confirm this behavior for 16.2.4 though since that is what we used to test and debug it.

I've added a patch for review.
(0002615)
joergs (administrator)
2017-03-21 11:43

Thank you for the patch. It is added to our intern integration system. It will be pushed to github soon.
(0002624)
joergs (administrator)
2017-04-10 12:53

Fix committed to bareos master branch with changesetid 6944.
(0002652)
joergs (administrator)
2017-06-08 15:23

Fix committed to bareos bareos-15.2 branch with changesetid 7023.

- Related Changesets
bareos: master ef9eb60f
Timestamp: 2017-03-20 12:03:24
Author: joergs
Ported: N/A
Details ] Diff ]
bareos-sd: close network connection if job isn't started

If the Director tries to start a new job on the SD, the SD checks,
if the MaximumConcurrentJobs criteria meet. If not, no new job is
started.
This fix will close the network connection.

Fixes 0000763: some Backups fail after 605 seconds in a schedule of 170 server
mod - src/stored/dir_cmd.c Diff ] File ]
bareos: bareos-15.2 adfecb42
Timestamp: 2017-03-20 12:03:24
Author: joergs
Ported: N/A
Details ] Diff ]
bareos-sd: close network connection if job isn't started

If the Director tries to start a new job on the SD, the SD checks,
if the MaximumConcurrentJobs criteria meet. If not, no new job is
started.
This fix will close the network connection.

Fixes 0000763: some Backups fail after 605 seconds in a schedule of 170 server
mod - src/stored/dir_cmd.c Diff ] File ]

- Issue History
Date Modified Username Field Change
2017-01-11 15:26 robhost New Issue
2017-01-31 10:08 robhost Note Added: 0002541
2017-01-31 10:15 robhost Note Edited: 0002541 View Revisions
2017-01-31 10:49 robhost Note Edited: 0002541 View Revisions
2017-01-31 10:52 robhost Note Edited: 0002541 View Revisions
2017-01-31 11:19 robhost Note Edited: 0002541 View Revisions
2017-01-31 12:14 dirkd-robhost File Added: 0001-fix-director-connections-not-closing-on-job-slot-exhaustion.patch
2017-01-31 12:14 dirkd-robhost Note Added: 0002542
2017-03-17 14:09 joergs bareos-master: impact => yes
2017-03-17 14:09 joergs bareos-master: action => will care
2017-03-17 14:09 joergs bareos-16.2: impact => yes
2017-03-17 14:09 joergs bareos-16.2: action => will care
2017-03-17 14:09 joergs bareos-15.2: impact => yes
2017-03-17 14:09 joergs bareos-14.2: impact => no
2017-03-17 14:09 joergs bareos-13.2: impact => no
2017-03-17 14:09 joergs bareos-12.4: impact => no
2017-03-17 14:12 joergs Assigned To => joergs
2017-03-17 14:12 joergs Status new => assigned
2017-03-20 11:59 joergs Relationship added child of 0000794
2017-03-20 13:08 joergs Relationship added child of 0000798
2017-03-21 11:43 joergs Note Added: 0002615
2017-04-10 12:53 joergs Changeset attached => bareos master ef9eb60f
2017-04-10 12:53 joergs Note Added: 0002624
2017-04-10 12:53 joergs Status assigned => resolved
2017-04-10 12:53 joergs Resolution open => fixed
2017-06-08 15:23 joergs Changeset attached => bareos bareos-15.2 adfecb42
2017-06-08 15:23 joergs Note Added: 0002652


Copyright © 2000 - 2017 MantisBT Team
Powered by Mantis Bugtracker