0000763: some Backups fail after 605 seconds in a schedule of 170 server

ID	Project	Category	View Status	Date Submitted	Last Update

0000763	bareos-core	director	public	2017-01-11 15:26	2019-12-18 15:25

Reporter	robhost	Assigned To	joergs
Priority	high	Severity	major	Reproducibility	always
Status	closed	Resolution	fixed
Platform	x86_64	OS	CentOS	OS Version	7
Product Version	16.2.4

Summary	0000763: some Backups fail after 605 seconds in a schedule of 170 server
Description	Some Backups in a schedule of 170 server fail after 605 seconds (Fullbackup). It seems that the director can not open any more connection to the Clients or the Storage Daemon (separate server). In our setup are running 10 jobs parallel with a bandwidth of 800 mb/s (Maximum 3GB/s). Log extract from a full backup: 09-Jan 11:46 backup JobId 26: Fatal error: Director unable to authenticate with Storage daemon at "storage.daemon.com:9103". Possible causes: Passwords or names not the same or TLS negotiation problem or Maximum Concurrent Jobs exceeded on the SD or SD networking messed up (restart daemon). Please see http://doc.bareos.org/master/html/bareos-manual-main-reference.html#AuthorizationErrors for help. 09-Jan 11:46 backup director JobId 26: Error: Bareos backup director 16.2.4 (01Jul16): Build OS: x86_64-redhat-linux-gnu redhat Red Hat Enterprise Linux Server release 7.0 (Maipo) JobId: 26 Job: SERVER.2017-01-09_11.00.00_29 Backup Level: Full (upgraded from Differential) Client: "SERVER" FileSet: "SERVER" 2017-01-09 11:00:00 Pool: "SERVER" (From Job resource) Catalog: "MyCatalog" (From Client resource) Storage: "SERVER" (From Job resource) Scheduled time: 09-Jan-2017 11:00:00 Start time: 09-Jan-2017 11:36:38 End time: 09-Jan-2017 11:46:43 Elapsed time: 10 mins 5 secs Priority: 10 FD Files Written: 0 SD Files Written: 0 FD Bytes Written: 0 (0 B) SD Bytes Written: 0 (0 B) Rate: 0.0 KB/s Software Compression: None VSS: no Encryption: no Accurate: no Volume name(s): Volume Session Id: 0 Volume Session Time: 0 Last Volume Bytes: 0 (0 B) Non-fatal FD errors: 1 SD Errors: 0 FD termination status: SD termination status: FD Secure Erase Cmd: <NULL> SD Secure Erase Cmd: <NULL> Termination: * Backup Error * But when we manually start a backup everything is fine: 06-Jan 12:47 backup director JobId 26: Bareos backup director 16.2.4 (01Jul16): Build OS: x86_64-redhat-linux-gnu redhat Red Hat Enterprise Linux Server release 7.0 (Maipo) JobId: 26 Job: SERVER.2017-01-06_12.00.00_29 Backup Level: Full (upgraded from Differential) Client: "SERVER" 16.2.4 (01Jul16) x86_64-redhat-linux-gnu,redhat,CentOS release 6.6 (Final),CentOS_6,x86_64 FileSet: "SERVER" 2017-01-06 12:00:00 Pool: "SERVER" (From Job resource) Catalog: "MyCatalog" (From Client resource) Storage: "SERVER" (From Job resource) Scheduled time: 06-Jan-2017 12:00:00 Start time: 06-Jan-2017 12:28:28 End time: 06-Jan-2017 12:47:14 Elapsed time: 18 mins 46 secs Priority: 10 FD Files Written: 227,394 SD Files Written: 227,394 FD Bytes Written: 8,556,018,039 (8.556 GB) SD Bytes Written: 8,598,189,206 (8.598 GB) Rate: 7598.6 KB/s Software Compression: 8.1 % (lzfast) VSS: no Encryption: no Accurate: no Volume name(s): SERVER-0030 Volume Session Id: 26 Volume Session Time: 1483698635 Last Volume Bytes: 8,611,541,897 (8.611 GB) Non-fatal FD errors: 0 SD Errors: 0 FD termination status: OK SD termination status: OK Termination: Backup OK
Steps To Reproduce	We just have to start a new full backup that is scheduled, the same problem happens to differential backups.
Tags	No tags attached.

robhost 2017-01-31 10:08 reporter ~0002541 Last edited: 2017-01-31 11:19	We found the error. If we set the Maximum Concurrent Jobs settings like in the documentation(every number of Maximum Concurrent Jobs is the same FD,SD and Director), the bareos directors create this errors. We set the Maximum Concurrent Jobs on the storage daemon to 10 and the Maximum Concurrent Jobs on the director also to 10. If we now run a full backup of our 170 hosts and we have established 10 jobs to backup the server and 1 job is about to finish but has a delay and the director established already a new job(connection) to start the new backup, it will run in an hardcoded 600 second timeout. The connection is not closed and there are also no retries, managed by the scheduler. So as a work around, we set the SD connections to (Director-Connections + 10) to avoid this behavior. If the connections are all used, it is also not possibly to check the status of a storage (we have multiple storage and devices to paralyze the backup process) via bconsole (status storage=<servername>). This will also end in an timeout of 600 seconds.

dirkd-robhost 2017-01-31 12:14 reporter	0001-fix-director-connections-not-closing-on-job-slot-exhaustion.patch (401 bytes) diff --git a/src/stored/dir_cmd.c b/src/stored/dir_cmd.c index 3c25bea..16e36c0 100644 --- a/src/stored/dir_cmd.c +++ b/src/stored/dir_cmd.c @@ -235,6 +235,7 @@ void handle_director_connection(BSOCK dir) if (!count_running_jobs()) { Emsg0(M_ERROR, 0, _("Number of Jobs exhausted, please increase MaximumConcurrentJobs\n")); + dir->signal(BNET_TERMINATE); return NULL; } 0001-fix-director-connections-not-closing-on-job-slot-exhaustion.patch (401 bytes)

dirkd-robhost 2017-01-31 12:14 reporter ~0002542	We further debugged the problem and discovered that - in addition to the configuration problem described in comment 0000763:0002541 - the extremely long no-op of 10 minutes we're seeing is related to a leaked socket in the StorageDaemon code. The cause of this is https://github.com/bareos/bareos/blob/Release/16.2.4/src/stored/dir_cmd.c#L236-L239 where the socket handling function simply returns without closing the socket when there is no more room for another job due to the configured MaximumConcurrentJobs configuration. This causes the handling thread on the StorageDaemon to exit and leak the socket. The connection is then closed from the initiating end (director) after hitting its hardcoded timeout auf 10 minutes. Since this check was introduced in 15.2.3 every version since then should be affected by this behavior. We can only confirm this behavior for 16.2.4 though since that is what we used to test and debug it. I've added a patch for review.

joergs 2017-03-21 11:43 developer ~0002615	Thank you for the patch. It is added to our intern integration system. It will be pushed to github soon.

joergs 2017-04-10 12:53 developer ~0002624	Fix committed to bareos master branch with changesetid 6944.

joergs 2017-06-08 15:23 developer ~0002652	Fix committed to bareos bareos-15.2 branch with changesetid 7023.

bareos: master ef9eb60f 2017-03-20 13:03 joergs Ported: N/A Details Diff	bareos-sd: close network connection if job isn't started If the Director tries to start a new job on the SD, the SD checks, if the MaximumConcurrentJobs criteria meet. If not, no new job is started. This fix will close the network connection. Fixes 0000763: some Backups fail after 605 seconds in a schedule of 170 server	Affected Issues 0000763
	mod - src/stored/dir_cmd.c	Diff File
bareos: bareos-15.2 adfecb42 2017-03-20 13:03 joergs Ported: N/A Details Diff	bareos-sd: close network connection if job isn't started If the Director tries to start a new job on the SD, the SD checks, if the MaximumConcurrentJobs criteria meet. If not, no new job is started. This fix will close the network connection. Fixes 0000763: some Backups fail after 605 seconds in a schedule of 170 server	Affected Issues 0000763
	mod - src/stored/dir_cmd.c	Diff File

Date Modified	Username	Field	Change
2017-01-11 15:26	robhost	New Issue
2017-01-31 10:08	robhost	Note Added: 0002541
2017-01-31 10:15	robhost	Note Edited: 0002541
2017-01-31 10:49	robhost	Note Edited: 0002541
2017-01-31 10:52	robhost	Note Edited: 0002541
2017-01-31 11:19	robhost	Note Edited: 0002541
2017-01-31 12:14	dirkd-robhost	File Added: 0001-fix-director-connections-not-closing-on-job-slot-exhaustion.patch
2017-01-31 12:14	dirkd-robhost	Note Added: 0002542
2017-03-17 14:12	joergs	Assigned To	=> joergs
2017-03-17 14:12	joergs	Status	new => assigned
2017-03-20 11:59	joergs	Relationship added	child of 0000794
2017-03-20 13:08	joergs	Relationship added	child of 0000798
2017-03-21 11:43	joergs	Note Added: 0002615
2017-04-10 12:53	joergs	Changeset attached	=> bareos master ef9eb60f
2017-04-10 12:53	joergs	Note Added: 0002624
2017-04-10 12:53	joergs	Status	assigned => resolved
2017-04-10 12:53	joergs	Resolution	open => fixed
2017-06-08 15:23	joergs	Changeset attached	=> bareos bareos-15.2 adfecb42
2017-06-08 15:23	joergs	Note Added: 0002652
2019-12-18 15:25	arogge	Status	resolved => closed

Reporting new Issues is disabled, please Report new Issues at https://github.com/bareos/bareos/issues

View Issue Details

Relationships

Activities

Related Changesets

Issue History

child of	0000794	closed	stephand	Release bareos-16.2.6
child of	0000798	closed	maik	Release Bareos 15.2.5