View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000763 | bareos-core | director | public | 2017-01-11 15:26 | 2019-12-18 15:25 |
Reporter | robhost | Assigned To | joergs | ||
Priority | high | Severity | major | Reproducibility | always |
Status | closed | Resolution | fixed | ||
Platform | x86_64 | OS | CentOS | OS Version | 7 |
Product Version | 16.2.4 | ||||
Summary | 0000763: some Backups fail after 605 seconds in a schedule of 170 server | ||||
Description | Some Backups in a schedule of 170 server fail after 605 seconds (Fullbackup). It seems that the director can not open any more connection to the Clients or the Storage Daemon (separate server). In our setup are running 10 jobs parallel with a bandwidth of 800 mb/s (Maximum 3GB/s). Log extract from a full backup: 09-Jan 11:46 backup JobId 26: Fatal error: Director unable to authenticate with Storage daemon at "storage.daemon.com:9103". Possible causes: Passwords or names not the same or TLS negotiation problem or Maximum Concurrent Jobs exceeded on the SD or SD networking messed up (restart daemon). Please see http://doc.bareos.org/master/html/bareos-manual-main-reference.html#AuthorizationErrors for help. 09-Jan 11:46 backup director JobId 26: Error: Bareos backup director 16.2.4 (01Jul16): Build OS: x86_64-redhat-linux-gnu redhat Red Hat Enterprise Linux Server release 7.0 (Maipo) JobId: 26 Job: SERVER.2017-01-09_11.00.00_29 Backup Level: Full (upgraded from Differential) Client: "SERVER" FileSet: "SERVER" 2017-01-09 11:00:00 Pool: "SERVER" (From Job resource) Catalog: "MyCatalog" (From Client resource) Storage: "SERVER" (From Job resource) Scheduled time: 09-Jan-2017 11:00:00 Start time: 09-Jan-2017 11:36:38 End time: 09-Jan-2017 11:46:43 Elapsed time: 10 mins 5 secs Priority: 10 FD Files Written: 0 SD Files Written: 0 FD Bytes Written: 0 (0 B) SD Bytes Written: 0 (0 B) Rate: 0.0 KB/s Software Compression: None VSS: no Encryption: no Accurate: no Volume name(s): Volume Session Id: 0 Volume Session Time: 0 Last Volume Bytes: 0 (0 B) Non-fatal FD errors: 1 SD Errors: 0 FD termination status: SD termination status: FD Secure Erase Cmd: <NULL> SD Secure Erase Cmd: <NULL> Termination: *** Backup Error *** But when we manually start a backup everything is fine: 06-Jan 12:47 backup director JobId 26: Bareos backup director 16.2.4 (01Jul16): Build OS: x86_64-redhat-linux-gnu redhat Red Hat Enterprise Linux Server release 7.0 (Maipo) JobId: 26 Job: SERVER.2017-01-06_12.00.00_29 Backup Level: Full (upgraded from Differential) Client: "SERVER" 16.2.4 (01Jul16) x86_64-redhat-linux-gnu,redhat,CentOS release 6.6 (Final),CentOS_6,x86_64 FileSet: "SERVER" 2017-01-06 12:00:00 Pool: "SERVER" (From Job resource) Catalog: "MyCatalog" (From Client resource) Storage: "SERVER" (From Job resource) Scheduled time: 06-Jan-2017 12:00:00 Start time: 06-Jan-2017 12:28:28 End time: 06-Jan-2017 12:47:14 Elapsed time: 18 mins 46 secs Priority: 10 FD Files Written: 227,394 SD Files Written: 227,394 FD Bytes Written: 8,556,018,039 (8.556 GB) SD Bytes Written: 8,598,189,206 (8.598 GB) Rate: 7598.6 KB/s Software Compression: 8.1 % (lzfast) VSS: no Encryption: no Accurate: no Volume name(s): SERVER-0030 Volume Session Id: 26 Volume Session Time: 1483698635 Last Volume Bytes: 8,611,541,897 (8.611 GB) Non-fatal FD errors: 0 SD Errors: 0 FD termination status: OK SD termination status: OK Termination: Backup OK | ||||
Steps To Reproduce | We just have to start a new full backup that is scheduled, the same problem happens to differential backups. | ||||
Tags | No tags attached. | ||||
We found the error. If we set the Maximum Concurrent Jobs settings like in the documentation(every number of Maximum Concurrent Jobs is the same FD,SD and Director), the bareos directors create this errors. We set the Maximum Concurrent Jobs on the storage daemon to 10 and the Maximum Concurrent Jobs on the director also to 10. If we now run a full backup of our 170 hosts and we have established 10 jobs to backup the server and 1 job is about to finish but has a delay and the director established already a new job(connection) to start the new backup, it will run in an hardcoded 600 second timeout. The connection is not closed and there are also no retries, managed by the scheduler. So as a work around, we set the SD connections to (Director-Connections + 10) to avoid this behavior. If the connections are all used, it is also not possibly to check the status of a storage (we have multiple storage and devices to paralyze the backup process) via bconsole (status storage=<servername>). This will also end in an timeout of 600 seconds. |
|
0001-fix-director-connections-not-closing-on-job-slot-exhaustion.patch (401 bytes)
diff --git a/src/stored/dir_cmd.c b/src/stored/dir_cmd.c index 3c25bea..16e36c0 100644 --- a/src/stored/dir_cmd.c +++ b/src/stored/dir_cmd.c @@ -235,6 +235,7 @@ void *handle_director_connection(BSOCK *dir) if (!count_running_jobs()) { Emsg0(M_ERROR, 0, _("Number of Jobs exhausted, please increase MaximumConcurrentJobs\n")); + dir->signal(BNET_TERMINATE); return NULL; } |
|
We further debugged the problem and discovered that - in addition to the configuration problem described in comment 0000763:0002541 - the extremely long no-op of 10 minutes we're seeing is related to a leaked socket in the StorageDaemon code. The cause of this is https://github.com/bareos/bareos/blob/Release/16.2.4/src/stored/dir_cmd.c#L236-L239 where the socket handling function simply returns without closing the socket when there is no more room for another job due to the configured MaximumConcurrentJobs configuration. This causes the handling thread on the StorageDaemon to exit and leak the socket. The connection is then closed from the initiating end (director) after hitting its hardcoded timeout auf 10 minutes. Since this check was introduced in 15.2.3 every version since then should be affected by this behavior. We can only confirm this behavior for 16.2.4 though since that is what we used to test and debug it. I've added a patch for review. |
|
Thank you for the patch. It is added to our intern integration system. It will be pushed to github soon. | |
Fix committed to bareos master branch with changesetid 6944. | |
Fix committed to bareos bareos-15.2 branch with changesetid 7023. | |
bareos: master ef9eb60f 2017-03-20 13:03 Ported: N/A Details Diff |
bareos-sd: close network connection if job isn't started If the Director tries to start a new job on the SD, the SD checks, if the MaximumConcurrentJobs criteria meet. If not, no new job is started. This fix will close the network connection. Fixes 0000763: some Backups fail after 605 seconds in a schedule of 170 server |
Affected Issues 0000763 |
|
mod - src/stored/dir_cmd.c | Diff File | ||
bareos: bareos-15.2 adfecb42 2017-03-20 13:03 Ported: N/A Details Diff |
bareos-sd: close network connection if job isn't started If the Director tries to start a new job on the SD, the SD checks, if the MaximumConcurrentJobs criteria meet. If not, no new job is started. This fix will close the network connection. Fixes 0000763: some Backups fail after 605 seconds in a schedule of 170 server |
Affected Issues 0000763 |
|
mod - src/stored/dir_cmd.c | Diff File |
Date Modified | Username | Field | Change |
---|---|---|---|
2017-01-11 15:26 | robhost | New Issue | |
2017-01-31 10:08 | robhost | Note Added: 0002541 | |
2017-01-31 10:15 | robhost | Note Edited: 0002541 | |
2017-01-31 10:49 | robhost | Note Edited: 0002541 | |
2017-01-31 10:52 | robhost | Note Edited: 0002541 | |
2017-01-31 11:19 | robhost | Note Edited: 0002541 | |
2017-01-31 12:14 | dirkd-robhost | File Added: 0001-fix-director-connections-not-closing-on-job-slot-exhaustion.patch | |
2017-01-31 12:14 | dirkd-robhost | Note Added: 0002542 | |
2017-03-17 14:12 | joergs | Assigned To | => joergs |
2017-03-17 14:12 | joergs | Status | new => assigned |
2017-03-20 11:59 | joergs | Relationship added | child of 0000794 |
2017-03-20 13:08 | joergs | Relationship added | child of 0000798 |
2017-03-21 11:43 | joergs | Note Added: 0002615 | |
2017-04-10 12:53 | joergs | Changeset attached | => bareos master ef9eb60f |
2017-04-10 12:53 | joergs | Note Added: 0002624 | |
2017-04-10 12:53 | joergs | Status | assigned => resolved |
2017-04-10 12:53 | joergs | Resolution | open => fixed |
2017-06-08 15:23 | joergs | Changeset attached | => bareos bareos-15.2 adfecb42 |
2017-06-08 15:23 | joergs | Note Added: 0002652 | |
2019-12-18 15:25 | arogge | Status | resolved => closed |