0000909: "Reschedule on error" recognized, but not actually rescheduling the job

ID	Project	Category	View Status	Date Submitted	Last Update

0000909	bareos-core	director	public	2018-02-12 09:50	2023-09-12 16:25

Reporter	rightmirem	Assigned To	bruno-at-bareos
Priority	normal	Severity	major	Reproducibility	always
Status	closed	Resolution	reopened
Platform	Intel	OS	Debian Jessie	OS Version	8

Summary	0000909: "Reschedule on error" recognized, but not actually rescheduling the job
Description	I have been testing my backup's ability to recover from an error. I have a job that has the following settings:

Reschedule Interval = 1 minute
Reschedule On Error = yes
Reschedule Times = 5

I start it as a full. I then restart the Bareos director (to error out the job intentionally). In the log, it shows that the job has been rescheduled - but the job never starts.

The job should have started at 10:20. But by 10:26 there was nothing running being reported by "list jobs" in bconsole.

=== LOG ===
01-Feb 10:19 server-dir JobId 569: Fatal error: Network error with FD during Backup: ERR=No data available
01-Feb 10:19 server-sd JobId 569: Fatal error: append.c:245 Network error reading from FD. ERR=No data available
01-Feb 10:19 server-sd JobId 569: Elapsed time=00:01:24, Transfer rate=112.6 M Bytes/second
01-Feb 10:19 server-dir JobId 569: Error: Director's comm line to SD dropped.
01-Feb 10:19 server-dir JobId 569: Fatal error: No Job status returned from FD.
01-Feb 10:19 server-dir JobId 569: Error: Bareos server-dir 17.2.4 (21Sep17):
  Build OS: x86_64-pc-linux-gnu debian Debian GNU/Linux 8.0 (jessie)
  JobId: 569
  Job: backupJobName.2018-02-01_10.18.12_04
  Backup Level: Full
  Client: "server-fd" 17.2.4 (21Sep17) x86_64-pc-linux-gnu,debian,Debian GNU/Linux 8.0 (jessie),Debian_8.0,x86_64
  FileSet: "backupJobName" 2018-01-29 15:00:00
  Pool: "6mo-Full" (From Job FullPool override)
  Catalog: "MyCatalog" (From Client resource)
  Storage: "Tape" (From Job resource)
  Scheduled time: 01-Feb-2018 10:18:10
  Start time: 01-Feb-2018 10:18:14
  End time: 01-Feb-2018 10:19:45
  Elapsed time: 1 min 31 secs
  Priority: 10
  FD Files Written: 0
  SD Files Written: 0
  FD Bytes Written: 0 (0 B)
  SD Bytes Written: 1,042 (1.042 KB)
  Rate: 0.0 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Accurate: no
  Volume name(s): DL011BL7
  Volume Session Id: 1
  Volume Session Time: 1517476667
  Last Volume Bytes: 5,035,703,887,872 (5.035 TB)
  Non-fatal FD errors: 2
  SD Errors: 0
  FD termination status: Error
  SD termination status: Error
  Termination: * Backup Error *
01-Feb 10:19 server-dir JobId 569: Rescheduled Job backupJobName.2018-02-01_10.18.12_04 at 01-Feb-2018 10:19 to re-run in 60 seconds (01-Feb-2018 10:20).
01-Feb 10:19 server-dir JobId 569: Job backupJobName.2018-02-01_10.18.12_04 waiting 60 seconds for scheduled start time.
Steps To Reproduce	I have scheduled a job with "reschedule on error". I have both started the job manually and let the scheduler start it. I have tried killing the job BOTH by killing the core Bareos process with "kill -9" AND by simply restarting Bareos with the restart commands. Regardless of the method used to kill the job, the log recognizes that the job ended on an error and states that it is rescheduling the job (in 60 seconds). However, the job never actually restarts.
Additional Information	See the main issue description for the log data
Tags	No tags attached.

joergs 2018-02-12 18:30 developer ~0002908	Reschedule on error is not intended to cover Bareos Director restart. However, it should work if you restart the fd.

rightmirem 2018-02-20 15:16 reporter ~0002932	Can we reopen this? I never got notification that it was in progress. So, is it indicative of a problem when the log TRIES to reschedule the job - but simply doesn't?

rightmirem 2018-02-20 15:23 reporter ~0002933	OK. It DID work when I killed the fd. However, can you tell me what sorts of errors WILL trigger a restart? (I don't see that in the manual.) We're not just concerned with file errors, but also:
- Tape drive failure.
- Accidental system restart or server power failure.
- OS crash or hang.
- Daemon crashes or hangs.

rightmirem 2018-03-13 12:16 reporter ~0002945	This can be marked as resolved

b.braunger@syseleven.de 2019-10-14 14:33 reporter ~0003597	How was this resolved? Is there some kind of documentation by now?

arogge 2019-10-16 10:08 manager ~0003600	I don't see what kind of documentation you expect? Reschedule on error does not work for a director restart (and was never intended to do this). Its purpose is to rerun a job that failed. So what else do you need?

b.braunger@syseleven.de 2019-10-18 11:24 reporter ~0003606	Well, I did not reproduce this, but the log of rightmirem says that the job terminated with an error, and therefore it should be rescheduled as far as I understand. https://docs.bareos.org/Configuration/Director.html?highlight=reschedule#config-Dir_Job_RescheduleOnError Although I see that this feature is not intended to cover director crashes, the documentation should mention on what kind of failure a user can expect a reschedule and what does not trigger one (like the already mentioned 'Cancel'). However, the log should never report that a job is rescheduled if that job is not going to be executed.

arogge 2019-10-18 11:32 manager ~0003607	The problem is probably that the director does not have a persistent schedule. So when the job is rescheduled (and the reschedule log message is written) and then the director is restarted, the scheduling information is lost during the restart. With the current design of rescheduling this cannot be fixed. However, we can document that limitation.
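
To illustrate the limitation described in note ~0003607, here is a toy model in Python. It is not Bareos code and makes no claim about how the director is implemented; the names `ToyDirector`, `reschedule`, `due_jobs`, and `restart` are all hypothetical. The only assumption taken from the note is that the reschedule queue lives in memory while the log message is written somewhere that survives a restart.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class ToyDirector:
    """Hypothetical scheduler whose reschedule queue exists only in memory."""
    log: list = field(default_factory=list)       # stands in for the persistent job log
    pending: list = field(default_factory=list)   # in-memory heap of (run_at, job_name)

    def reschedule(self, job_name, now, delay):
        """Queue the job to re-run after `delay` seconds and log that fact."""
        run_at = now + delay
        heapq.heappush(self.pending, (run_at, job_name))
        self.log.append(f"Rescheduled Job {job_name} to re-run in {delay} seconds")
        return run_at

    def due_jobs(self, now):
        """Pop and return every queued job whose start time has been reached."""
        due = []
        while self.pending and self.pending[0][0] <= now:
            due.append(heapq.heappop(self.pending)[1])
        return due


def restart(old):
    """Simulate a process restart: the log survives, the in-memory queue does not."""
    return ToyDirector(log=list(old.log))


def demo():
    # Case 1: the process keeps running, so the rescheduled job becomes due.
    d = ToyDirector()
    d.reschedule("backupJobName", now=0, delay=60)
    ran_without_restart = d.due_jobs(now=60)

    # Case 2: the process is restarted before the job is due.
    d = ToyDirector()
    d.reschedule("backupJobName", now=0, delay=60)
    d = restart(d)
    ran_after_restart = d.due_jobs(now=60)

    return ran_without_restart, ran_after_restart, d.log
```

In the second case the log still contains the "Rescheduled Job" line, but nothing is left in the queue to run, which matches the symptom in the original report.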

b.braunger@syseleven.de 2019-10-18 11:37 reporter ~0003608	Thanks for the info! I would appreciate it if the doc explained that behaviour, and as far as I'm concerned this ticket can then be closed.

bozonius 2019-12-18 21:21 reporter ~0003704	When the director starts, couldn't it load the information from the database to re-discover jobs that have been rescheduled? This should also be applicable to the cancel waiting jobs issue (https://bugs.bareos.org/view.php?id=1148). I don't see why a complete re-write/re-design is necessary to accomplish either of these features, and they would add quite a bit of value to Bareos.

arogge 2019-12-19 10:14 manager ~0003706	What you're describing requires a redesign and probably also a major rewrite of the feature. It would have to work based on the job history in the catalog instead of the way it currently works.

bruno-at-bareos 2023-09-12 16:25 manager ~0005418	A warning was added to the documentation in PR1543.

Date Modified	Username	Field	Change
2018-02-12 09:50	rightmirem	New Issue
2018-02-12 18:30	joergs	Note Added: 0002908
2018-02-20 15:01	joergs	Status	new => closed
2018-02-20 15:01	joergs	Resolution	open => no change required
2018-02-20 15:16	rightmirem	Note Added: 0002932
2018-02-20 15:16	rightmirem	Status	closed => feedback
2018-02-20 15:16	rightmirem	Resolution	no change required => reopened
2018-02-20 15:23	rightmirem	Note Added: 0002933
2018-02-20 15:23	rightmirem	Status	feedback => new
2018-03-13 12:16	rightmirem	Note Added: 0002945
2019-10-14 14:33	b.braunger@syseleven.de	Note Added: 0003597
2019-10-16 10:08	arogge	Assigned To	=> arogge
2019-10-16 10:08	arogge	Status	new => feedback
2019-10-16 10:08	arogge	Note Added: 0003600
2019-10-18 11:24	b.braunger@syseleven.de	Note Added: 0003606
2019-10-18 11:32	arogge	Note Added: 0003607
2019-10-18 11:37	b.braunger@syseleven.de	Note Added: 0003608
2019-12-18 15:51	arogge	Assigned To	arogge =>
2019-12-18 15:51	arogge	Status	feedback => confirmed
2019-12-18 21:21	bozonius	Note Added: 0003704
2019-12-19 10:14	arogge	Note Added: 0003706
2023-07-31 15:11	bruno-at-bareos	Assigned To	=> bruno-at-bareos
2023-07-31 15:11	bruno-at-bareos	Status	confirmed => assigned
2023-09-12 16:25	bruno-at-bareos	Status	assigned => closed
2023-09-12 16:25	bruno-at-bareos	Note Added: 0005418

