View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000909||bareos-core||[All Projects] director||public||2018-02-12 09:50||2019-10-18 11:37|
|Platform||Intel||OS||Debian Jessie||OS Version||8|
|Fixed in Version|
|Summary||0000909: "Reschedule on error" recognized, but not actually rescheduling the job|
|Description||I have been testing my backup's ability to recover from an error.|
I have a job that has the following settings...
Reschedule Interval = 1 minute
Reschedule On Error = yes
Reschedule Times = 5
... and I start it as a Full backup. I then restart the Bareos director (to error out the job intentionally).
The log shows that the job has been rescheduled, but the job never starts. It should have started at 10:20, yet by 10:26 "list jobs" in bconsole still reported nothing running.
=== LOG ===
01-Feb 10:19 server-dir JobId 569: Fatal error: Network error with FD during Backup: ERR=No data available
01-Feb 10:19 server-sd JobId 569: Fatal error: append.c:245 Network error reading from FD. ERR=No data available
01-Feb 10:19 server-sd JobId 569: Elapsed time=00:01:24, Transfer rate=112.6 M Bytes/second
01-Feb 10:19 server-dir JobId 569: Error: Director's comm line to SD dropped.
01-Feb 10:19 server-dir JobId 569: Fatal error: No Job status returned from FD.
01-Feb 10:19 server-dir JobId 569: Error: Bareos server-dir 17.2.4 (21Sep17):
Build OS: x86_64-pc-linux-gnu debian Debian GNU/Linux 8.0 (jessie)
Backup Level: Full
Client: "server-fd" 17.2.4 (21Sep17) x86_64-pc-linux-gnu,debian,Debian GNU/Linux 8.0 (jessie),Debian_8.0,x86_64
FileSet: "backupJobName" 2018-01-29 15:00:00
Pool: "6mo-Full" (From Job FullPool override)
Catalog: "MyCatalog" (From Client resource)
Storage: "Tape" (From Job resource)
Scheduled time: 01-Feb-2018 10:18:10
Start time: 01-Feb-2018 10:18:14
End time: 01-Feb-2018 10:19:45
Elapsed time: 1 min 31 secs
FD Files Written: 0
SD Files Written: 0
FD Bytes Written: 0 (0 B)
SD Bytes Written: 1,042 (1.042 KB)
Rate: 0.0 KB/s
Software Compression: None
Volume name(s): DL011BL7
Volume Session Id: 1
Volume Session Time: 1517476667
Last Volume Bytes: 5,035,703,887,872 (5.035 TB)
Non-fatal FD errors: 2
SD Errors: 0
FD termination status: Error
SD termination status: Error
Termination: *** Backup Error ***
01-Feb 10:19 server-dir JobId 569: Rescheduled Job backupJobName.2018-02-01_10.18.12_04 at 01-Feb-2018 10:19 to re-run in 60 seconds (01-Feb-2018 10:20).
01-Feb 10:19 server-dir JobId 569: Job backupJobName.2018-02-01_10.18.12_04 waiting 60 seconds for scheduled start time.
|Steps To Reproduce||I have scheduled a job with "reschedule on error"|
I have both started the job manually and let the scheduler start it.
I have tried killing the job BOTH by killing the core Bareos process with "kill -9" AND by simply restarting Bareos with the restart commands.
Regardless of the method to kill the job, the log recognizes the job ended on an error, and states it is rescheduling the job (in 60 seconds).
However, the job never actually restarts.
|Additional Information||See the main issue description for the log data|
|Tags||No tags attached.|
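For anyone who wants to check this from the log alone, here is a minimal sketch that parses the "Rescheduled Job" line quoted in the description and works out when the re-run was due. The regular expression is modelled only on that one log line, so the exact format is an assumption; the function names `parse_reschedule` and `is_overdue` and the `grace_seconds` parameter are hypothetical helpers, not part of Bareos. Month parsing with `%b` assumes an English/C locale.

```python
import re
from datetime import datetime, timedelta

# Pattern modelled on the single "Rescheduled Job" line in this report (assumed format).
RESCHEDULE_RE = re.compile(
    r"JobId (?P<jobid>\d+): Rescheduled Job (?P<job>\S+) "
    r"at (?P<at>\d{2}-\w{3}-\d{4} \d{2}:\d{2}) "
    r"to re-run in (?P<delay>\d+) seconds"
)


def parse_reschedule(line):
    """Return reschedule details from a director log line, or None if it does not match."""
    m = RESCHEDULE_RE.search(line)
    if m is None:
        return None
    logged_at = datetime.strptime(m.group("at"), "%d-%b-%Y %H:%M")
    delay = int(m.group("delay"))
    return {
        "jobid": int(m.group("jobid")),
        "job": m.group("job"),
        "logged_at": logged_at,
        "expected_start": logged_at + timedelta(seconds=delay),
    }


def is_overdue(info, now, grace_seconds=120):
    """True if the announced re-run time plus a grace period (hypothetical) has passed."""
    return now > info["expected_start"] + timedelta(seconds=grace_seconds)
```

Fed the line from the description, this gives an expected start of 01-Feb-2018 10:20, and a check at 10:26 reports the re-run as overdue. Whether the job actually ran still has to be confirmed separately, for example with "list jobs" in bconsole.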
Reschedule on error is not intended to cover Bareos Director restart.
However, it should work if you restart the fd.
Can we reopen this? I never got a notification that it was in progress.
So, is it indicative of a problem when the log TRIES to reschedule the job - but simply doesn't?
OK. It DID work when I killed the fd.
However, can you tell me what sorts of errors WILL trigger a restart (I don't see that in the manual). We're not just concerned with file errors, but also...
- Tape drive failure.
- Accidental system restart or server power failure.
- OS crash or hang.
- Daemon crashes or hangs.
|This can be marked as resolved|
|How was this resolved? Is there some kind of documentation by now?|
I don't see what kind of documentation you expect?
Reschedule on error does not work for a director restart (and was never intended to do this).
Its purpose is to rerun a job that failed.
So what else do you need?
Well, I did not reproduce this, but the log from rightmirem says that the job terminated with an error, and therefore it should be rescheduled, as far as I understand. https://docs.bareos.org/Configuration/Director.html?highlight=reschedule#config-Dir_Job_RescheduleOnError
Although I see that this feature is not intended to cover director crashes, the documentation should mention which kinds of failure trigger a reschedule and which do not (like the already mentioned 'Cancel').
However, the log should never report that a job is rescheduled if that job is not going to be executed.
The problem is probably that the director does not have a persistent schedule.
So when the job is rescheduled (and the reschedule log message is written) and then the director is restarted, the scheduling information is lost during the restart.
With the current design of rescheduling this cannot be fixed.
However, we can document that limitation.
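To illustrate the point, here is a toy model of a scheduler that keeps pending re-runs only in process memory. This is not Bareos code: `ToyDirector` and its methods are hypothetical, and the sketch only assumes what is described above, namely that the reschedule message is logged when the re-run is queued and that nothing is written to persistent storage. The defaults mirror the settings from the report (60 second interval, 5 attempts).

```python
import heapq


class ToyDirector:
    """Toy model only, not Bareos code: pending re-runs live in process memory."""

    def __init__(self):
        self.pending = []  # heap of (run_at, job_name, attempt)
        self.log = []

    def job_failed(self, job_name, now, attempt, interval=60, times=5):
        """Queue a re-run and log it, unless the attempt limit is reached."""
        if attempt >= times:
            self.log.append(f"{job_name}: giving up after {attempt} reschedules")
            return
        heapq.heappush(self.pending, (now + interval, job_name, attempt + 1))
        self.log.append(f"Rescheduled Job {job_name} to re-run in {interval} seconds")

    def due_jobs(self, now):
        """Pop and return the names of all jobs whose re-run time has arrived."""
        due = []
        while self.pending and self.pending[0][0] <= now:
            due.append(heapq.heappop(self.pending)[1])
        return due


# Case 1: the job fails (e.g. the fd is killed) while the director keeps running.
director = ToyDirector()
director.job_failed("backupJobName", now=0, attempt=0)
survives = director.due_jobs(now=60)  # ['backupJobName']

# Case 2: the job fails, the message is logged, then the director restarts.
director = ToyDirector()
director.job_failed("backupJobName", now=0, attempt=0)
director = ToyDirector()  # restart: a new process starts with an empty queue
lost = director.due_jobs(now=60)  # []
```

In the second case the "Rescheduled Job" message was already logged by the old process, which matches what the reporter saw: the log announces a re-run that the restarted director knows nothing about.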
|Thanks for the info! I would appreciate it if the documentation explained that behaviour; as far as I'm concerned, this ticket can then be closed.|
|2018-02-12 09:50||rightmirem||New Issue|
|2018-02-12 18:30||joergs||Note Added: 0002908|
|2018-02-20 15:01||joergs||Status||new => closed|
|2018-02-20 15:01||joergs||Resolution||open => no change required|
|2018-02-20 15:16||rightmirem||Note Added: 0002932|
|2018-02-20 15:16||rightmirem||Status||closed => feedback|
|2018-02-20 15:16||rightmirem||Resolution||no change required => reopened|
|2018-02-20 15:23||rightmirem||Note Added: 0002933|
|2018-02-20 15:23||rightmirem||Status||feedback => new|
|2018-03-13 12:16||rightmirem||Note Added: 0002945|
|2019-10-14 14:email@example.com||Note Added: 0003597|
|2019-10-16 10:08||arogge||Assigned To||=> arogge|
|2019-10-16 10:08||arogge||Status||new => feedback|
|2019-10-16 10:08||arogge||Note Added: 0003600|
|2019-10-18 11:firstname.lastname@example.org||Note Added: 0003606|
|2019-10-18 11:32||arogge||Note Added: 0003607|
|2019-10-18 11:email@example.com||Note Added: 0003608|