View Issue Details

ID: 0000909
Project: bareos-core
Category: director
View Status: public
Last Update: 2023-09-12 16:25
Reporter: rightmirem
Assigned To: bruno-at-bareos
Priority: normal
Severity: major
Reproducibility: always
Status: closed
Resolution: reopened
Platform: Intel
OS: Debian Jessie
OS Version: 8
Summary: 0000909: "Reschedule on error" recognized, but not actually rescheduling the job
Description:
I have been testing my backup's ability to recover from an error.

I have a job that has the following settings...

  Reschedule Interval = 1 minute
  Reschedule On Error = yes
  Reschedule Times = 5

... and I start it as a full. I then restart the Bareos director (to error out the job intentionally).

The log shows that the job has been rescheduled, but the job never starts. It should have started at 10:20, but by 10:26 "list jobs" in bconsole still reported nothing running.

=== LOG ===
    01-Feb 10:19 server-dir JobId 569: Fatal error: Network error with FD during Backup: ERR=No data available
    01-Feb 10:19 server-sd JobId 569: Fatal error: append.c:245 Network error reading from FD. ERR=No data available
    01-Feb 10:19 server-sd JobId 569: Elapsed time=00:01:24, Transfer rate=112.6 M Bytes/second
    01-Feb 10:19 server-dir JobId 569: Error: Director's comm line to SD dropped.
    01-Feb 10:19 server-dir JobId 569: Fatal error: No Job status returned from FD.
    01-Feb 10:19 server-dir JobId 569: Error: Bareos server-dir 17.2.4 (21Sep17):
      Build OS: x86_64-pc-linux-gnu debian Debian GNU/Linux 8.0 (jessie)
      JobId: 569
      Job: backupJobName.2018-02-01_10.18.12_04
      Backup Level: Full
      Client: "server-fd" 17.2.4 (21Sep17) x86_64-pc-linux-gnu,debian,Debian GNU/Linux 8.0 (jessie),Debian_8.0,x86_64
      FileSet: "backupJobName" 2018-01-29 15:00:00
      Pool: "6mo-Full" (From Job FullPool override)
      Catalog: "MyCatalog" (From Client resource)
      Storage: "Tape" (From Job resource)
      Scheduled time: 01-Feb-2018 10:18:10
      Start time: 01-Feb-2018 10:18:14
      End time: 01-Feb-2018 10:19:45
      Elapsed time: 1 min 31 secs
      Priority: 10
      FD Files Written: 0
      SD Files Written: 0
      FD Bytes Written: 0 (0 B)
      SD Bytes Written: 1,042 (1.042 KB)
      Rate: 0.0 KB/s
      Software Compression: None
      VSS: no
      Encryption: no
      Accurate: no
      Volume name(s): DL011BL7
      Volume Session Id: 1
      Volume Session Time: 1517476667
      Last Volume Bytes: 5,035,703,887,872 (5.035 TB)
      Non-fatal FD errors: 2
      SD Errors: 0
      FD termination status: Error
      SD termination status: Error
      Termination: *** Backup Error ***

    01-Feb 10:19 server-dir JobId 569: Rescheduled Job backupJobName.2018-02-01_10.18.12_04 at 01-Feb-2018 10:19 to re-run in 60 seconds (01-Feb-2018 10:20).
    01-Feb 10:19 server-dir JobId 569: Job backupJobName.2018-02-01_10.18.12_04 waiting 60 seconds for scheduled start time.
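
The "Rescheduled Job" message above carries enough information to work out when the re-run was expected, which makes it easy to compare against what bconsole reports later. The following is only a sketch: the regular expression is derived from the single log line shown in this report (other Bareos versions may word it differently), `parse_reschedule` is a hypothetical helper and not part of Bareos, and the `%b` month parsing assumes an English/C locale.

```python
import re
from datetime import datetime, timedelta

# Pattern modelled on the one "Rescheduled Job" line quoted in this report.
RESCHEDULED = re.compile(
    r"JobId (?P<jobid>\d+): Rescheduled Job (?P<job>\S+) "
    r"at (?P<at>\d{2}-\w{3}-\d{4} \d{2}:\d{2}) "
    r"to re-run in (?P<delay>\d+) seconds"
)


def parse_reschedule(line):
    """Return (jobid, job name, expected re-run time), or None if the line does not match."""
    m = RESCHEDULED.search(line)
    if m is None:
        return None
    logged_at = datetime.strptime(m.group("at"), "%d-%b-%Y %H:%M")
    rerun_at = logged_at + timedelta(seconds=int(m.group("delay")))
    return int(m.group("jobid")), m.group("job"), rerun_at


line = (
    "01-Feb 10:19 server-dir JobId 569: Rescheduled Job "
    "backupJobName.2018-02-01_10.18.12_04 at 01-Feb-2018 10:19 "
    "to re-run in 60 seconds (01-Feb-2018 10:20)."
)
print(parse_reschedule(line))
```

If no job with that name shows up in "list jobs" after the computed time, the reschedule was logged but never carried out.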


Steps To Reproduce:
I have scheduled a job with "reschedule on error".

I have both started the job manually and let the scheduler start it.

I have tried killing the job BOTH by killing the core Bareos process with "kill -9" AND by simply restarting Bareos with the restart commands.

Regardless of the method used to kill the job, the log recognizes that the job ended with an error and states that it is rescheduling the job (in 60 seconds).

However, the job never actually restarts.
Additional Information: See the main issue description for the log data.
Tags: No tags attached.

Activities

joergs
2018-02-12 18:30
developer   ~0002908

Reschedule on error is not intended to cover Bareos Director restart.
However, it should work if you restart the fd.

rightmirem
2018-02-20 15:16
reporter   ~0002932

Can we reopen this? I never got a notification that it was in progress.

So, is it indicative of a problem when the log TRIES to reschedule the job - but simply doesn't?

rightmirem
2018-02-20 15:23
reporter   ~0002933

OK. It DID work when I killed the fd.

However, can you tell me what sorts of errors WILL trigger a restart? (I don't see that in the manual.) We're not just concerned with file errors, but also...

- Tape drive failure.
- Accidental system restart or server power failure.
- OS crash or hang.
- Daemon crashes or hangs.

rightmirem
2018-03-13 12:16
reporter   ~0002945

This can be marked as resolved

b.braunger@syseleven.de
2019-10-14 14:33
reporter   ~0003597

How was this resolved? Is there some kind of documentation by now?

arogge
2019-10-16 10:08
manager   ~0003600

I don't see what kind of documentation you expect?
Reschedule on error does not work for a director restart (and was never intended to do this).
Its purpose is to rerun a job that failed.

So what else do you need?

b.braunger@syseleven.de
2019-10-18 11:24
reporter   ~0003606

Well, I did not reproduce this, but rightmirem's log says that the job terminated with an error, and therefore it should be rescheduled, as far as I understand: https://docs.bareos.org/Configuration/Director.html?highlight=reschedule#config-Dir_Job_RescheduleOnError
Although I see that this feature is not intended to cover director crashes, the documentation should mention which kinds of failure trigger a reschedule and which do not (like the already mentioned 'Cancel').
However, the log should never report that a job is rescheduled if it is not going to be executed.

arogge
2019-10-18 11:32
manager   ~0003607

The problem is probably that the director does not have a persistent schedule.
So when the job is rescheduled (and the reschedule log message is written) and then the director is restarted, the scheduling information is lost during the restart.
With the current design of rescheduling this cannot be fixed.

However, we can document that limitation.
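
To illustrate the explanation above, here is a toy model in Python. It is not Bareos code: `ToyDirector`, its in-memory heap, and the `log` list are all assumptions made up for this sketch. It only shows how a reschedule message can survive in the log while the pending re-run itself disappears with the process.

```python
import heapq


class ToyDirector:
    """Toy stand-in for a scheduler that keeps pending re-runs only in memory."""

    def __init__(self):
        self.pending = []  # heap of (run_at_seconds, job_name); lives only in this process
        self.log = []      # stands in for log messages that are written out immediately

    def reschedule(self, job_name, now, delay):
        # Queue the re-run in memory and write the log message right away.
        heapq.heappush(self.pending, (now + delay, job_name))
        self.log.append(f"Rescheduled Job {job_name} to re-run in {delay} seconds")

    def due_jobs(self, now):
        # Pop every job whose scheduled time has been reached.
        due = []
        while self.pending and self.pending[0][0] <= now:
            due.append(heapq.heappop(self.pending)[1])
        return due


before = ToyDirector()
before.reschedule("backupJobName", now=0, delay=60)
saved_log = list(before.log)    # the log message has already been written

after = ToyDirector()           # "restart": a fresh process with an empty queue
print(saved_log)                # the reschedule message is still visible
print(after.due_jobs(now=120))  # nothing is due, because the queue was not carried over
```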

b.braunger@syseleven.de
2019-10-18 11:37
reporter   ~0003608

Thanks for the info! I would appreciate it if the docs explained that behaviour; as far as I'm concerned, this ticket can then be closed.

bozonius
2019-12-18 21:21
reporter   ~0003704

When the director starts, couldn't it load the information from the database to re-discover jobs that have been rescheduled?

This should be applicable to the cancel waiting jobs issue (https://bugs.bareos.org/view.php?id=1148) also.

I don't see where a complete re-write/re-design is necessary to accomplish either of these features, while adding quite a bit of value to Bareos.
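
As a rough illustration of the idea of re-discovering rescheduled jobs at startup, here is a minimal Python sketch using SQLite. Everything in it is hypothetical: the `pending_rerun` table, the helper names, and the file `toy_catalog.db` are invented for this example and do not reflect the Bareos catalog schema or how the Director is implemented.

```python
import os
import sqlite3
import tempfile


def open_store(path):
    """Open (or create) a tiny store for pending re-runs. Hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_rerun ("
        "job_name TEXT PRIMARY KEY, run_at INTEGER NOT NULL)"
    )
    return conn


def record_reschedule(conn, job_name, run_at):
    # Persist the pending re-run at the moment it is scheduled.
    conn.execute(
        "INSERT OR REPLACE INTO pending_rerun (job_name, run_at) VALUES (?, ?)",
        (job_name, run_at),
    )
    conn.commit()


def load_pending(conn):
    # At startup, read back every re-run that was still pending.
    rows = conn.execute("SELECT job_name, run_at FROM pending_rerun ORDER BY run_at")
    return rows.fetchall()


path = os.path.join(tempfile.mkdtemp(), "toy_catalog.db")

conn = open_store(path)
record_reschedule(conn, "backupJobName", 60)
conn.close()                 # simulate the process going away

conn = open_store(path)      # simulate the process starting again
restored = load_pending(conn)
conn.close()
print(restored)
```

The sketch only covers storing and reloading the pending entry; it says nothing about how much of the existing rescheduling logic would have to change to use such a store.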

arogge
2019-12-19 10:14
manager   ~0003706

What you're describing requires a redesign and probably also a major rewrite of the feature. It would have to work based on the job history in the catalog instead of the way it currently works.

bruno-at-bareos
2023-09-12 16:25
manager   ~0005418

A warning has been added to the documentation in PR1543.

Issue History

Date Modified     Username                 Field                Change
2018-02-12 09:50  rightmirem               New Issue
2018-02-12 18:30  joergs                   Note Added: 0002908
2018-02-20 15:01  joergs                   Status               new => closed
2018-02-20 15:01  joergs                   Resolution           open => no change required
2018-02-20 15:16  rightmirem               Note Added: 0002932
2018-02-20 15:16  rightmirem               Status               closed => feedback
2018-02-20 15:16  rightmirem               Resolution           no change required => reopened
2018-02-20 15:23  rightmirem               Note Added: 0002933
2018-02-20 15:23  rightmirem               Status               feedback => new
2018-03-13 12:16  rightmirem               Note Added: 0002945
2019-10-14 14:33  b.braunger@syseleven.de  Note Added: 0003597
2019-10-16 10:08  arogge                   Assigned To          => arogge
2019-10-16 10:08  arogge                   Status               new => feedback
2019-10-16 10:08  arogge                   Note Added: 0003600
2019-10-18 11:24  b.braunger@syseleven.de  Note Added: 0003606
2019-10-18 11:32  arogge                   Note Added: 0003607
2019-10-18 11:37  b.braunger@syseleven.de  Note Added: 0003608
2019-12-18 15:51  arogge                   Assigned To          arogge =>
2019-12-18 15:51  arogge                   Status               feedback => confirmed
2019-12-18 21:21  bozonius                 Note Added: 0003704
2019-12-19 10:14  arogge                   Note Added: 0003706
2023-07-31 15:11  bruno-at-bareos          Assigned To          => bruno-at-bareos
2023-07-31 15:11  bruno-at-bareos          Status               confirmed => assigned
2023-09-12 16:25  bruno-at-bareos          Status               assigned => closed
2023-09-12 16:25  bruno-at-bareos          Note Added: 0005418