View Issue Details

IDProjectCategoryView StatusLast Update
0001603bareos-coredirectorpublic2024-03-20 15:01
ReporterInt Assigned Tobruno-at-bareos  
PrioritynormalSeveritymajorReproducibilityalways
Status resolvedResolutionfixed 
PlatformLinuxOSRHEL (and clones)OS Version8
Product Version23.0.1 
Summary0001603: Labelling of fresh LTO-9 tapes fails with timeout error
DescriptionSince fresh LTO-9 tapes need to be calibrated by the tape drive on first load (which can take up to 2 hours - see https://www.quantum.com/globalassets/products/tape-storage-new/lto-9/lto-9-quantum-faq-092021.pdf)
the labelling command fails with

ERR=Child died from signal 15: Termination.
Results=Program killed by BAREOS (timeout)
Steps To Reproducerun command

*label storage=Autochanger barcodes slot=11,12,13
Additional Informationfull output:

*label storage=Autochanger barcodes slot=11,12,13
Connecting to Storage daemon Autochanger at 192.168.124.209:9103 ...
3306 Issuing autochanger "list" command.
The following Volumes will be labeled:
Slot Volume
==============
  11 NSL140L9
  12 NSL141L9
  13 NSL142L9
Do you want to label these Volumes? (yes|no): yes

...

Connecting to Storage daemon Autochanger at 192.168.124.209:9103 ...
Sending label command for Volume "NSL140L9" Slot 11 ...
3304 Issuing autochanger "load slot 11, drive 0" command.
3992 Bad autochanger "load slot 11, drive 0": ERR=Child died from signal 15: Termination.
Results=Program killed by BAREOS (timeout)

Label command failed for Volume NSL140L9.
Sending label command for Volume "NSL141L9" Slot 12 ...
3307 Issuing autochanger "unload slot 11, drive 0" command.
3995 Bad autochanger "unload slot 11, drive 0": ERR=Child exited with code 1
Results=Unloading drive 0 into Storage Element 11...mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Aborted Command
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 29
mtx: Request Sense: Additional Sense Qualifier = 07
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
MOVE MEDIUM from Element Address 32 to 266 Failed

Label command failed for Volume NSL141L9.
Sending label command for Volume "NSL142L9" Slot 13 ...
3991 Bad autochanger "loaded? drive 0" command: ERR=Child died from signal 15: Termination.
Results=Program killed by BAREOS (timeout)

3307 Issuing autochanger "unload slot 11, drive 0" command.
3995 Bad autochanger "unload slot 11, drive 0": ERR=Child exited with code 1
Results=Unloading drive 0 into Storage Element 11...mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Not Ready
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 04
mtx: Request Sense: Additional Sense Qualifier = 01
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
MOVE MEDIUM from Element Address 32 to 266 Failed

Label command failed for Volume NSL142L9.
*
TagsNo tags attached.

Activities

bruno-at-bareos

bruno-at-bareos

2024-02-27 10:04

manager   ~0005810

Maybe adjusting the mtx-changer.conf value of

# Set to amount of time in seconds to wait after a load
load_sleep=0

and also may it is needed to hack the mtx-changer script itself to add more time

  while [ $i -le 300 ]; do # Wait max 300 seconds
Int

Int

2024-02-27 11:24

reporter   ~0005811

I decided against changing "load_sleep" as this would affect all tape loads, but a longer timeout is only needed on the first load. If every tape load would have a delay of 2 hours the backup process would be very tedious.

I modified the wait_for_drive() function in the mtx-changer script instead:

wait_for_drive() {
  i=0
  while [ $i -le 8000 ]; do # Wait max 2.22 hours - LTO-9 tapes need 2 hours calibration on first load
    debug "Doing mt -f $1 status ..."
    drivestatus=$(mt -f "$1" status 2>&1)
    if echo "${drivestatus}" | grep "${ready}" >/dev/null 2>&1; then
      break
    fi
    debug "${drivestatus}"
    debug "Device $1 - not ready, retrying ..."
    sleep 100 #was 'sleep 1' - do not poll the drive so often
    i=`expr $i + 100`
  done
}


but this didn't work.
I ran into the same error:

Sending label command for Volume "NSL142L9" Slot 13 ...
3304 Issuing autochanger "load slot 13, drive 0" command.
3992 Bad autochanger "load slot 13, drive 0": ERR=Child died from signal 15: Termination.
Results=Program killed by BAREOS (timeout)

Label command failed for Volume NSL142L9.


The problem behind this is that the wait inside the wait_for_drive() function has no effect as the call to
"mt -f /dev/nsa0 status"
does not return at all while the tape drive is calibrating the LTO-9 tape.
So even the original wait of 300 seconds would not have elapsed as the call of "mt -f /dev/nsa0 status" never returned.

There seems to be another timeout somewhere kicking in that kills the label command.
bruno-at-bareos

bruno-at-bareos

2024-03-20 15:00

manager   ~0005861

We will normally introduce a new parameter for this timeout

see the proposal in https://github.com/bareos/bareos/pull/1740

You may want to test directly the script proposed in the PR
bruno-at-bareos

bruno-at-bareos

2024-03-20 15:01

manager   ~0005862

a new parameter max_wait_drive=300 is proposed in the configuration file and used by the script.

Issue History

Date Modified Username Field Change
2024-02-26 13:16 Int New Issue
2024-02-27 10:04 bruno-at-bareos Note Added: 0005810
2024-02-27 11:24 Int Note Added: 0005811
2024-03-20 15:00 bruno-at-bareos Note Added: 0005861
2024-03-20 15:00 bruno-at-bareos Assigned To => bruno-at-bareos
2024-03-20 15:00 bruno-at-bareos Status new => assigned
2024-03-20 15:01 bruno-at-bareos Status assigned => resolved
2024-03-20 15:01 bruno-at-bareos Resolution open => fixed
2024-03-20 15:01 bruno-at-bareos Note Added: 0005862