0001603: Labelling of fresh LTO-9 tapes fails with timeout error - Bareos Bug Tracker

ID	Project	Category	View Status	Date Submitted	Last Update

0001603	bareos-core	director	public	2024-02-26 13:16	2024-03-20 15:01

Reporter	Int	Assigned To	bruno-at-bareos
Priority	normal	Severity	major	Reproducibility	always
Status	resolved	Resolution	fixed
Platform	Linux	OS	RHEL (and clones)	OS Version	8
Product Version	23.0.1

Summary	0001603: Labelling of fresh LTO-9 tapes fails with timeout error
Description	Since fresh LTO-9 tapes need to be calibrated by the tape drive on first load (which can take up to 2 hours - see https://www.quantum.com/globalassets/products/tape-storage-new/lto-9/lto-9-quantum-faq-092021.pdf) the labelling command fails with ERR=Child died from signal 15: Termination. Results=Program killed by BAREOS (timeout)
Steps To Reproduce	run command *label storage=Autochanger barcodes slot=11,12,13
Additional Information	full output: label storage=Autochanger barcodes slot=11,12,13 Connecting to Storage daemon Autochanger at 192.168.124.209:9103 ... 3306 Issuing autochanger "list" command. The following Volumes will be labeled: Slot Volume ============== 11 NSL140L9 12 NSL141L9 13 NSL142L9 Do you want to label these Volumes? (yes\|no): yes ... Connecting to Storage daemon Autochanger at 192.168.124.209:9103 ... Sending label command for Volume "NSL140L9" Slot 11 ... 3304 Issuing autochanger "load slot 11, drive 0" command. 3992 Bad autochanger "load slot 11, drive 0": ERR=Child died from signal 15: Termination. Results=Program killed by BAREOS (timeout) Label command failed for Volume NSL140L9. Sending label command for Volume "NSL141L9" Slot 12 ... 3307 Issuing autochanger "unload slot 11, drive 0" command. 3995 Bad autochanger "unload slot 11, drive 0": ERR=Child exited with code 1 Results=Unloading drive 0 into Storage Element 11...mtx: Request Sense: Long Report=yes mtx: Request Sense: Valid Residual=no mtx: Request Sense: Error Code=70 (Current) mtx: Request Sense: Sense Key=Aborted Command mtx: Request Sense: FileMark=no mtx: Request Sense: EOM=no mtx: Request Sense: ILI=no mtx: Request Sense: Additional Sense Code = 29 mtx: Request Sense: Additional Sense Qualifier = 07 mtx: Request Sense: BPV=no mtx: Request Sense: Error in CDB=no mtx: Request Sense: SKSV=no MOVE MEDIUM from Element Address 32 to 266 Failed Label command failed for Volume NSL141L9. Sending label command for Volume "NSL142L9" Slot 13 ... 3991 Bad autochanger "loaded? drive 0" command: ERR=Child died from signal 15: Termination. Results=Program killed by BAREOS (timeout) 3307 Issuing autochanger "unload slot 11, drive 0" command. 3995 Bad autochanger "unload slot 11, drive 0": ERR=Child exited with code 1 Results=Unloading drive 0 into Storage Element 11...mtx: Request Sense: Long Report=yes mtx: Request Sense: Valid Residual=no mtx: Request Sense: Error Code=70 (Current) mtx: Request Sense: Sense Key=Not Ready mtx: Request Sense: FileMark=no mtx: Request Sense: EOM=no mtx: Request Sense: ILI=no mtx: Request Sense: Additional Sense Code = 04 mtx: Request Sense: Additional Sense Qualifier = 01 mtx: Request Sense: BPV=no mtx: Request Sense: Error in CDB=no mtx: Request Sense: SKSV=no MOVE MEDIUM from Element Address 32 to 266 Failed Label command failed for Volume NSL142L9.
Tags	No tags attached.

bruno-at-bareos 2024-02-27 10:04 manager ~0005810	Maybe adjusting the mtx-changer.conf value of # Set to amount of time in seconds to wait after a load load_sleep=0 and also may it is needed to hack the mtx-changer script itself to add more time while [ $i -le 300 ]; do # Wait max 300 seconds

Int 2024-02-27 11:24 reporter ~0005811	I decided against changing "load_sleep" as this would affect all tape loads, but a longer timeout is only needed on the first load. If every tape load would have a delay of 2 hours the backup process would be very tedious. I modified the wait_for_drive() function in the mtx-changer script instead: wait_for_drive() { i=0 while [ $i -le 8000 ]; do # Wait max 2.22 hours - LTO-9 tapes need 2 hours calibration on first load debug "Doing mt -f $1 status ..." drivestatus=$(mt -f "$1" status 2>&1) if echo "${drivestatus}" \| grep "${ready}" >/dev/null 2>&1; then break fi debug "${drivestatus}" debug "Device $1 - not ready, retrying ..." sleep 100 #was 'sleep 1' - do not poll the drive so often i=`expr $i + 100` done } but this didn't work. I ran into the same error: Sending label command for Volume "NSL142L9" Slot 13 ... 3304 Issuing autochanger "load slot 13, drive 0" command. 3992 Bad autochanger "load slot 13, drive 0": ERR=Child died from signal 15: Termination. Results=Program killed by BAREOS (timeout) Label command failed for Volume NSL142L9. The problem behind this is that the wait inside the wait_for_drive() function has no effect as the call to "mt -f /dev/nsa0 status" does not return at all while the tape drive is calibrating the LTO-9 tape. So even the original wait of 300 seconds would not have elapsed as the call of "mt -f /dev/nsa0 status" never returned. There seems to be another timeout somewhere kicking in that kills the label command.

bruno-at-bareos 2024-03-20 15:00 manager ~0005861	We will normally introduce a new parameter for this timeout see the proposal in https://github.com/bareos/bareos/pull/1740 You may want to test directly the script proposed in the PR

bruno-at-bareos 2024-03-20 15:01 manager ~0005862	a new parameter max_wait_drive=300 is proposed in the configuration file and used by the script.

Date Modified	Username	Field	Change
2024-02-26 13:16	Int	New Issue
2024-02-27 10:04	bruno-at-bareos	Note Added: 0005810
2024-02-27 11:24	Int	Note Added: 0005811
2024-03-20 15:00	bruno-at-bareos	Note Added: 0005861
2024-03-20 15:00	bruno-at-bareos	Assigned To	=> bruno-at-bareos
2024-03-20 15:00	bruno-at-bareos	Status	new => assigned
2024-03-20 15:01	bruno-at-bareos	Status	assigned => resolved
2024-03-20 15:01	bruno-at-bareos	Resolution	open => fixed
2024-03-20 15:01	bruno-at-bareos	Note Added: 0005862