View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001603 | bareos-core | director | public | 2024-02-26 13:16 | 2024-03-20 15:01 |
Reporter | Int | Assigned To | bruno-at-bareos | ||
Priority | normal | Severity | major | Reproducibility | always |
Status | resolved | Resolution | fixed | ||
Platform | Linux | OS | RHEL (and clones) | OS Version | 8 |
Product Version | 23.0.1 | ||||
Summary | 0001603: Labelling of fresh LTO-9 tapes fails with timeout error | ||||
Description | Since fresh LTO-9 tapes need to be calibrated by the tape drive on first load (which can take up to 2 hours - see https://www.quantum.com/globalassets/products/tape-storage-new/lto-9/lto-9-quantum-faq-092021.pdf) the labelling command fails with ERR=Child died from signal 15: Termination. Results=Program killed by BAREOS (timeout) | ||||
Steps To Reproduce | run command *label storage=Autochanger barcodes slot=11,12,13 | ||||
Additional Information | full output:
*label storage=Autochanger barcodes slot=11,12,13
Connecting to Storage daemon Autochanger at 192.168.124.209:9103 ...
3306 Issuing autochanger "list" command.
The following Volumes will be labeled:
Slot Volume
==============
11 NSL140L9
12 NSL141L9
13 NSL142L9
Do you want to label these Volumes? (yes|no): yes
...
Connecting to Storage daemon Autochanger at 192.168.124.209:9103 ...
Sending label command for Volume "NSL140L9" Slot 11 ...
3304 Issuing autochanger "load slot 11, drive 0" command.
3992 Bad autochanger "load slot 11, drive 0": ERR=Child died from signal 15: Termination.
Results=Program killed by BAREOS (timeout)
Label command failed for Volume NSL140L9.
Sending label command for Volume "NSL141L9" Slot 12 ...
3307 Issuing autochanger "unload slot 11, drive 0" command.
3995 Bad autochanger "unload slot 11, drive 0": ERR=Child exited with code 1
Results=Unloading drive 0 into Storage Element 11...mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Aborted Command
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 29
mtx: Request Sense: Additional Sense Qualifier = 07
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
MOVE MEDIUM from Element Address 32 to 266 Failed
Label command failed for Volume NSL141L9.
Sending label command for Volume "NSL142L9" Slot 13 ...
3991 Bad autochanger "loaded? drive 0" command: ERR=Child died from signal 15: Termination.
Results=Program killed by BAREOS (timeout)
3307 Issuing autochanger "unload slot 11, drive 0" command.
3995 Bad autochanger "unload slot 11, drive 0": ERR=Child exited with code 1
Results=Unloading drive 0 into Storage Element 11...mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Not Ready
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 04
mtx: Request Sense: Additional Sense Qualifier = 01
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
MOVE MEDIUM from Element Address 32 to 266 Failed
Label command failed for Volume NSL142L9.
* | ||||
Tags | No tags attached. | ||||
Adjusting the load_sleep value in mtx-changer.conf might help:
    # Set to amount of time in seconds to wait after a load
    load_sleep=0
It may also be necessary to modify the mtx-changer script itself to allow more time:
    while [ $i -le 300 ]; do # Wait max 300 seconds |
|
I decided against changing "load_sleep" because it would affect all tape loads, while a longer timeout is only needed on the first load. If every tape load had a delay of 2 hours, the backup process would be very tedious. I modified the wait_for_drive() function in the mtx-changer script instead:
    wait_for_drive() {
      i=0
      while [ $i -le 8000 ]; do # Wait max 2.22 hours - LTO-9 tapes need 2 hours calibration on first load
        debug "Doing mt -f $1 status ..."
        drivestatus=$(mt -f "$1" status 2>&1)
        if echo "${drivestatus}" | grep "${ready}" >/dev/null 2>&1; then
          break
        fi
        debug "${drivestatus}"
        debug "Device $1 - not ready, retrying ..."
        sleep 100 # was 'sleep 1' - do not poll the drive so often
        i=$((i + 100))
      done
    }
But this didn't work. I ran into the same error:
    Sending label command for Volume "NSL142L9" Slot 13 ...
    3304 Issuing autochanger "load slot 13, drive 0" command.
    3992 Bad autochanger "load slot 13, drive 0": ERR=Child died from signal 15: Termination.
    Results=Program killed by BAREOS (timeout)
    Label command failed for Volume NSL142L9.
The underlying problem is that the wait inside wait_for_drive() has no effect: the call to "mt -f /dev/nsa0 status" does not return at all while the tape drive is calibrating the LTO-9 tape. Even the original wait of 300 seconds would never have elapsed, because "mt -f /dev/nsa0 status" never returned. There seems to be another timeout somewhere that kicks in and kills the label command. |
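The blocking probe described above can be illustrated with a bounded status call. This is a minimal sketch, not the shipped mtx-changer code: it assumes a `timeout` utility (for example from GNU coreutils) is installed, and the names MT, READY_PATTERN, PROBE_TIMEOUT, POLL_INTERVAL and MAX_WAIT are hypothetical. It only keeps the polling loop from stalling on a single `mt` call. It does not change whichever outer timeout makes Bareos kill the changer script, and a probe stuck inside the kernel driver may not react to the signal.

```shell
#!/bin/sh
# Sketch only: all variable names below are illustrative assumptions,
# not settings of the real mtx-changer script.
MT="${MT:-mt}"                              # status probe command
READY_PATTERN="${READY_PATTERN:-ONLINE}"    # assumed "drive ready" marker
PROBE_TIMEOUT="${PROBE_TIMEOUT:-30}"        # seconds allowed per probe
POLL_INTERVAL="${POLL_INTERVAL:-60}"        # seconds between probes
MAX_WAIT="${MAX_WAIT:-8000}"                # total seconds before giving up

wait_for_drive() {
  device=$1
  # date +%s is widely supported but not required by POSIX.
  start=$(date +%s)
  while [ $(( $(date +%s) - start )) -le "$MAX_WAIT" ]; do
    # timeout ends the probe if it has not returned after
    # PROBE_TIMEOUT seconds, so the loop keeps making progress.
    drivestatus=$(timeout "$PROBE_TIMEOUT" "$MT" -f "$device" status 2>&1)
    if echo "$drivestatus" | grep "$READY_PATTERN" >/dev/null 2>&1; then
      return 0    # drive reported ready
    fi
    sleep "$POLL_INTERVAL"
  done
  return 1        # gave up after MAX_WAIT seconds
}
```

Measuring elapsed wall-clock time, rather than counting loop iterations, keeps the overall bound meaningful even when individual probes take a long time to return.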
|
We plan to introduce a new parameter for this timeout; see the proposal in https://github.com/bareos/bareos/pull/1740. You may want to test the script proposed in the PR directly. |
|
A new parameter, max_wait_drive=300, is proposed in the configuration file and used by the script. | |
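How a script might pick up such a parameter can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the helper name load_max_wait is hypothetical, the value is assumed to be a number of seconds like the original 300-second loop bound, and the configuration file is assumed to be a plain shell fragment that can be sourced.

```shell
#!/bin/sh
# Hypothetical helper: read max_wait_drive from a shell-style config file,
# falling back to 300 when the file or a usable value is missing.
load_max_wait() {
  conf_file=$1
  max_wait_drive=300              # default, matching the proposed value
  if [ -r "$conf_file" ]; then
    # The path should contain a slash, otherwise "." searches PATH.
    . "$conf_file"
  fi
  # Reject empty or non-numeric values instead of passing them to the loop.
  case "$max_wait_drive" in
    ''|*[!0-9]*) max_wait_drive=300 ;;
  esac
  echo "$max_wait_drive"
}
```

With this approach a site with fresh LTO-9 media could raise the value in its own configuration (the reporter used 8000 seconds in the loop bound), while other sites keep the shorter default.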
Date Modified | Username | Field | Change |
---|---|---|---|
2024-02-26 13:16 | Int | New Issue | |
2024-02-27 10:04 | bruno-at-bareos | Note Added: 0005810 | |
2024-02-27 11:24 | Int | Note Added: 0005811 | |
2024-03-20 15:00 | bruno-at-bareos | Note Added: 0005861 | |
2024-03-20 15:00 | bruno-at-bareos | Assigned To | => bruno-at-bareos |
2024-03-20 15:00 | bruno-at-bareos | Status | new => assigned |
2024-03-20 15:01 | bruno-at-bareos | Status | assigned => resolved |
2024-03-20 15:01 | bruno-at-bareos | Resolution | open => fixed |
2024-03-20 15:01 | bruno-at-bareos | Note Added: 0005862 |