View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001421 | bareos-core | storage daemon | public | 2022-01-17 17:06 | 2024-03-25 15:10 |
Reporter | DemoFreak | Assigned To | bruno-at-bareos | ||
Priority | normal | Severity | minor | Reproducibility | always |
Status | closed | Resolution | reopened | ||
Platform | x86_64 | OS | Opensuse | OS Version | Leap 15.3 |
Product Version | 21.0.0 | ||||
Summary | 0001421: MTEOM on LTO-3 fails with Bareos 21, but works on older Bacula | ||||
Description | After migrating a file server, the backup was switched from Bacula 5.2 to Bareos 21.0. Transferring the configuration works flawlessly, everything works as desired except for the tape drive. Appending another backup on an already used medium fails with kernel: [586704.090320] st 8:0:0:0: [st0] Sense Key : Medium Error [current]. kernel: [586704.090327] st 8:0:0:0: [st0] Add. Sense: Recorded entity not found the tape is marked as "Error" in the catalog. The test with btape results consequently in a problem with EOD (MTEOM). After completing the storage configuration with Hardware End of Medium = no Fast Forward Space File = no appending works, but is extremely slow, as also mentioned in the documentation. Hardware: - Fibre Channel: QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA - Drive 'HP Ultrium 3-SCSI Rev. L63S' The drive and HBA were transferred from the old system to the new system without any changes. How can I further isolate the problem? Does Bareos work differently than Bacula 5.2 regarding EOD? | ||||
Tags | storage MTEOM | ||||
It seems that even the slow (software) method sometimes fails. Here is the corresponding excerpt from the log. First job on the tape: 17-Jan 11:00 bareos-sd JobId 81: Wrote label to prelabeled Volume "Band4" on device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst) ... 17-Jan 13:51 bareos-sd JobId 81: Releasing device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst). SD Bytes Written: 344,018,607,484 (344.0 GB) Second job: 17-Jan 13:54 bareos-sd JobId 83: Volume "Band4" previously written, moving to end of data. 17-Jan 14:39 bareos-sd JobId 83: Ready to append to end of Volume "Band4" at file=65. ... 17-Jan 14:39 bareos-sd JobId 83: Releasing device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst). SD Bytes Written: 140,473,627 (140.4 MB) Third job: 17-Jan 14:42 bareos-sd JobId 85: Volume "Band4" previously written, moving to end of data. 17-Jan 15:27 bareos-sd JobId 85: Ready to append to end of Volume "Band4" at file=66. ... 17-Jan 15:32 bareos-sd JobId 84: Releasing device "FileStorage" (/home/.bareos/backup). SD Bytes Written: 9,954,169,360 (9.954 GB) Fourth job: 17-Jan 15:33 bareos-sd JobId 87: Volume "Band4" previously written, moving to end of data. 17-Jan 16:20 bareos-sd JobId 87: Ready to append to end of Volume "Band4" at file=68. ... 17-Jan 16:20 bareos-sd JobId 87: Releasing device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst). SD Bytes Written: 141,727,215 (141.7 MB) Everything works fine up to this point. The file size on the tape is 5GB (Maximum File Size = 5G). So the next job should be attached at file number 69. Fifth job: 18-Jan 11:00 bareos-sd JobId 92: Volume "Band4" previously written, moving to end of data. 18-Jan 12:03 bareos-sd JobId 92: Error: Unable to position to end of data on device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst): ERR=backends/generic_tape_device.cc:496 read error on "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst). ERR=Eingabe-/Ausgabefehler. 18-Jan 12:03 bareos-sd JobId 92: Marking Volume "Band4" in Error in Catalog. This fails with an input/output error. Possibly no EOD marker was written during the fourth job. Neither "mtst -f /dev/nst0 eod" nor "echo eod | btape" find EOD, they abort with error and the tape is read to the physical end. Complete reading of the tape with "echo scanblocks | btape" works absolutely correct up to file number 68, different groups of blocks and one EOF marker each are read. In file number 69 no EOF is read, instead the drive keeps reading until the end of the medium. ... 1 block of 64508 bytes in file 66 2 blocks of 64512 bytes in file 66 1 block of 64508 bytes in file 66 23166 blocks of 64512 bytes in file 67 End of File mark. 43547 blocks of 64512 bytes in file 67 1 block of 64504 bytes in file 67 14277 blocks of 64512 bytes in file 67 1 block of 64506 bytes in file 67 3209 blocks of 64512 bytes in file 67 1 block of 64509 bytes in file 67 8889 blocks of 64512 bytes in file 67 1 block of 64510 bytes in file 67 222 blocks of 64512 bytes in file 67 1 block of 64502 bytes in file 67 1046 blocks of 64512 bytes in file 67 1 block of 33330 bytes in file 68 End of File mark. 2198 blocks of 64512 bytes in file 68 1 block of 35367 bytes in file 69 End of File mark. (At this point, nothing more happens until the end of the tape. Please note that in the log of btape for whatever reason apparently the first line of a new file and the EOF marker of the previous file are swapped, so the last EOF marker here belongs to file number 68). Any ideas? |
|
As an attempt to narrow down the problem, I wrote an EOF marker to file number 69 with mtst: miraculix:~ # mtst -f /dev/nst0 rewind miraculix:~ # mtst -f /dev/nst0 status SCSI 2 tape drive: File number=0, block number=0, partition=0. Tape block size 0 bytes. Density code 0x44 (LTO-3). Soft error count since last status=0 General status bits on (41010000): BOT ONLINE IM_REP_EN miraculix:~ # time mtst -f /dev/nst0 fsf 69 real 0m29.927s user 0m0.002s sys 0m0.001s miraculix:~ # mtst -f /dev/nst0 status SCSI 2 tape drive: File number=69, block number=0, partition=0. Tape block size 0 bytes. Density code 0x44 (LTO-3). Soft error count since last status=0 General status bits on (81010000): EOF ONLINE IM_REP_EN miraculix:~ # mtst -f /dev/nst0 weof miraculix:~ # mtst -f /dev/nst0 status SCSI 2 tape drive: File number=70, block number=0, partition=0. Tape block size 0 bytes. Density code 0x44 (LTO-3). Soft error count since last status=0 General status bits on (81010000): EOF ONLINE IM_REP_EN miraculix:~ # mtst -f /dev/nst0 rewind Note the extreme difference in required time for spacing forward to file number 69: miraculix:~ # time echo -e "status\nfsf 69\nstatus\n" | btape TapeStorageLTO3 Tape block granularity is 1024 bytes. btape: stored/butil.cc:306-0 Using device: "TapeStorageLTO3" for writing. btape: stored/btape.cc:490-0 open device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst): OK * Bareos status: file=0 block=0 Device status: BOT ONLINE IM_REP_EN file=0 block=0 Device status: TAPE BOT ONLINE IMMREPORT. ERR= *btape: stored/btape.cc:1774-0 Forward spaced 69 files. * EOF Bareos status: file=69 block=0 Device status: EOF ONLINE IM_REP_EN file=69 block=0 Device status: TAPE EOF ONLINE IMMREPORT. ERR= ** real 48m8.811s user 0m0.006s sys 0m0.014s miraculix:~ # After writing the EOF marker, btape "scanblocks" works as expected: ... 23166 blocks of 64512 bytes in file 67 End of File mark. 43547 blocks of 64512 bytes in file 67 1 block of 64504 bytes in file 67 14277 blocks of 64512 bytes in file 67 1 block of 64506 bytes in file 67 3209 blocks of 64512 bytes in file 67 1 block of 64509 bytes in file 67 8889 blocks of 64512 bytes in file 67 1 block of 64510 bytes in file 67 222 blocks of 64512 bytes in file 67 1 block of 64502 bytes in file 67 1046 blocks of 64512 bytes in file 67 1 block of 33330 bytes in file 68 End of File mark. 2198 blocks of 64512 bytes in file 68 1 block of 35367 bytes in file 69 End of File mark. Total files=69, blocks=5495758, bytes = 354,542,114,821 btape "eod" works as well: *eod btape: stored/btape.cc:619-0 Moved to end of medium. All in all, it seems to me that under circumstances that are not yet clear to me, sometimes no EOF is written on the tape. Where am I wrong here? |
|
Starting a migration job on this "repaired" tape triggers two migration worker jobs, the first of them works well, the second fails, and I don't understand why. First job: 18-Jan 23:29 bareos-sd JobId 98: Volume "Band4" previously written, moving to end of data. 19-Jan 00:17 bareos-sd JobId 98: Ready to append to end of Volume "Band4" at file=69. 19-Jan 00:17 bareos-sd JobId 97: Releasing device "FileStorage" (/home/.bareos/backup). SD Bytes Written: 247,515,896 (247.5 MB) Second job: 19-Jan 00:18 bareos-sd JobId 100: Volume "Band4" previously written, moving to end of data. 19-Jan 01:06 bareos-sd JobId 100: Error: Bareos cannot write on tape Volume "Band4" because: The number of files mismatch! Volume=69 Catalog=70 19-Jan 01:06 bareos-sd JobId 100: Marking Volume "Band4" in Error in Catalog. Why does the second job still find the end of the tape at file number 69, although this file was already written in the first job? EOD should be at file number 70, as it is also noted in the catalog. Where is my error? |
|
Just a quick note having Appending another backup on an already used medium fails with kernel: [586704.090320] st 8:0:0:0: [st0] Sense Key : Medium Error [current]. kernel: [586704.090327] st 8:0:0:0: [st0] Add. Sense: Recorded entity not found Means hardware trouble, be it the medium (tape) the drive or some other component in the scsi chain. They are never fun to debug. |
|
The hardware is completely unchanged. HBA, drive and tapes are the same. They are even still in the same place, only the HBA is now in a different computer. To be on the safe side, I will rebuild everything and run a test in the old system. This worked until a week ago for several years completely without problems, but just with Bacula. I am surprised about the lack of an EOF marker after some migration jobs. |
|
Sorry, I was unfortunately busy in the meantime, therefore the long response time. I have just done the test and rebuilt everything in the old system, there it runs as expected completely without problems. After the renewed change into the new system it runs now however also here perfectly. So it was probably really a problem with the LC cabling. So can be closed, thanks for the help. |
|
Hardware problem. | |
I think I have found the real cause. I use an after-job script which shuts down the tape drive after the migration. For this I check after 30s waiting time if there are more jobs in the queue and only if there are no more waiting or running jobs, the drive is switched off. echo "Checking for pending bacula jobs..." sleep 30 if echo "status dir" | /usr/sbin/bconsole | /usr/bin/grep "^ " | /usr/bin/egrep -q "(is waiting|is running)"; then echo "Pending bacula jobs found, leaving tape device alone!" else echo "Switching off tape device..." $DEBUG $SISPMCTLBIN -qf 1 fi Apparently with Bareos, the processing of the jobs is more concurrent than Bacula, because since I temporarily suspended the shutdown of the drive, no more MTEOM errors occur. So I suspect that sometimes the drive was already powered off while the storage daemon was still in the process of writing the last data to the drive. Of course, this also meant that no EOF was written. Is it possible that the Director reports jobs as finished while the SD is still writing? |
|
closing old issues with no activities before migrating to github. we have parameter like preferred mount media which will try to keep the drive as busy as it can. so when a job is declared finished by the director, usually the sd has finished to write to the tape. |
|
Date Modified | Username | Field | Change |
---|---|---|---|
2022-01-17 17:06 | DemoFreak | New Issue | |
2022-01-17 17:06 | DemoFreak | Tag Attached: storage MTEOM | |
2022-01-18 18:38 | DemoFreak | Note Added: 0004478 | |
2022-01-18 19:25 | DemoFreak | Note Added: 0004479 | |
2022-01-18 19:25 | DemoFreak | Note Edited: 0004479 | |
2022-01-18 20:25 | DemoFreak | Note Edited: 0004479 | |
2022-01-18 20:30 | DemoFreak | Note Edited: 0004479 | |
2022-01-18 22:20 | DemoFreak | Note Edited: 0004479 | |
2022-01-18 23:14 | DemoFreak | Note Edited: 0004479 | |
2022-01-18 23:15 | DemoFreak | Note Edited: 0004478 | |
2022-01-18 23:16 | DemoFreak | Note View State: 0004479: public | |
2022-01-19 01:35 | DemoFreak | Note Added: 0004480 | |
2022-01-20 17:14 | bruno-at-bareos | Note Added: 0004481 | |
2022-01-20 17:48 | DemoFreak | Note Added: 0004482 | |
2022-02-19 04:21 | DemoFreak | Note Added: 0004519 | |
2022-02-19 04:22 | DemoFreak | Note Edited: 0004519 | |
2022-02-19 04:23 | DemoFreak | Note Edited: 0004519 | |
2022-02-21 09:40 | bruno-at-bareos | Assigned To | => bruno-at-bareos |
2022-02-21 09:40 | bruno-at-bareos | Status | new => resolved |
2022-02-21 09:40 | bruno-at-bareos | Resolution | open => no change required |
2022-02-21 09:40 | bruno-at-bareos | Note Added: 0004520 | |
2022-03-30 12:14 | DemoFreak | Status | resolved => new |
2022-03-30 12:14 | DemoFreak | Resolution | no change required => reopened |
2022-03-30 12:14 | DemoFreak | Note Added: 0004556 | |
2024-03-25 15:10 | bruno-at-bareos | Note Added: 0005866 | |
2024-03-25 15:10 | bruno-at-bareos | Status | new => closed |