View Issue Details

IDProjectCategoryView StatusLast Update
0001421bareos-corestorage daemonpublic2022-03-30 12:14
ReporterDemoFreak Assigned Tobruno-at-bareos  
PrioritynormalSeverityminorReproducibilityalways
Status newResolutionreopened 
Platformx86_64OSOpensuseOS VersionLeap 15.3
Product Version21.0.0 
Summary0001421: MTEOM on LTO-3 fails with Bareos 21, but works on older Bacula
DescriptionAfter migrating a file server, the backup was switched from Bacula 5.2 to Bareos 21.0. Transferring the configuration works flawlessly, everything works as desired except for the tape drive.

Appending another backup on an already used medium fails with
kernel: [586704.090320] st 8:0:0:0: [st0] Sense Key : Medium Error [current].
kernel: [586704.090327] st 8:0:0:0: [st0] Add. Sense: Recorded entity not found
the tape is marked as "Error" in the catalog.

The test with btape results consequently in a problem with EOD (MTEOM). After completing the storage configuration with
Hardware End of Medium = no
Fast Forward Space File = no
appending works, but is extremely slow, as also mentioned in the documentation.

Hardware:
- Fibre Channel: QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
- Drive 'HP Ultrium 3-SCSI Rev. L63S'

The drive and HBA were transferred from the old system to the new system without any changes.

How can I further isolate the problem?
Does Bareos work differently than Bacula 5.2 regarding EOD?
Tagsstorage MTEOM
bareos-master: impact
bareos-master: action
bareos-19.2: impact
bareos-19.2: action
bareos-18.2: impact
bareos-18.2: action
bareos-17.2: impact
bareos-17.2: action
bareos-16.2: impact
bareos-16.2: action
bareos-15.2: impact
bareos-15.2: action
bareos-14.2: impact
bareos-14.2: action
bareos-13.2: impact
bareos-13.2: action
bareos-12.4: impact
bareos-12.4: action

Activities

DemoFreak

DemoFreak

2022-01-18 18:38

reporter   ~0004478

Last edited: 2022-01-18 23:15

It seems that even the slow (software) method sometimes fails. Here is the corresponding excerpt from the log.

First job on the tape:
17-Jan 11:00 bareos-sd JobId 81: Wrote label to prelabeled Volume "Band4" on device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst)
...
17-Jan 13:51 bareos-sd JobId 81: Releasing device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst).
  SD Bytes Written: 344,018,607,484 (344.0 GB)

Second job:
17-Jan 13:54 bareos-sd JobId 83: Volume "Band4" previously written, moving to end of data.
17-Jan 14:39 bareos-sd JobId 83: Ready to append to end of Volume "Band4" at file=65.
...
17-Jan 14:39 bareos-sd JobId 83: Releasing device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst).
  SD Bytes Written: 140,473,627 (140.4 MB)

Third job:
17-Jan 14:42 bareos-sd JobId 85: Volume "Band4" previously written, moving to end of data.
17-Jan 15:27 bareos-sd JobId 85: Ready to append to end of Volume "Band4" at file=66.
...
17-Jan 15:32 bareos-sd JobId 84: Releasing device "FileStorage" (/home/.bareos/backup).
  SD Bytes Written: 9,954,169,360 (9.954 GB)

Fourth job:
17-Jan 15:33 bareos-sd JobId 87: Volume "Band4" previously written, moving to end of data.
17-Jan 16:20 bareos-sd JobId 87: Ready to append to end of Volume "Band4" at file=68.
...
17-Jan 16:20 bareos-sd JobId 87: Releasing device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst).
  SD Bytes Written: 141,727,215 (141.7 MB)

Everything works fine up to this point.
The file size on the tape is 5GB (Maximum File Size = 5G). So the next job should be attached at file number 69.

Fifth job:
18-Jan 11:00 bareos-sd JobId 92: Volume "Band4" previously written, moving to end of data.
18-Jan 12:03 bareos-sd JobId 92: Error: Unable to position to end of data on device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst): ERR=backends/generic_tape_device.cc:496 read error on "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst). ERR=Eingabe-/Ausgabefehler.
18-Jan 12:03 bareos-sd JobId 92: Marking Volume "Band4" in Error in Catalog.

This fails with an input/output error. Possibly no EOD marker was written during the fourth job.

Neither "mtst -f /dev/nst0 eod" nor "echo eod | btape" find EOD, they abort with error and the tape is read to the physical end.
Complete reading of the tape with "echo scanblocks | btape" works absolutely correct up to file number 68, different groups of blocks and one EOF marker each are read. In file number 69 no EOF is read, instead the drive keeps reading until the end of the medium.

...
1 block of 64508 bytes in file 66
2 blocks of 64512 bytes in file 66
1 block of 64508 bytes in file 66
23166 blocks of 64512 bytes in file 67
End of File mark.
43547 blocks of 64512 bytes in file 67
1 block of 64504 bytes in file 67
14277 blocks of 64512 bytes in file 67
1 block of 64506 bytes in file 67
3209 blocks of 64512 bytes in file 67
1 block of 64509 bytes in file 67
8889 blocks of 64512 bytes in file 67
1 block of 64510 bytes in file 67
222 blocks of 64512 bytes in file 67
1 block of 64502 bytes in file 67
1046 blocks of 64512 bytes in file 67
1 block of 33330 bytes in file 68
End of File mark.
2198 blocks of 64512 bytes in file 68
1 block of 35367 bytes in file 69
End of File mark.
(At this point, nothing more happens until the end of the tape. Please note that in the log of btape for whatever reason apparently the first line of a new file and the EOF marker of the previous file are swapped, so the last EOF marker here belongs to file number 68).

Any ideas?

DemoFreak

DemoFreak

2022-01-18 19:25

reporter   ~0004479

Last edited: 2022-01-18 23:14

As an attempt to narrow down the problem, I wrote an EOF marker to file number 69 with mtst:

miraculix:~ # mtst -f /dev/nst0 rewind
miraculix:~ # mtst -f /dev/nst0 status
SCSI 2 tape drive:
File number=0, block number=0, partition=0.
Tape block size 0 bytes. Density code 0x44 (LTO-3).
Soft error count since last status=0
General status bits on (41010000):
 BOT ONLINE IM_REP_EN
miraculix:~ # time mtst -f /dev/nst0 fsf 69

real 0m29.927s
user 0m0.002s
sys 0m0.001s
miraculix:~ # mtst -f /dev/nst0 status
SCSI 2 tape drive:
File number=69, block number=0, partition=0.
Tape block size 0 bytes. Density code 0x44 (LTO-3).
Soft error count since last status=0
General status bits on (81010000):
 EOF ONLINE IM_REP_EN
miraculix:~ # mtst -f /dev/nst0 weof
miraculix:~ # mtst -f /dev/nst0 status
SCSI 2 tape drive:
File number=70, block number=0, partition=0.
Tape block size 0 bytes. Density code 0x44 (LTO-3).
Soft error count since last status=0
General status bits on (81010000):
 EOF ONLINE IM_REP_EN
miraculix:~ # mtst -f /dev/nst0 rewind

Note the extreme difference in required time for spacing forward to file number 69:

miraculix:~ # time echo -e "status\nfsf 69\nstatus\n" | btape TapeStorageLTO3
Tape block granularity is 1024 bytes.
btape: stored/butil.cc:306-0 Using device: "TapeStorageLTO3" for writing.
btape: stored/btape.cc:490-0 open device "TapeStorageLTO3" (/dev/tape/by-id/scsi-350060b00002e85de-nst): OK
* Bareos status: file=0 block=0
 Device status: BOT ONLINE IM_REP_EN file=0 block=0
Device status: TAPE BOT ONLINE IMMREPORT. ERR=
*btape: stored/btape.cc:1774-0 Forward spaced 69 files.
* EOF Bareos status: file=69 block=0
 Device status: EOF ONLINE IM_REP_EN file=69 block=0
Device status: TAPE EOF ONLINE IMMREPORT. ERR=
**
real 48m8.811s
user 0m0.006s
sys 0m0.014s
miraculix:~ #

After writing the EOF marker, btape "scanblocks" works as expected:
...
23166 blocks of 64512 bytes in file 67
End of File mark.
43547 blocks of 64512 bytes in file 67
1 block of 64504 bytes in file 67
14277 blocks of 64512 bytes in file 67
1 block of 64506 bytes in file 67
3209 blocks of 64512 bytes in file 67
1 block of 64509 bytes in file 67
8889 blocks of 64512 bytes in file 67
1 block of 64510 bytes in file 67
222 blocks of 64512 bytes in file 67
1 block of 64502 bytes in file 67
1046 blocks of 64512 bytes in file 67
1 block of 33330 bytes in file 68
End of File mark.
2198 blocks of 64512 bytes in file 68
1 block of 35367 bytes in file 69
End of File mark.
Total files=69, blocks=5495758, bytes = 354,542,114,821

btape "eod" works as well:

*eod
btape: stored/btape.cc:619-0 Moved to end of medium.



All in all, it seems to me that under circumstances that are not yet clear to me, sometimes no EOF is written on the tape.

Where am I wrong here?

DemoFreak

DemoFreak

2022-01-19 01:35

reporter   ~0004480

Starting a migration job on this "repaired" tape triggers two migration worker jobs, the first of them works well, the second fails, and I don't understand why.

First job:
18-Jan 23:29 bareos-sd JobId 98: Volume "Band4" previously written, moving to end of data.
19-Jan 00:17 bareos-sd JobId 98: Ready to append to end of Volume "Band4" at file=69.
19-Jan 00:17 bareos-sd JobId 97: Releasing device "FileStorage" (/home/.bareos/backup).
  SD Bytes Written: 247,515,896 (247.5 MB)


Second job:
19-Jan 00:18 bareos-sd JobId 100: Volume "Band4" previously written, moving to end of data.
19-Jan 01:06 bareos-sd JobId 100: Error: Bareos cannot write on tape Volume "Band4" because:
The number of files mismatch! Volume=69 Catalog=70
19-Jan 01:06 bareos-sd JobId 100: Marking Volume "Band4" in Error in Catalog.

Why does the second job still find the end of the tape at file number 69, although this file was already written in the first job? EOD should be at file number 70, as it is also noted in the catalog.

Where is my error?
bruno-at-bareos

bruno-at-bareos

2022-01-20 17:14

developer   ~0004481

Just a quick note having

Appending another backup on an already used medium fails with
kernel: [586704.090320] st 8:0:0:0: [st0] Sense Key : Medium Error [current].
kernel: [586704.090327] st 8:0:0:0: [st0] Add. Sense: Recorded entity not found

Means hardware trouble, be it the medium (tape) the drive or some other component in the scsi chain.
They are never fun to debug.
DemoFreak

DemoFreak

2022-01-20 17:48

reporter   ~0004482

The hardware is completely unchanged. HBA, drive and tapes are the same. They are even still in the same place, only the HBA is now in a different computer.

To be on the safe side, I will rebuild everything and run a test in the old system. This worked until a week ago for several years completely without problems, but just with Bacula.

I am surprised about the lack of an EOF marker after some migration jobs.
DemoFreak

DemoFreak

2022-02-19 04:21

reporter   ~0004519

Last edited: 2022-02-19 04:23

Sorry, I was unfortunately busy in the meantime, therefore the long response time.

I have just done the test and rebuilt everything in the old system, there it runs as expected completely without problems.

After the renewed change into the new system it runs now however also here perfectly.

So it was probably really a problem with the LC cabling.

So can be closed, thanks for the help.
bruno-at-bareos

bruno-at-bareos

2022-02-21 09:40

developer   ~0004520

Hardware problem.
DemoFreak

DemoFreak

2022-03-30 12:14

reporter   ~0004556

I think I have found the real cause.

I use an after-job script which shuts down the tape drive after the migration. For this I check after 30s waiting time if there are more jobs in the queue and only if there are no more waiting or running jobs, the drive is switched off.

echo "Checking for pending bacula jobs..."

sleep 30

if echo "status dir" | /usr/sbin/bconsole | /usr/bin/grep "^ " | /usr/bin/egrep -q "(is waiting|is running)"; then
        echo "Pending bacula jobs found, leaving tape device alone!"
else
        echo "Switching off tape device..."
        $DEBUG $SISPMCTLBIN -qf 1
fi

Apparently with Bareos, the processing of the jobs is more concurrent than Bacula, because since I temporarily suspended the shutdown of the drive, no more MTEOM errors occur. So I suspect that sometimes the drive was already powered off while the storage daemon was still in the process of writing the last data to the drive. Of course, this also meant that no EOF was written.

Is it possible that the Director reports jobs as finished while the SD is still writing?

Issue History

Date Modified Username Field Change
2022-01-17 17:06 DemoFreak New Issue
2022-01-17 17:06 DemoFreak Tag Attached: storage MTEOM
2022-01-18 18:38 DemoFreak Note Added: 0004478
2022-01-18 19:25 DemoFreak Note Added: 0004479
2022-01-18 19:25 DemoFreak Note Edited: 0004479
2022-01-18 20:25 DemoFreak Note Edited: 0004479
2022-01-18 20:30 DemoFreak Note Edited: 0004479
2022-01-18 22:20 DemoFreak Note Edited: 0004479
2022-01-18 23:14 DemoFreak Note Edited: 0004479
2022-01-18 23:15 DemoFreak Note Edited: 0004478
2022-01-18 23:16 DemoFreak Note View State: 0004479: public
2022-01-19 01:35 DemoFreak Note Added: 0004480
2022-01-20 17:14 bruno-at-bareos Note Added: 0004481
2022-01-20 17:48 DemoFreak Note Added: 0004482
2022-02-19 04:21 DemoFreak Note Added: 0004519
2022-02-19 04:22 DemoFreak Note Edited: 0004519
2022-02-19 04:23 DemoFreak Note Edited: 0004519
2022-02-21 09:40 bruno-at-bareos Assigned To => bruno-at-bareos
2022-02-21 09:40 bruno-at-bareos Status new => resolved
2022-02-21 09:40 bruno-at-bareos Resolution open => no change required
2022-02-21 09:40 bruno-at-bareos Note Added: 0004520
2022-03-30 12:14 DemoFreak Status resolved => new
2022-03-30 12:14 DemoFreak Resolution no change required => reopened
2022-03-30 12:14 DemoFreak Note Added: 0004556