View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001415 | bareos-core | storage daemon | public | 2021-12-29 18:52 | 2021-12-30 13:57 |
Reporter | tdl | Assigned To | arogge | ||
Priority | high | Severity | crash | Reproducibility | have not tried |
Status | closed | Resolution | not fixable | ||
Platform | Linux | OS | Ubuntu | OS Version | 20.04 |
Product Version | 21.0.0 | ||||
Summary | 0001415: task bareos-sd:229464 blocked for more than 120 seconds. Code: Bad RIP value. | ||||
Description | Running a backup on an LTO4 tape with spooling of 10G and a block size of 1M. After about 450G of backup, the LTO drive is in status "Writing" but does not seem to progress. Note that I am using a fully up-to-date Ubuntu 20.04 release (the choice is missing from the drop-down in the profile of the bug tracking system). | ||||
Additional Information | 794096.555433] INFO: task bareos-sd:229464 blocked for more than 120 seconds. [794096.555485] Tainted: G I 5.4.0-91-generic 0000102-Ubuntu [794096.555526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [794096.555573] bareos-sd D 0 229464 1 0x00000000 [794096.555576] Call Trace: [794096.555587] __schedule+0x2e3/0x740 [794096.555590] schedule+0x42/0xb0 [794096.555593] schedule_timeout+0x10e/0x160 [794096.555596] wait_for_completion+0xb1/0x120 [794096.555600] ? wake_up_q+0x70/0x70 [794096.555606] write_behind_check+0x45/0x1a0 [st] [794096.555610] st_write+0x103/0xd40 [st] [794096.555614] ? ext4_file_read_iter+0x4a/0x100 [794096.555617] ? common_file_perm+0x5e/0x110 [794096.555622] ? security_file_permission+0x33/0x110 [794096.555626] __vfs_write+0x1b/0x40 [794096.555629] vfs_write+0xb9/0x1a0 [794096.555631] ksys_write+0x67/0xe0 [794096.555633] __x64_sys_write+0x1a/0x20 [794096.555638] do_syscall_64+0x57/0x190 [794096.555640] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [794096.555643] RIP: 0033:0x7f8148ebb2cf [794096.555648] Code: Bad RIP value. [794096.555650] RSP: 002b:00007f81484a6650 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [794096.555652] RAX: ffffffffffffffda RBX: 00007f81400038e0 RCX: 00007f8148ebb2cf [794096.555653] RDX: 0000000000100000 RSI: 00007f814011e4c8 RDI: 0000000000000007 [794096.555655] RBP: 00007f81484a6750 R08: 0000000000000000 R09: 00007f81484a6608 [794096.555656] R10: 00007f81484a6600 R11: 0000000000000293 R12: 00007f814011e4c8 [794096.555657] R13: 0000000000100000 R14: 0000000000000000 R15: 0000000000100000 [794459.045225] INFO: task bareos-sd:229464 blocked for more than 120 seconds. [794459.045276] Tainted: G I 5.4.0-91-generic 0000102-Ubuntu [794459.045318] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [794459.045365] bareos-sd D 0 229464 1 0x00000000 [794459.045368] Call Trace: [794459.045378] __schedule+0x2e3/0x740 [794459.045381] schedule+0x42/0xb0 [794459.045384] schedule_timeout+0x10e/0x160 [794459.045388] wait_for_completion+0xb1/0x120 [794459.045391] ? wake_up_q+0x70/0x70 [794459.045397] write_behind_check+0x45/0x1a0 [st] [794459.045401] st_write+0x103/0xd40 [st] [794459.045405] ? ext4_file_read_iter+0x4a/0x100 [794459.045409] ? common_file_perm+0x5e/0x110 [794459.045413] ? security_file_permission+0x33/0x110 [794459.045418] __vfs_write+0x1b/0x40 [794459.045420] vfs_write+0xb9/0x1a0 [794459.045423] ksys_write+0x67/0xe0 [794459.045425] __x64_sys_write+0x1a/0x20 [794459.045429] do_syscall_64+0x57/0x190 [794459.045432] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [794459.045435] RIP: 0033:0x7f8148ebb2cf [794459.045441] Code: Bad RIP value. [794459.045442] RSP: 002b:00007f81484a6650 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [794459.045444] RAX: ffffffffffffffda RBX: 00007f81400038e0 RCX: 00007f8148ebb2cf [794459.045446] RDX: 0000000000100000 RSI: 00007f814011e4c8 RDI: 0000000000000007 [794459.045447] RBP: 00007f81484a6750 R08: 0000000000000000 R09: 00007f81484a6608 [794459.045448] R10: 00007f81484a6600 R11: 0000000000000293 R12: 00007f814011e4c8 [794459.045449] R13: 0000000000100000 R14: 0000000000000000 R15: 0000000000100000 [794700.705087] INFO: task bareos-sd:229464 blocked for more than 120 seconds. [794700.705139] Tainted: G I 5.4.0-91-generic 0000102-Ubuntu [794700.705180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[794700.705227] bareos-sd D 0 229464 1 0x00000000 [794700.705230] Call Trace: [794700.705240] __schedule+0x2e3/0x740 [794700.705243] schedule+0x42/0xb0 [794700.705247] schedule_timeout+0x10e/0x160 [794700.705250] wait_for_completion+0xb1/0x120 [794700.705253] ? wake_up_q+0x70/0x70 [794700.705260] write_behind_check+0x45/0x1a0 [st] [794700.705263] st_write+0x103/0xd40 [st] [794700.705268] ? ext4_file_read_iter+0x4a/0x100 [794700.705271] ? common_file_perm+0x5e/0x110 [794700.705275] ? security_file_permission+0x33/0x110 [794700.705280] __vfs_write+0x1b/0x40 [794700.705283] vfs_write+0xb9/0x1a0 [794700.705285] ksys_write+0x67/0xe0 [794700.705287] __x64_sys_write+0x1a/0x20 [794700.705292] do_syscall_64+0x57/0x190 [794700.705294] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [794700.705297] RIP: 0033:0x7f8148ebb2cf [794700.705302] Code: Bad RIP value. [794700.705304] RSP: 002b:00007f81484a6650 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [794700.705306] RAX: ffffffffffffffda RBX: 00007f81400038e0 RCX: 00007f8148ebb2cf [794700.705308] RDX: 0000000000100000 RSI: 00007f814011e4c8 RDI: 0000000000000007 [794700.705309] RBP: 00007f81484a6750 R08: 0000000000000000 R09: 00007f81484a6608 [794700.705310] R10: 00007f81484a6600 R11: 0000000000000293 R12: 00007f814011e4c8 [794700.705311] R13: 0000000000100000 R14: 0000000000000000 R15: 0000000000100000 [795282.349291] EXT4-fs (dm-30): mounted filesystem with ordered data mode. Opts: (null) [795425.685019] INFO: task bareos-sd:229464 blocked for more than 120 seconds. [795425.685089] Tainted: G I 5.4.0-91-generic 0000102-Ubuntu [795425.685130] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [795425.685177] bareos-sd D 0 229464 1 0x00000000 [795425.685180] Call Trace: [795425.685190] __schedule+0x2e3/0x740 [795425.685193] schedule+0x42/0xb0 [795425.685196] schedule_timeout+0x10e/0x160 [795425.685200] wait_for_completion+0xb1/0x120 [795425.685203] ? wake_up_q+0x70/0x70 [795425.685209] write_behind_check+0x45/0x1a0 [st] [795425.685213] st_write+0x103/0xd40 [st] [795425.685217] ? ext4_file_read_iter+0x4a/0x100 [795425.685221] ? common_file_perm+0x5e/0x110 [795425.685225] ? security_file_permission+0x33/0x110 [795425.685230] __vfs_write+0x1b/0x40 [795425.685232] vfs_write+0xb9/0x1a0 [795425.685234] ksys_write+0x67/0xe0 [795425.685237] __x64_sys_write+0x1a/0x20 [795425.685241] do_syscall_64+0x57/0x190 [795425.685243] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [795425.685246] RIP: 0033:0x7f8148ebb2cf [795425.685252] Code: Bad RIP value. [795425.685253] RSP: 002b:00007f81484a6650 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [795425.685256] RAX: ffffffffffffffda RBX: 00007f81400038e0 RCX: 00007f8148ebb2cf [795425.685257] RDX: 0000000000100000 RSI: 00007f814011e4c8 RDI: 0000000000000007 [795425.685258] RBP: 00007f81484a6750 R08: 0000000000000000 R09: 00007f81484a6608 [795425.685260] R10: 00007f81484a6600 R11: 0000000000000293 R12: 00007f814011e4c8 [795425.685261] R13: 0000000000100000 R14: 0000000000000000 R15: 0000000000100000 [795909.004859] INFO: task bareos-sd:229464 blocked for more than 120 seconds. [795909.004912] Tainted: G I 5.4.0-91-generic 0000102-Ubuntu [795909.004953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [795909.005000] bareos-sd D 0 229464 1 0x00000000 [795909.005003] Call Trace: [795909.005013] __schedule+0x2e3/0x740 [795909.005017] schedule+0x42/0xb0 [795909.005020] schedule_timeout+0x10e/0x160 [795909.005023] wait_for_completion+0xb1/0x120 [795909.005027] ? 
wake_up_q+0x70/0x70 [795909.005033] write_behind_check+0x45/0x1a0 [st] [795909.005037] st_write+0x103/0xd40 [st] [795909.005041] ? ext4_file_read_iter+0x4a/0x100 [795909.005044] ? common_file_perm+0x5e/0x110 [795909.005049] ? security_file_permission+0x33/0x110 [795909.005053] __vfs_write+0x1b/0x40 [795909.005056] vfs_write+0xb9/0x1a0 [795909.005058] ksys_write+0x67/0xe0 [795909.005060] __x64_sys_write+0x1a/0x20 [795909.005065] do_syscall_64+0x57/0x190 [795909.005067] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [795909.005070] RIP: 0033:0x7f8148ebb2cf [795909.005076] Code: Bad RIP value. [795909.005077] RSP: 002b:00007f81484a6650 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [795909.005079] RAX: ffffffffffffffda RBX: 00007f81400038e0 RCX: 00007f8148ebb2cf [795909.005081] RDX: 0000000000100000 RSI: 00007f814011e4c8 RDI: 0000000000000007 [795909.005082] RBP: 00007f81484a6750 R08: 0000000000000000 R09: 00007f81484a6608 [795909.005083] R10: 00007f81484a6600 R11: 0000000000000293 R12: 00007f814011e4c8 [795909.005084] R13: 0000000000100000 R14: 0000000000000000 R15: 0000000000100000 | ||||
Tags | No tags attached. | ||||
Hello, thank you for your report. To be useful, you should also add any configuration related to this job, as well as the full job log. What was the status of the spooling at that time, and what was the status of the three daemons? Is this reproducible? Is it also reproducible with the stock kernel (it seems you are running a tainted one)? What is the output of lsscsi -g? Could you also confirm the partition size and filesystem type the spool is running on, and which options are used to mount it? Thanks. |
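For reference, the requested details could be gathered with commands along these lines (a rough sketch only; the spool path /var/spool/bareos is an assumption taken from this setup's SD configuration):

```sh
# SCSI topology including the generic (sg) nodes of the tape drive and changer
lsscsi -g

# Filesystem type, size and mount options of the spool area
# (assumes the spool directory is /var/spool/bareos)
findmnt --target /var/spool/bareos -o TARGET,SOURCE,FSTYPE,SIZE,OPTIONS

# Status of the three daemons while the job appears stuck
echo -e 'status director\nstatus storage\nstatus client' | bconsole
```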
|
What is the output of dmesg? This looks like a broken spooling hard drive, or something on the path to the drive being faulty (controller, cable, whatever). |
Sorry, I should have been more precise in my report. The files are now attached. The spool was writing to tape. As the storage daemon was stuck for more than 30 minutes, I had no other option than killing it (probably leaving the tape unreadable). I didn't think of taking a status of the different components; I will do better next time if it re-occurs (I have restarted a full backup, but it crashed after more than 3 hours), so I am not sure yet whether it is reproducible. One thing that might be of interest (not sure) is that I have enabled Collect Statistics on the storage. Regarding the kernel, I use the standard Ubuntu kernel. It is tainted because of "workaround for bug in platform firmware applied"; unfortunately I don't think I can do much about that. The disks are attached to a PERC H700 and the tape library is attached to an mptsas controller (see sas_dmesg for the relevant dmesg output). The spool is stored on an EXT4 partition (rw,noatime,stripe=512); it is an LVM logical volume in a VG on an mdadm RAID 5 (5 partitions + 1 spare) of 2047.5 GiB.

job log.txt (19,446 bytes)
29-Dec 14:30 bareos-dir JobId 28: Start Backup JobId 28, Job=san01-data.2021-12-29_14.30.28_20 29-Dec 14:30 bareos-dir JobId 28: Connected Storage daemon at san01:9103, encryption: TLS_CHACHA20_POLY1305_SHA256 TLSv1.3 29-Dec 14:30 bareos-dir JobId 28: Using Device "tapedrive-0" to write. 29-Dec 14:30 bareos-dir JobId 28: Connected Client: san01-fd at localhost:9102, encryption: TLS_CHACHA20_POLY1305_SHA256 TLSv1.3 29-Dec 14:30 bareos-dir JobId 28: Handshake: Immediate TLS 29-Dec 14:30 bareos-dir JobId 28: Encryption: TLS_CHACHA20_POLY1305_SHA256 TLSv1.3 29-Dec 14:30 san01-fd JobId 28: Connected Storage daemon at san01:9103, encryption: TLS_CHACHA20_POLY1305_SHA256 TLSv1.3 29-Dec 14:30 san01-fd JobId 28: Extended attribute support is enabled 29-Dec 14:30 san01-fd JobId 28: ACL support is enabled 29-Dec 14:30 bareos-sd JobId 28: Spooling data ... 29-Dec 14:32 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,112 MaxJobSpoolSize=10,737,418,240 29-Dec 14:32 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,112 bytes ... 29-Dec 14:34 bareos-sd JobId 28: Despooling elapsed time = 00:01:32, Transfer rate = 116.7 M Bytes/second 29-Dec 14:34 bareos-sd JobId 28: Spooling data again ... 29-Dec 14:36 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,102 MaxJobSpoolSize=10,737,418,240 29-Dec 14:36 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,102 bytes ... 29-Dec 14:37 bareos-sd JobId 28: Despooling elapsed time = 00:01:31, Transfer rate = 117.9 M Bytes/second 29-Dec 14:37 bareos-sd JobId 28: Spooling data again ... 29-Dec 14:39 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,120 MaxJobSpoolSize=10,737,418,240 29-Dec 14:39 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,120 bytes ... 29-Dec 14:41 bareos-sd JobId 28: Despooling elapsed time = 00:01:39, Transfer rate = 108.4 M Bytes/second 29-Dec 14:41 bareos-sd JobId 28: Spooling data again ... 29-Dec 14:43 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,102 MaxJobSpoolSize=10,737,418,240 29-Dec 14:43 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,102 bytes ... 29-Dec 14:45 bareos-sd JobId 28: Despooling elapsed time = 00:01:40, Transfer rate = 107.3 M Bytes/second 29-Dec 14:45 bareos-sd JobId 28: Spooling data again ... 29-Dec 14:49 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,035 MaxJobSpoolSize=10,737,418,240 29-Dec 14:49 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,035 bytes ... 29-Dec 14:50 bareos-sd JobId 28: Despooling elapsed time = 00:01:29, Transfer rate = 120.6 M Bytes/second 29-Dec 14:50 bareos-sd JobId 28: Spooling data again ... 29-Dec 14:54 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,104 MaxJobSpoolSize=10,737,418,240 29-Dec 14:54 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,104 bytes ... 29-Dec 14:55 bareos-sd JobId 28: Despooling elapsed time = 00:01:32, Transfer rate = 116.7 M Bytes/second 29-Dec 14:55 bareos-sd JobId 28: Spooling data again ... 29-Dec 14:57 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,117 MaxJobSpoolSize=10,737,418,240 29-Dec 14:57 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,117 bytes ... 
29-Dec 14:59 bareos-sd JobId 28: Despooling elapsed time = 00:01:33, Transfer rate = 115.4 M Bytes/second 29-Dec 14:59 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:00 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,113 MaxJobSpoolSize=10,737,418,240 29-Dec 15:00 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,113 bytes ... 29-Dec 15:02 bareos-sd JobId 28: Despooling elapsed time = 00:01:40, Transfer rate = 107.3 M Bytes/second 29-Dec 15:02 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:03 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,114 MaxJobSpoolSize=10,737,418,240 29-Dec 15:03 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,114 bytes ... 29-Dec 15:05 bareos-sd JobId 28: Despooling elapsed time = 00:01:31, Transfer rate = 117.9 M Bytes/second 29-Dec 15:05 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:06 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,109 MaxJobSpoolSize=10,737,418,240 29-Dec 15:06 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,109 bytes ... 29-Dec 15:08 bareos-sd JobId 28: Despooling elapsed time = 00:01:33, Transfer rate = 115.4 M Bytes/second 29-Dec 15:08 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:09 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,119 MaxJobSpoolSize=10,737,418,240 29-Dec 15:09 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,119 bytes ... 29-Dec 15:11 bareos-sd JobId 28: Despooling elapsed time = 00:01:33, Transfer rate = 115.4 M Bytes/second 29-Dec 15:11 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:12 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,112 MaxJobSpoolSize=10,737,418,240 29-Dec 15:12 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,112 bytes ... 29-Dec 15:14 bareos-sd JobId 28: Despooling elapsed time = 00:01:37, Transfer rate = 110.6 M Bytes/second 29-Dec 15:14 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:15 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,120 MaxJobSpoolSize=10,737,418,240 29-Dec 15:15 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,120 bytes ... 29-Dec 15:17 bareos-sd JobId 28: Despooling elapsed time = 00:01:34, Transfer rate = 114.2 M Bytes/second 29-Dec 15:17 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:19 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,113 MaxJobSpoolSize=10,737,418,240 29-Dec 15:19 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,113 bytes ... 29-Dec 15:20 bareos-sd JobId 28: Despooling elapsed time = 00:01:34, Transfer rate = 114.2 M Bytes/second 29-Dec 15:20 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:22 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,120 MaxJobSpoolSize=10,737,418,240 29-Dec 15:22 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,120 bytes ... 29-Dec 15:23 bareos-sd JobId 28: Despooling elapsed time = 00:01:36, Transfer rate = 111.8 M Bytes/second 29-Dec 15:23 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:25 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,115 MaxJobSpoolSize=10,737,418,240 29-Dec 15:25 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,115 bytes ... 
29-Dec 15:26 bareos-sd JobId 28: Despooling elapsed time = 00:01:33, Transfer rate = 115.4 M Bytes/second 29-Dec 15:26 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:28 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,102 MaxJobSpoolSize=10,737,418,240 29-Dec 15:28 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,102 bytes ... 29-Dec 15:29 bareos-sd JobId 28: Despooling elapsed time = 00:01:34, Transfer rate = 114.2 M Bytes/second 29-Dec 15:29 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:31 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,110 MaxJobSpoolSize=10,737,418,240 29-Dec 15:31 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,110 bytes ... 29-Dec 15:32 bareos-sd JobId 28: Despooling elapsed time = 00:01:33, Transfer rate = 115.4 M Bytes/second 29-Dec 15:32 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:42 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,079 MaxJobSpoolSize=10,737,418,240 29-Dec 15:42 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,079 bytes ... 29-Dec 15:44 bareos-sd JobId 28: Despooling elapsed time = 00:01:36, Transfer rate = 111.8 M Bytes/second 29-Dec 15:44 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:45 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,104 MaxJobSpoolSize=10,737,418,240 29-Dec 15:45 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,104 bytes ... 29-Dec 15:47 bareos-sd JobId 28: Despooling elapsed time = 00:01:34, Transfer rate = 114.2 M Bytes/second 29-Dec 15:47 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:48 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,105 MaxJobSpoolSize=10,737,418,240 29-Dec 15:48 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,105 bytes ... 29-Dec 15:50 bareos-sd JobId 28: Despooling elapsed time = 00:01:34, Transfer rate = 114.2 M Bytes/second 29-Dec 15:50 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:52 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,104 MaxJobSpoolSize=10,737,418,240 29-Dec 15:52 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,104 bytes ... 29-Dec 15:53 bareos-sd JobId 28: Despooling elapsed time = 00:01:31, Transfer rate = 117.9 M Bytes/second 29-Dec 15:53 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:54 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,120 MaxJobSpoolSize=10,737,418,240 29-Dec 15:54 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,120 bytes ... 29-Dec 15:56 bareos-sd JobId 28: Despooling elapsed time = 00:01:51, Transfer rate = 96.73 M Bytes/second 29-Dec 15:56 bareos-sd JobId 28: Spooling data again ... 29-Dec 15:58 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,110 MaxJobSpoolSize=10,737,418,240 29-Dec 15:58 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,110 bytes ... 29-Dec 16:00 bareos-sd JobId 28: Despooling elapsed time = 00:01:34, Transfer rate = 114.2 M Bytes/second 29-Dec 16:00 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:02 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,067 MaxJobSpoolSize=10,737,418,240 29-Dec 16:02 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,067 bytes ... 
29-Dec 16:03 bareos-sd JobId 28: Despooling elapsed time = 00:01:31, Transfer rate = 117.9 M Bytes/second 29-Dec 16:03 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:06 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,108 MaxJobSpoolSize=10,737,418,240 29-Dec 16:06 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,108 bytes ... 29-Dec 16:07 bareos-sd JobId 28: Despooling elapsed time = 00:01:36, Transfer rate = 111.8 M Bytes/second 29-Dec 16:07 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:09 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,115 MaxJobSpoolSize=10,737,418,240 29-Dec 16:09 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,115 bytes ... 29-Dec 16:11 bareos-sd JobId 28: Despooling elapsed time = 00:01:40, Transfer rate = 107.3 M Bytes/second 29-Dec 16:11 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:13 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,107 MaxJobSpoolSize=10,737,418,240 29-Dec 16:13 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,107 bytes ... 29-Dec 16:15 bareos-sd JobId 28: Despooling elapsed time = 00:01:35, Transfer rate = 113.0 M Bytes/second 29-Dec 16:15 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:17 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,120 MaxJobSpoolSize=10,737,418,240 29-Dec 16:17 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,120 bytes ... 29-Dec 16:18 bareos-sd JobId 28: Despooling elapsed time = 00:01:33, Transfer rate = 115.4 M Bytes/second 29-Dec 16:18 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:20 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,117 MaxJobSpoolSize=10,737,418,240 29-Dec 16:20 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,117 bytes ... 29-Dec 16:22 bareos-sd JobId 28: Despooling elapsed time = 00:01:41, Transfer rate = 106.3 M Bytes/second 29-Dec 16:22 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:24 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,104 MaxJobSpoolSize=10,737,418,240 29-Dec 16:24 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,104 bytes ... 29-Dec 16:25 bareos-sd JobId 28: Despooling elapsed time = 00:01:35, Transfer rate = 113.0 M Bytes/second 29-Dec 16:25 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:27 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,095 MaxJobSpoolSize=10,737,418,240 29-Dec 16:27 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,095 bytes ... 29-Dec 16:29 bareos-sd JobId 28: Despooling elapsed time = 00:01:37, Transfer rate = 110.6 M Bytes/second 29-Dec 16:29 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:31 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,092 MaxJobSpoolSize=10,737,418,240 29-Dec 16:31 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,092 bytes ... 29-Dec 16:33 bareos-sd JobId 28: Despooling elapsed time = 00:01:31, Transfer rate = 117.9 M Bytes/second 29-Dec 16:33 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:35 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,102 MaxJobSpoolSize=10,737,418,240 29-Dec 16:35 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,102 bytes ... 
29-Dec 16:36 bareos-sd JobId 28: Despooling elapsed time = 00:01:42, Transfer rate = 105.2 M Bytes/second 29-Dec 16:36 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:39 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,097 MaxJobSpoolSize=10,737,418,240 29-Dec 16:39 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,097 bytes ... 29-Dec 16:41 bareos-sd JobId 28: Despooling elapsed time = 00:01:47, Transfer rate = 100.3 M Bytes/second 29-Dec 16:41 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:43 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,101 MaxJobSpoolSize=10,737,418,240 29-Dec 16:43 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,101 bytes ... 29-Dec 16:45 bareos-sd JobId 28: Despooling elapsed time = 00:01:30, Transfer rate = 119.3 M Bytes/second 29-Dec 16:45 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:47 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,106 MaxJobSpoolSize=10,737,418,240 29-Dec 16:47 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,106 bytes ... 29-Dec 16:48 bareos-sd JobId 28: Despooling elapsed time = 00:01:37, Transfer rate = 110.6 M Bytes/second 29-Dec 16:49 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:51 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,106 MaxJobSpoolSize=10,737,418,240 29-Dec 16:51 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,106 bytes ... 29-Dec 16:53 bareos-sd JobId 28: Despooling elapsed time = 00:01:53, Transfer rate = 95.02 M Bytes/second 29-Dec 16:53 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:55 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,115 MaxJobSpoolSize=10,737,418,240 29-Dec 16:55 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,115 bytes ... 29-Dec 16:57 bareos-sd JobId 28: Despooling elapsed time = 00:01:34, Transfer rate = 114.2 M Bytes/second 29-Dec 16:57 bareos-sd JobId 28: Spooling data again ... 29-Dec 16:59 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,115 MaxJobSpoolSize=10,737,418,240 29-Dec 16:59 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,115 bytes ... 29-Dec 17:00 bareos-sd JobId 28: Despooling elapsed time = 00:01:32, Transfer rate = 116.7 M Bytes/second 29-Dec 17:00 bareos-sd JobId 28: Spooling data again ... 29-Dec 17:02 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,120 MaxJobSpoolSize=10,737,418,240 29-Dec 17:02 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,120 bytes ... 29-Dec 17:04 bareos-sd JobId 28: Despooling elapsed time = 00:01:41, Transfer rate = 106.3 M Bytes/second 29-Dec 17:04 bareos-sd JobId 28: Spooling data again ... 29-Dec 17:06 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,118 MaxJobSpoolSize=10,737,418,240 29-Dec 17:06 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,118 bytes ... 29-Dec 17:12 bareos-sd JobId 28: Despooling elapsed time = 00:06:22, Transfer rate = 28.10 M Bytes/second 29-Dec 17:12 bareos-sd JobId 28: Spooling data again ... 29-Dec 17:14 bareos-sd JobId 28: User specified Job spool size reached: JobSpoolSize=10,737,541,116 MaxJobSpoolSize=10,737,418,240 29-Dec 17:14 bareos-sd JobId 28: Writing spooled data to Volume. Despooling 10,737,541,116 bytes ... 
29-Dec 17:53 bareos-dir JobId 28: Fatal error: Network error with FD during Backup: ERR=No data available
29-Dec 17:53 bareos-dir JobId 28: Fatal error: Director's comm line to SD dropped.
29-Dec 17:53 bareos-dir JobId 28: Fatal error: No Job status returned from FD.
29-Dec 17:53 bareos-dir JobId 28: Error: Bareos bareos-dir 21.0.0 (21Dec21):
  Build OS:               Ubuntu 20.04.3 LTS
  JobId:                  28
  Job:                    san01-data.2021-12-29_14.30.28_20
  Backup Level:           Full
  Client:                 "san01-fd" 21.0.0 (21Dec21) Ubuntu 20.04.3 LTS,ubuntu
  FileSet:                "LinuxData" 2021-12-29 14:26:16
  Pool:                   "LTO4" (From user input)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "Tape" (From Pool resource)
  Scheduled time:         29-Dec-2021 14:30:05
  Start time:             29-Dec-2021 14:30:31
  End time:               29-Dec-2021 17:53:50
  Elapsed time:           3 hours 23 mins 19 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):         AVA026L4
  Volume Session Id:      4
  Volume Session Time:    1640745130
  Last Volume Bytes:      631,140,215,808 (631.1 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Error
  Bareos binary info:     bareos.org build: Get official binaries and vendor support on bareos.com
  Job triggered by:       User
  Termination:            *** Backup Error ***

sd.conf (625 bytes)
Autochanger {
  Name = "autochanger-0"
  Changer Device = /dev/tape/by-id/scsi-0IBM_3573-TL_000000000000_LL0
  Device = tapedrive-0
  Changer Command = "/usr/lib/bareos/scripts/mtx-changer %c %o %S %a %d"
}

Device {
  Name = "tapedrive-0"
  DeviceType = tape
  DriveIndex = 0
  ArchiveDevice = /dev/tape/by-id/scsi-0000000000-nst
  MediaType = LTO4
  AutoChanger = yes
  AutomaticMount = yes
  MaximumFileSize = 40GB
  Spool Directory = /var/spool/bareos
  Maximum Spool Size = 40GB
}

Director {
  Name = bareos-dir
  Password = <<REMOVED>>
}

Storage {
  Name = bareos-sd
  Maximum Concurrent Jobs = 20
}

dir.conf (1,501 bytes)
Client {
  Name = san01-fd
  Address = localhost
  Password = <<REMOVED>>
}

Director {   # define myself
  Name = bareos-dir
  QueryFile = "/usr/lib/bareos/scripts/query.sql"
  Maximum Concurrent Jobs = 2
  Password = <<REMOVED>>
  Messages = Daemon
  Auditing = yes
}

FileSet {
  Name = "LinuxData"
  Include {
    Options {
      compression=LZO
      signature = SHA1
      sparse=yes
      noatime=yes
      One FS = no
    }
    File = /data
    Exclude Dir Containing = .nobackup
  }
  Exclude {
    File = "~*"
    File = "*.bak"
    File = "*.tmp"
    File = "Thumbs.db"
  }
}

Job {
  Name = "san01-data"
  JobDefs = "DefaultJob"
  Client = "san01-fd"
  FileSet = LinuxData
}

JobDefs {
  Name = "DefaultJob"
  Type = Backup
  Storage = Tape
  Level = Incremental
  Schedule = "MonthlySchedule"
  Messages = Standard
  Pool = Default
  Spool Data = yes
  Spool Size = 10G
  Full Backup Pool = Full
  Differential Backup Pool = Diff
  Incremental Backup Pool = Incr
  Priority = 10
  Write Bootstrap = "|/usr/bin/bsmtp -h localhost -f \"\(Bareos\) \<xxx\>\" -s \"Bootstrap %j %n %l\" xxx" # (#01)
}

Pool {
  Name = LTO4
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 4 years
  Maximum Block Size = 1M
  Recycle Pool = Scratch
  Scratch Pool = Scratch
  Storage = Tape
}

Storage {
  Name = Tape
  Address = xxx
  Password = <<REMOVED>>
  Device = autochanger-0
  Media Type = LTO4
  Auto Changer = yes
  Collect Statistics = yes
}

lsscsi.txt (814 bytes)
[0:2:0:0]  disk     DELL    PERC H700     2.10  /dev/sda   /dev/sg0
[0:2:1:0]  disk     DELL    PERC H700     2.10  /dev/sdb   /dev/sg1
[0:2:2:0]  disk     DELL    PERC H700     2.10  /dev/sdc   /dev/sg2
[0:2:3:0]  disk     DELL    PERC H700     2.10  /dev/sdd   /dev/sg3
[0:2:4:0]  disk     DELL    PERC H700     2.10  /dev/sde   /dev/sg4
[0:2:5:0]  disk     DELL    PERC H700     2.10  /dev/sdf   /dev/sg5
[0:2:6:0]  disk     DELL    PERC H700     2.10  /dev/sdg   /dev/sg6
[0:2:7:0]  disk     DELL    PERC H700     2.10  /dev/sdh   /dev/sg7
[1:0:0:0]  tape     IBM     ULT3580-HH4   C7QJ  /dev/st0   /dev/sg8
[1:0:0:1]  mediumx  IBM     3573-TL       D.10  /dev/sch0  /dev/sg9
[1:0:1:0]  tape     IBM     ULT3580-HH6   J451  /dev/st1   /dev/sg10

sas_dmesg.txt (1,686 bytes)
[ 21.150105] mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 6, phy 0, sas_addr 0x5000e11123390002
[ 21.153608] scsi 1:0:0:0: Sequential-Access IBM ULT3580-HH4 C7QJ PQ: 0 ANSI: 3
[ 21.157607] scsi 1:0:0:0: Attached scsi generic sg8 type 1
[ 21.161891] scsi 1:0:0:1: Medium Changer IBM 3573-TL D.10 PQ: 0 ANSI: 5
[ 21.204734] scsi 1:0:0:1: Attached scsi generic sg9 type 8
[ 21.207521] mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 7, phy 4, sas_addr 0x5000e11123390005
[ 21.210842] scsi 1:0:1:0: Sequential-Access IBM ULT3580-HH6 J451 PQ: 0 ANSI: 6
[ 21.218098] scsi 1:0:1:0: Attached scsi generic sg10 type 1
[ 21.232023] SCSI Media Changer driver v0.25
[ 21.234217] st: Version 20160209, fixed bufsize 32768, s/g segs 256
[ 21.236112] st 1:0:0:0: Attached scsi tape st0
[ 21.237287] st 1:0:0:0: st0: try direct i/o: yes (alignment 512 B)
[ 21.259014] ch 1:0:0:1: [ch0] type #1 (mt): 0x1+1 [medium transport]
[ 21.260235] ch 1:0:0:1: [ch0] type #2 (st): 0x1000+22 [storage]
[ 21.261450] ch 1:0:0:1: [ch0] type #3 (ie): 0x10+1 [import/export]
[ 21.262606] ch 1:0:0:1: [ch0] type #4 (dt): 0x100+2 [data transfer]
[ 21.297479] ch 1:0:0:1: [ch0] dt 0x100:
[ 21.298668] ch 1:0:0:1: [ch0] ID/LUN unknown
[ 21.328708] ch 1:0:0:1: [ch0] dt 0x101:
[ 21.329857] ch 1:0:0:1: [ch0] ID/LUN unknown
[ 21.330954] ch 1:0:0:1: [ch0] INITIALIZE ELEMENT STATUS, may take some time ...
[ 21.345895] ch 1:0:0:1: [ch0] ... finished
[ 21.347057] ch 1:0:0:1: Attached scsi changer ch0
[ 21.349556] st 1:0:1:0: Attached scsi tape st1
[ 21.350954] st 1:0:1:0: st1: try direct i/o: yes (alignment 512 B)
|
|
Thanks for your support! There was no disk-related info in dmesg before the issue. The tape library was reporting the drive as "Writing". The last two dmesg entries were:

Dec 29 18:08:19 san01 kernel: [797465.087804] st 1:0:0:0: [st0] Sense Key : Medium Error [current]
Dec 29 18:08:19 san01 kernel: [797465.087808] st 1:0:0:0: [st0] Add. Sense: Write error

These appeared long after I killed the SD daemon, so I do not believe this is a real tape issue, but just a timeout? Anyway, I am restarting the backup with brand new tapes, so that should exclude the issue (even though the tape was only mounted 35 times, including 2-3 times for labeling ;-) ). As the tapes were used in a Bacula instance before, I used dd to zero the first block (first with 1M, then with 64K, but both were obviously wrong). When I labeled with Bareos I got:

Dec 28 18:31:03 san01 kernel: [712430.426902] st 1:0:0:0: [st0] Failed to read 1048576 byte block with 64512 byte transfer.
Dec 28 18:32:39 san01 kernel: [712526.195168] st 1:0:0:0: [st0] Failed to read 1048576 byte block with 64512 byte transfer.
Dec 28 18:34:04 san01 kernel: [712610.906573] st 1:0:0:0: [st0] Failed to read 1048576 byte block with 64512 byte transfer.
Dec 28 18:35:35 san01 kernel: [712702.530219] st 1:0:0:0: [st0] Failed to read 1048576 byte block with 64512 byte transfer.
Dec 28 18:37:06 san01 kernel: [712793.003572] st 1:0:0:0: [st0] Failed to read 1048576 byte block with 64512 byte transfer.

Second attempt ;-)

Dec 28 19:41:12 san01 kernel: [716639.454726] st 1:0:0:0: [st0] Failed to read 65536 byte block with 64512 byte transfer.
Dec 28 19:42:40 san01 kernel: [716727.113549] st 1:0:0:0: [st0] Failed to read 65536 byte block with 64512 byte transfer.
Dec 28 19:44:14 san01 kernel: [716820.806658] st 1:0:0:0: [st0] Failed to read 65536 byte block with 64512 byte transfer.
Dec 28 19:45:45 san01 kernel: [716912.176423] st 1:0:0:0: [st0] Failed to read 65536 byte block with 64512 byte transfer.
Dec 28 19:47:16 san01 kernel: [717002.561850] st 1:0:0:0: [st0] Failed to read 65536 byte block with 64512 byte transfer.

Not sure it is related, but when I tried to restore the first job I got an error indicating that no files were restored, even though I had selected them all; the subsequent jobs could be restored successfully. |
|
The kernel backtrace you've posted passes through the st driver's st_write() function, which is a fairly clear indication that the write request to the tape drive didn't return. As this happens in the kernel (or in the attached hardware, there is no way we can tell), there is absolutely nothing the application can do about it: Bareos calls write() and the system call never returns. Concerning cleaning out the tapes: don't use dd, just do an "mt rewind" followed by an "mt weof". |
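As a concrete illustration of that advice (just a sketch; the non-rewinding device node /dev/nst0 is assumed, adjust it to the drive actually used):

```sh
# rewind to the beginning of the tape
mt -f /dev/nst0 rewind
# write a single filemark at the start; for labelling purposes this
# effectively blanks the tape without rewriting the whole medium
mt -f /dev/nst0 weof
# rewind again so the subsequent label/write starts at block 0
mt -f /dev/nst0 rewind
```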
|
Thanks. I will check with the new tape. I am not a Linux developer, but is there a way to have a wait_for_completion with a timeout? That way, if Bareos detected that the asynchronous operation (read or write) was taking far too long, it could report it in a better way? |
No, we don't call wait_for_completion; the whole stack trace you see is in the kernel. Bareos isn't even visible in that call stack. Bareos calls write(), which is provided in userland by glibc; that performs a syscall, which is the lower end of the kernel stack trace (the "entry_SYSCALL_64_after_hwframe"). Unless the system call returns, there is absolutely nothing we can do. Also, the write() function does not allow you to pass in any timeout parameter. As I said, besides hanging there is nothing an application can do in this case. |
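For what it's worth, the hang can also be observed from userland while it happens, e.g. by attaching strace to the storage daemon (purely a diagnostic sketch, not something Bareos itself can act on):

```sh
# trace only write() calls of the running SD and all of its threads;
# a stuck tape write shows up as a write(fd, ..., 1048576) call
# that never returns until the kernel completes it or the process dies
strace -f -tt -e trace=write -p "$(pidof bareos-sd)"
```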
|
Thanks a lot for your explanation. Very much appreciated. | |
I suppose we can close this report, as it does not seem to be a Bareos issue. Note that I read a bit about async I/O, but I suppose that would be a huge change. (Would it make it easier to implement features such as continuing to spool to a file while the SD is writing to tape? In my case I think it could reduce the duration by about 45%, not taking into account potential I/O overhead on disk; if there is enough memory, the writing to tape should not require actual read I/O, since the data would still be in the cache.) |
|
I'll go ahead and close the issue. And yes, despooling while spooling is on our internal wishlist. We will not even need async I/O for that. |
|
As described, this is a kernel issue that we cannot handle in the application. |
Date Modified | Username | Field | Change |
---|---|---|---|
2021-12-29 18:52 | tdl | New Issue | |
2021-12-30 10:21 | bruno-at-bareos | Note Added: 0004432 | |
2021-12-30 11:42 | arogge | Note Added: 0004433 | |
2021-12-30 12:14 | tdl | File Added: job log.txt | |
2021-12-30 12:14 | tdl | File Added: sd.conf | |
2021-12-30 12:14 | tdl | File Added: dir.conf | |
2021-12-30 12:14 | tdl | File Added: lsscsi.txt | |
2021-12-30 12:14 | tdl | File Added: sas_dmesg.txt | |
2021-12-30 12:14 | tdl | Note Added: 0004434 | |
2021-12-30 12:30 | tdl | Note Added: 0004435 | |
2021-12-30 12:44 | arogge | Note Added: 0004436 | |
2021-12-30 12:51 | tdl | Note Added: 0004437 | |
2021-12-30 13:00 | arogge | Note Added: 0004438 | |
2021-12-30 13:06 | tdl | Note Added: 0004439 | |
2021-12-30 13:48 | tdl | Note Added: 0004440 | |
2021-12-30 13:50 | tdl | Note Edited: 0004440 | |
2021-12-30 13:56 | arogge | Note Added: 0004441 | |
2021-12-30 13:57 | arogge | Assigned To | => arogge |
2021-12-30 13:57 | arogge | Status | new => resolved |
2021-12-30 13:57 | arogge | Resolution | open => not fixable |
2021-12-30 13:57 | arogge | Note Added: 0004442 | |
2021-12-30 13:57 | arogge | Status | resolved => closed |