View Issue Details
Field | Value
---|---
ID | 0001231
Project | bareos-core
Category | storage daemon
View Status | public
Date Submitted | 2020-04-20 16:52
Last Update | 2022-12-08 16:23
Reporter | mayakov
Assigned To | arogge
Priority | high
Severity | crash
Reproducibility | random
Status | closed
Resolution | fixed
Platform | Linux
OS | CentOS
OS Version | 7
Product Version | 18.2.5
Summary | 0001231: Tasks randomly freeze when using S3 buckets in "terminated" status

Description | We recently switched to using S3 buckets and ran into a problem with backup jobs freezing. Example:

Running Jobs:
Console connected at 20-Apr-20 12:03
Console connected at 20-Apr-20 12:03
 JobId  Level    Name                                                       Status
======================================================================
  2680  Full     backupcrm.action-crm.local-backup.2020-04-19_10.05.00_33  is running
  2682  Full     buhsoft-backup.2020-04-19_21.00.00_36                     is running
  2725  Increme  schl-sql-backup.2020-04-20_02.00.00_19                    has terminated
  2734  Full     ceph-armseller-backup.2020-04-20_04.05.00_28              is running
  2741  Full     pg-service-1cont-backup.2020-04-20_12.00.00_28            is running

JobId 2725 stays in this status for more than 10 hours.

Steps To Reproduce | Jobs are started by the scheduler. Randomly, one or more of them may freeze in the "terminated" status.

Additional Information | Job log for the problem job:

20-Apr 02:00 bareos-dir JobId 2725: Start Backup JobId 2725, Job=schl-sql-backup.2020-04-20_02.00.00_19
20-Apr 02:00 bareos-dir JobId 2725: Connected Storage daemon at bareos01.backup.infra.msk3.sl.amedia.tech:9103, encryption: PSK-AES256-CBC-SHA
20-Apr 02:00 bareos-dir JobId 2725: Created new Volume "schl-sql-backupVol-1061" in catalog.
20-Apr 02:00 bareos-dir JobId 2725: Using Device "schl-sql-backupDevice" to write.
20-Apr 02:00 bareos-dir JobId 2725: Connected Client: backup.hv.amedia.tech at backup.hv.amedia.tech:9102, encryption: None
20-Apr 02:00 bareos-dir JobId 2725: Handshake: Cleartext
20-Apr 02:00 bareos-dir JobId 2725: Encryption: None
20-Apr 02:01 bareos-sd JobId 2725: Labeled new Volume "schl-sql-backupVol-1061" on device "schl-sql-backupDevice" (S3).
20-Apr 02:01 bareos-sd JobId 2725: Wrote label to prelabeled Volume "schl-sql-backupVol-1061" on device "schl-sql-backupDevice" (S3)
20-Apr 02:01 bareos-dir JobId 2725: Max Volume jobs=1 exceeded. Marking Volume "schl-sql-backupVol-1061" as Used.
20-Apr 02:51 bareos-sd JobId 2725: Releasing device "schl-sql-backupDevice" (S3).

Then I try to cancel the job:

cancel JobId=2725

In the Director status the job is marked "Cancel":

Running Jobs:
Console connected at 20-Apr-20 17:12
 JobId  Level  Name                                          Status
======================================================================
  2682  Full   buhsoft-backup.2020-04-19_21.00.00_36         is running
  2734  Full   ceph-armseller-backup.2020-04-20_04.05.00_28  is running
====

Terminated Jobs:
 JobId  Level    Files    Bytes    Status   Finished         Name
====================================================================
  2739  Full        18    137.9 G  OK       20-Apr-20 07:32  ss1-npd-backup
  2737  Full         2    24.09 G  OK       20-Apr-20 07:56  mysql-az02-goszakaz-backup
  2736  Full         2    29.73 G  OK       20-Apr-20 08:15  mysql-ap02-goszakaz-backup
  2740  Full        39    61.08 G  OK       20-Apr-20 11:14  ss2-srv17-backup
  2725  Incr       835    167.0 G  Cancel   20-Apr-20 12:07  schl-sql-backup

Next, check the Storage status:

*status storage=schl-sql-backupStorage
Connecting to Storage daemon schl-sql-backupStorage at bareos01.backup.infra....:9103

bareos-sd Version: 18.2.5 (30 January 2019) Linux-4.4.92-6.18-default redhat CentOS Linux release 7.6.1810 (Core)
Daemon started 18-Apr-20 20:09. Jobs: run=137, running=2, bareos.org build binary
 Heap: heap=331,776 smbytes=420,855,257 max_bytes=2,834,647,307 bufs=2,073 max_bufs=2,520
 Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8 mode=0 bwlimit=0kB/s

Running Jobs:
Writing: Full Backup job buhsoft-backup JobId=2682 Volume="buhsoft-backupVol-0326"
    pool="buhsoft-backupPool" device="buhsoft-backupDevice" (S3)
    spooling=0 despooling=0 despool_wait=0
    Files=280,608 Bytes=6,232,850,155,858 AveBytes/sec=114,582,893 LastBytes/sec=114,205,268
    FDReadSeqNo=97,439,276 in_msg=96631753 out_msg=5 fd=125
Writing: Incremental Backup job schl-sql-backup JobId=2725 Volume="schl-sql-backupVol-1061"
    pool="schl-sql-backupPool" device="schl-sql-backupDevice" (S3)
    spooling=0 despooling=0 despool_wait=0
    Files=835 Bytes=167,042,516,995 AveBytes/sec=0 LastBytes/sec=0
    FDReadSeqNo=2,555,971 in_msg=2553486 out_msg=9 fd=156
Writing: Full Backup job ceph-armseller-backup JobId=2734 Volume="ceph-armseller-backupVol-1069"
    pool="ceph-armseller-backupPool" device="ceph-armseller-backupDevice" (S3)
    spooling=0 despooling=0 despool_wait=0
    Files=1,570,915 Bytes=562,719,840,061 AveBytes/sec=12,607,520 LastBytes/sec=11,383,272
    FDReadSeqNo=18,690,064 in_msg=15189896 out_msg=5 fd=128
====

Jobs waiting to reserve a drive:
====

Terminated Jobs:
 JobId  Level    Files    Bytes    Status   Finished         Name
===================================================================
  2738  Full         1    2.325 G  OK       20-Apr-20 07:01  id2-redismsk7-backup
  2739  Full        18    137.9 G  OK       20-Apr-20 07:32  ss1-npd-backup
  2737  Full         2    24.09 G  OK       20-Apr-20 07:56  mysql-az02-goszakaz-backup
  2736  Full         2    29.73 G  OK       20-Apr-20 08:15  mysql-ap02-goszakaz-backup
  2740  Full        39    61.08 G  OK       20-Apr-20 11:14  ss2-srv17-backup
  2742  Incr         0          0  Cancel   20-Apr-20 12:23  schl-sql-backup
  2680  Full       176    14.10 T  OK       20-Apr-20 14:25  backupcrm.action-crm.local-backup
  2741  Full         4    35.23 G  OK       20-Apr-20 14:53  pg-service-1cont-backup
  2743  Incr         0          0  Cancel   20-Apr-20 15:40  schl-sql-backup
  2744  Incr         0          0  Cancel   20-Apr-20 17:04  schl-sql-backup
====

Device status:
Device "schl-sql-backupDevice" (S3) is mounted with:
    Volume:      schl-sql-backupVol-1061
    Pool:        schl-sql-backupPool
    Media type:  S3_Object1
Backend connection is not working.
Inflight chunks: 0
No pending IO flush requests.
Configured device capabilities:
  EOF BSR BSF FSR FSF EOM !REM RACCESS AUTOMOUNT LABEL !ANONVOLS !ALWAYSOPEN
Device state:
  OPENED !TAPE LABEL !MALLOC APPEND !READ !EOT !WEOT !EOF !NEXTVOL !SHORT MOUNTED
  num_writers=1 reserves=0 block=0
Attached Jobs: 2725
Device parameters:
  Archive name: S3 Device name: schl-sql-backupDevice
  File=38 block=3957652387
  Min block=64512 Max block=64512
Total Bytes=167,166,409,636 Blocks=2,591,246 Bytes/block=64,511
Positioned at File=38 Block=3,957,652,387
==
====

Used Volume status:
schl-sql-backupVol-1061 on device "schl-sql-backupDevice" (S3)
    Reader=0 writers=1 reserves=0 volinuse=1
====
====

An attempt to cancel the job achieves nothing:

*cancel storage=schl-sql-backupStorage jobid=2725
3000 JobId=2725 Job="schl-sql-backup.2020-04-20_02.00.00_19" marked to be canceled.

When I try to re-run the backup (or a restore) on this storage, the new job freezes in waiting status:

2745 Increme  schl-sql-backup.2020-04-20_17.20.21_52 is waiting on Storage "schl-sql-backupStorage"

The configuration of all backups is the same.

We use a separate Storage, Pool, Device and bucket for each job:

Device {
  Name = schl-sql-backupDevice
  Media Type = S3_Object1
  Archive Device = S3 Object Storage
  Device Options = "profile=/etc/bareos/bareos-sd.d/device/droplet/droplet.profile,bucket=schl-sql-backup,chunksize=100M"
  Device Type = droplet
  LabelMedia = yes;
  Random Access = yes;
  AutomaticMount = yes;
  RemovableMedia = no;
  AlwaysOpen = no;
  Description = "S3 device"
  Maximum Concurrent Jobs = 60
}

Pool {
  Name = schl-sql-backupPool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 2 week
  Job Retention = 30 days
  File Retention = 30 days
  Maximum Volume Jobs = 1
  Maximum Volume Bytes = 1000G
  Maximum Volumes = 20
  Label Format = "schl-sql-backupVol-"
}

Storage {
  Name = schl-sql-backupStorage
  Address = bareos01.****
  Password = "****"
  Device = schl-sql-backupDevice
  Media Type = S3_Object1
  Maximum Concurrent Jobs = 60
}

Client {
  Name = backup.****
  Address = backup.*****
  Password = "*****"
  Maximum Concurrent Jobs = 60
}

FileSet {
  Name = "schl-sql-fileset"
  Include {
    Options {
      signature = MD5
    }
    File = "/mnt/storage/schl/SQL"
  }
}

Job {
  Name = "schl-sql-backup"
  JobDefs = "DefaultJob"
  Write Bootstrap = "/var/lib/bareos/schlsql.bsr"
  Client = backup.hv.amedia.tech
  FileSet = "schl-sql-fileset"
  Storage = "schl-sql-backupStorage"
  Pool = "schl-sql-backupPool"
  Schedule = "schl-sql-schedule"
}

Schedule {
  Name = "schl-sql-schedule"
  Run = Incremental mon-sat at 2:00
  Run = Full sun at 8:00
}

The problem happens randomly; all jobs (about 80) can also complete without any issue. Jobs running in parallel with the stuck one usually finish fine and get past the "Releasing device" step.

The only way to release a stuck device is to run:

systemctl restart bareos-sd.service

which kills all running jobs.

Sorry, I used Google Translate.

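For reference, the stuck job is recognizable in the `status storage` output above by its zero throughput (AveBytes/sec=0, LastBytes/sec=0) while it remains listed under Attached Jobs. Below is a minimal monitoring sketch added for illustration (not part of the original report); it assumes bconsole is installed and configured for this Director and that the status output layout matches the excerpt above.

```python
# Hedged sketch: ask the SD for its status via bconsole and flag writing jobs
# that report zero recent throughput, as JobId 2725 does in the excerpt above.
# Assumptions: bconsole is in PATH and configured for this Director; the
# "Writing: ... LastBytes/sec=..." layout matches the output shown above.
import re
import subprocess

commands = "status storage=schl-sql-backupStorage\nquit\n"
output = subprocess.run(["bconsole"], input=commands, text=True,
                        capture_output=True, check=True).stdout

# Each running job appears in a block that starts with "Writing:".
for block in output.split("Writing:")[1:]:
    if re.search(r"LastBytes/sec=0\b", block):
        first_line = block.strip().splitlines()[0]
        print("possibly stuck job:", first_line)
```
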
Tags | s3;droplet;aws;storage | ||||
Notes

andrei693 (2020-05-19): I have a similar issue, and so far the same bareos-sd restart is our only resolution as well. What happens for me is that the Director can't cancel the job on the Storage daemon. This is from the joblog:

19-May 06:47 bareos-sd JobId 3693: Releasing device "AWS_S3_1-01" (AWS S3 Storage).
19-May 06:47 bareos-dir-alxp JobId 3693: Fatal error: Director's comm line to SD dropped.

I tried to cancel the job on the storage daemon itself, but that does not work either:

*cancel storage=S3_Object_aws jobid=3693
3000 JobId=3693 Job="client1.2020-05-19_06.34.03_30" marked to be canceled.

The job stays in the same state on the SD.

mayakov (2020-06-15): @andrei693: which version, 18.2.5?
andrei693 (2020-06-15): @mayakov: yes, 18.2.5.
arogge (2021-08-18): Could you check whether the volume that has just been written contains a chunk that is one byte short? There seems to be a corner case when writing volumes that can lead to a missing byte, which eventually leaves the SD waiting indefinitely for all chunks to be completely written (which will never happen, as one byte is missing).
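
One possible way to run the check arogge suggests, sketched here for illustration rather than taken from the issue: list the chunk objects of the suspect volume and compare their sizes against the configured 100 MiB chunk size. The object layout (one object per chunk under a prefix named after the volume) and the use of boto3 against the same endpoint as the droplet profile are assumptions, not details confirmed in this report.

```python
# Hedged sketch: list the chunk objects of the suspect volume and report any
# non-final chunk that is smaller than the configured chunk size.
# Assumptions (not confirmed in this issue): boto3 can reach the same S3
# endpoint as the droplet profile, and the droplet backend stores each volume
# as objects named "<volume>/<chunk-number>" inside the bucket.
import boto3

BUCKET = "schl-sql-backup"            # bucket= from the Device Options above
VOLUME = "schl-sql-backupVol-1061"    # volume the stuck JobId 2725 wrote to
CHUNK_SIZE = 100 * 1024 * 1024        # chunksize=100M from the Device Options

s3 = boto3.client("s3")               # credentials/endpoint via env or ~/.aws config
pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET,
                                                     Prefix=VOLUME + "/")
chunks = sorted((obj for page in pages for obj in page.get("Contents", [])),
                key=lambda o: o["Key"])

for obj in chunks[:-1]:               # every chunk except the last should be full-sized
    if obj["Size"] != CHUNK_SIZE:
        print(f"{obj['Key']}: {obj['Size']} bytes "
              f"({CHUNK_SIZE - obj['Size']} bytes short)")
```
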
andrei693 (2021-08-18): Unfortunately we had to move on from that setup and can no longer test/reproduce.

Date Modified | Username | Field | Change |
---|---|---|---|
2020-04-20 16:52 | mayakov | New Issue | |
2020-04-20 16:52 | mayakov | Tag Attached: s3;droplet;aws;storage | |
2020-05-19 12:17 | andrei693 | Note Added: 0003990 | |
2020-06-15 19:44 | mayakov | Note Added: 0004009 | |
2020-06-15 22:33 | andrei693 | Note Added: 0004010 | |
2021-08-18 12:26 | arogge | Assigned To | => arogge |
2021-08-18 12:26 | arogge | Status | new => feedback |
2021-08-18 12:26 | arogge | Note Added: 0004212 | |
2021-08-18 14:40 | andrei693 | Note Added: 0004213 | |
2022-12-08 16:23 | arogge | Status | feedback => resolved |
2022-12-08 16:23 | arogge | Resolution | open => fixed |
2022-12-08 16:23 | arogge | Status | resolved => closed |