View Issue Details

ID: 0001231
Project: bareos-core
Category: storage daemon
View Status: public
Last Update: 2022-12-08 16:23
Reporter: mayakov
Assigned To: arogge
Priority: high
Severity: crash
Reproducibility: random
Status: closed
Resolution: fixed
Platform: Linux
OS: CentOS
OS Version: 7
Product Version: 18.2.5
Summary: 0001231: Tasks randomly freeze when using S3 buckets in "terminated" status
Description:
We recently switched to using S3 buckets and ran into a problem with backup jobs freezing.
Example:
Running Jobs:
Console connected at 20-Apr-20 12:03
Console connected at 20-Apr-20 12:03
 JobId Level Name Status
======================================================================
  2680 Full backupcrm.action-crm.local-backup.2020-04-19_10.05.00_33 is running
  2682 Full buhsoft-backup.2020-04-19_21.00.00_36 is running
  2725 Increme schl-sql-backup.2020-04-20_02.00.00_19 has terminated
  2734 Full ceph-armseller-backup.2020-04-20_04.05.00_28 is running
  2741 Full pg-service-1cont-backup.2020-04-20_12.00.00_28 is running

JobId 2725 has stayed in this status for more than 10 hours.
Steps To Reproduce:
Jobs are started by the scheduler. Randomly, one or more of them may freeze in the "has terminated" status.
Additional Information:
Logs for the problem job:

20-Apr 02:00 bareos-dir JobId 2725: Start Backup JobId 2725, Job=schl-sql-backup.2020-04-20_02.00.00_19
20-Apr 02:00 bareos-dir JobId 2725: Connected Storage daemon at bareos01.backup.infra.msk3.sl.amedia.tech:9103, encryption: PSK-AES256-CBC-SHA
20-Apr 02:00 bareos-dir JobId 2725: Created new Volume "schl-sql-backupVol-1061" in catalog.
20-Apr 02:00 bareos-dir JobId 2725: Using Device "schl-sql-backupDevice" to write.
20-Apr 02:00 bareos-dir JobId 2725: Connected Client: backup.hv.amedia.tech at backup.hv.amedia.tech:9102, encryption: None
20-Apr 02:00 bareos-dir JobId 2725: Handshake: Cleartext
20-Apr 02:00 bareos-dir JobId 2725: Encryption: None
20-Apr 02:01 bareos-sd JobId 2725: Labeled new Volume "schl-sql-backupVol-1061" on device "schl-sql-backupDevice" (S3).
20-Apr 02:01 bareos-sd JobId 2725: Wrote label to prelabeled Volume "schl-sql-backupVol-1061" on device "schl-sql-backupDevice" (S3)
20-Apr 02:01 bareos-dir JobId 2725: Max Volume jobs=1 exceeded. Marking Volume "schl-sql-backupVol-1061" as Used.
20-Apr 02:51 bareos-sd JobId 2725: Releasing device "schl-sql-backupDevice" (S3).

Then I try to cancel the job:
cancel JobId=2725

In the director's status output, the job is marked "Cancel":

Running Jobs:
Console connected at 20-Apr-20 17:12
 JobId Level Name Status
======================================================================
  2682 Full buhsoft-backup.2020-04-19_21.00.00_36 is running
  2734 Full ceph-armseller-backup.2020-04-20_04.05.00_28 is running
====

Terminated Jobs:
 JobId Level Files Bytes Status Finished Name
====================================================================
  2739 Full 18 137.9 G OK 20-Apr-20 07:32 ss1-npd-backup
  2737 Full 2 24.09 G OK 20-Apr-20 07:56 mysql-az02-goszakaz-backup
  2736 Full 2 29.73 G OK 20-Apr-20 08:15 mysql-ap02-goszakaz-backup
  2740 Full 39 61.08 G OK 20-Apr-20 11:14 ss2-srv17-backup
  2725 Incr 835 167.0 G Cancel 20-Apr-20 12:07 schl-sql-backup


Next, I check the storage status:

*status storage=schl-sql-backupStorage
Connecting to Storage daemon schl-sql-backupStorage at bareos01.backup.infra....:9103

bareos-sd Version: 18.2.5 (30 January 2019) Linux-4.4.92-6.18-default redhat CentOS Linux release 7.6.1810 (Core)
Daemon started 18-Apr-20 20:09. Jobs: run=137, running=2, bareos.org build binary
 Heap: heap=331,776 smbytes=420,855,257 max_bytes=2,834,647,307 bufs=2,073 max_bufs=2,520
 Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8 mode=0 bwlimit=0kB/s

Running Jobs:
Writing: Full Backup job buhsoft-backup JobId=2682 Volume="buhsoft-backupVol-0326"
    pool="buhsoft-backupPool" device="buhsoft-backupDevice" (S3)
    spooling=0 despooling=0 despool_wait=0
    Files=280,608 Bytes=6,232,850,155,858 AveBytes/sec=114,582,893 LastBytes/sec=114,205,268
    FDReadSeqNo=97,439,276 in_msg=96631753 out_msg=5 fd=125
Writing: Incremental Backup job schl-sql-backup JobId=2725 Volume="schl-sql-backupVol-1061"
    pool="schl-sql-backupPool" device="schl-sql-backupDevice" (S3)
    spooling=0 despooling=0 despool_wait=0
    Files=835 Bytes=167,042,516,995 AveBytes/sec=0 LastBytes/sec=0
    FDReadSeqNo=2,555,971 in_msg=2553486 out_msg=9 fd=156
Writing: Full Backup job ceph-armseller-backup JobId=2734 Volume="ceph-armseller-backupVol-1069"
    pool="ceph-armseller-backupPool" device="ceph-armseller-backupDevice" (S3)
    spooling=0 despooling=0 despool_wait=0
    Files=1,570,915 Bytes=562,719,840,061 AveBytes/sec=12,607,520 LastBytes/sec=11,383,272
    FDReadSeqNo=18,690,064 in_msg=15189896 out_msg=5 fd=128
====

Jobs waiting to reserve a drive:
====

Terminated Jobs:
 JobId Level Files Bytes Status Finished Name
===================================================================
  2738 Full 1 2.325 G OK 20-Apr-20 07:01 id2-redismsk7-backup
  2739 Full 18 137.9 G OK 20-Apr-20 07:32 ss1-npd-backup
  2737 Full 2 24.09 G OK 20-Apr-20 07:56 mysql-az02-goszakaz-backup
  2736 Full 2 29.73 G OK 20-Apr-20 08:15 mysql-ap02-goszakaz-backup
  2740 Full 39 61.08 G OK 20-Apr-20 11:14 ss2-srv17-backup
  2742 Incr 0 0 Cancel 20-Apr-20 12:23 schl-sql-backup
  2680 Full 176 14.10 T OK 20-Apr-20 14:25 backupcrm.action-crm.local-backup
  2741 Full 4 35.23 G OK 20-Apr-20 14:53 pg-service-1cont-backup
  2743 Incr 0 0 Cancel 20-Apr-20 15:40 schl-sql-backup
  2744 Incr 0 0 Cancel 20-Apr-20 17:04 schl-sql-backup
====

Device status:

Device "schl-sql-backupDevice" (S3) is mounted with:
    Volume: schl-sql-backupVol-1061
    Pool: schl-sql-backupPool
    Media type: S3_Object1
Backend connection is not working.
Inflight chunks: 0
No pending IO flush requests.
Configured device capabilities:
  EOF BSR BSF FSR FSF EOM !REM RACCESS AUTOMOUNT LABEL !ANONVOLS !ALWAYSOPEN
Device state:
  OPENED !TAPE LABEL !MALLOC APPEND !READ !EOT !WEOT !EOF !NEXTVOL !SHORT MOUNTED
  num_writers=1 reserves=0 block=0
Attached Jobs: 2725
Device parameters:
  Archive name: S3 Device name: schl-sql-backupDevice
  File=38 block=3957652387
  Min block=64512 Max block=64512
    Total Bytes=167,166,409,636 Blocks=2,591,246 Bytes/block=64,511
    Positioned at File=38 Block=3,957,652,387
==
====

Used Volume status:
schl-sql-backupVol-1061 on device "schl-sql-backupDevice" (S3)
    Reader=0 writers=1 reserves=0 volinuse=1
====

====

An attempt to cancel the job on the storage daemon has no effect:

*cancel storage=schl-sql-backupStorage jobid=2725
3000 JobId=2725 Job="schl-sql-backup.2020-04-20_02.00.00_19" marked to be canceled.

When we try to run the backup again, or to run a restore, the new job freezes in a waiting status:
  2745 Increme schl-sql-backup.2020-04-20_17.20.21_52 is waiting on Storage "schl-sql-backupStorage"


The configuration of all backups is the same. We use a separate storage, pool, device, and bucket for each job:

Device {
  Name = schl-sql-backupDevice
  Media Type = S3_Object1
  Archive Device = S3 Object Storage
  Device Options = "profile=/etc/bareos/bareos-sd.d/device/droplet/droplet.profile,bucket=schl-sql-backup,chunksize=100M"
  Device Type = droplet
  LabelMedia = yes;
  Random Access = yes;
  AutomaticMount = yes;
  RemovableMedia = no;
  AlwaysOpen = no;
  Description = "S3 device"
  Maximum Concurrent Jobs = 60
}
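
The profile referenced in Device Options is along these lines (a sketch only; endpoint, keys and region below are placeholders, not the real values):

# /etc/bareos/bareos-sd.d/device/droplet/droplet.profile
host = s3.example.internal
use_https = true
access_key = XXXXXXXXXXXXXXXX
secret_key = XXXXXXXXXXXXXXXX
pricing_dir = ""
backend = s3
aws_auth_sign_version = 4
aws_region = us-east-1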

Pool {
  Name = schl-sql-backupPool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 2 week
  Job Retention = 30 days
  File Retention = 30 days
  Maximum Volume Jobs = 1
  Maximum Volume Bytes = 1000G
  Maximum Volumes = 20
  Label Format = "schl-sql-backupVol-"
}

Storage {
  Name = schl-sql-backupStorage
  Address = bareos01.****
  Password = "****"
  Device = schl-sql-backupDevice
  Media Type = S3_Object1
  Maximum Concurrent Jobs = 60
}



Client {
  Name = backup.****
  Address = backup.*****
  Password = "*****"
  Maximum Concurrent Jobs = 60
}
FileSet {
  Name = "schl-sql-fileset"
  Include {
    Options {
      signature = MD5
    }
    File = "/mnt/storage/schl/SQL"
  }
}
Job {
  Name = "schl-sql-backup"
  JobDefs = "DefaultJob"
  Write Bootstrap = "/var/lib/bareos/schlsql.bsr"
  Client = backup.hv.amedia.tech
  FileSet = "schl-sql-fileset"
  Storage = "schl-sql-backupStorage"
  Pool = "schl-sql-backupPool"
  Schedule = "schl-sql-schedule"
}
Schedule {
  Name = "schl-sql-schedule"
  Run = Incremental mon-sat at 2:00
  Run = Full sun at 8:00
}

The problem happens randomly; all jobs (about 80) can also complete without problems. In parallel with the problem job, many other jobs usually run that finish fine and get past "Releasing device".

The only way to release a problem device is to run:
systemctl restart bareos-sd.service
which kills all running jobs.
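
After the restart, the killed jobs can be resubmitted from bconsole if needed, for example (the JobId here is just the example from above):

*rerun jobid=2725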


Sorry, I used Google Translate.
Tags: s3;droplet;aws;storage

Activities

andrei693
2020-05-19 12:17
reporter   ~0003990

We have a similar issue, and so far the only resolution for us is the same bareos-sd restart. What happens for me is that the director can't cancel the job on the storage daemon. This is from the joblog:
19-May 06:47 bareos-sd JobId 3693: Releasing device "AWS_S3_1-01" (AWS S3 Storage).
19-May 06:47 bareos-dir-alxp JobId 3693: Fatal error: Director's comm line to SD dropped.

Tried to cancel the job on the storage daemon itself but that does not work either:
*cancel storage=S3_Object_aws jobid=3693
3000 JobId=3693 Job="client1.2020-05-19_06.34.03_30" marked to be canceled.

The job stays in the same state on the SD.
mayakov
2020-06-15 19:44
reporter   ~0004009

@andrei693: version 18.2.5?
andrei693
2020-06-15 22:33
reporter   ~0004010

@mayakov: yes, 18.2.5.
arogge
2021-08-18 12:26
manager   ~0004212

Could you check whether the volume that has just been written contains a chunk that is one byte short?
It seems there is a corner case when writing volumes that may lead to a missing byte, which will eventually cause the SD to wait indefinitely until all chunks are completely written (which never happens, as one byte is missing).
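
One way to check this, assuming the droplet backend stores each volume as a directory of chunk objects in the bucket and an S3 client can reach the same endpoint, would be roughly:

# with chunksize=100M every chunk except the last one should be exactly 104857600 bytes;
# a chunk of 104857599 bytes would be the one-byte-short case described above
aws s3 ls --recursive s3://schl-sql-backup/schl-sql-backupVol-1061/ | awk '$3 == 104857599 { print "short chunk:", $4 }'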
andrei693
2021-08-18 14:40
reporter   ~0004213

Unfortunately we had to move on from that setup and can no longer test/reproduce.

Issue History

Date Modified Username Field Change
2020-04-20 16:52 mayakov New Issue
2020-04-20 16:52 mayakov Tag Attached: s3;droplet;aws;storage
2020-05-19 12:17 andrei693 Note Added: 0003990
2020-06-15 19:44 mayakov Note Added: 0004009
2020-06-15 22:33 andrei693 Note Added: 0004010
2021-08-18 12:26 arogge Assigned To => arogge
2021-08-18 12:26 arogge Status new => feedback
2021-08-18 12:26 arogge Note Added: 0004212
2021-08-18 14:40 andrei693 Note Added: 0004213
2022-12-08 16:23 arogge Status feedback => resolved
2022-12-08 16:23 arogge Resolution open => fixed
2022-12-08 16:23 arogge Status resolved => closed