View Issue Details

ID: 0001070
Project: bareos-core
Category: [All Projects] storage daemon
View Status: public
Last Update: 2019-04-10 16:37
Reporter: guidoilbaldo
Assigned To: arogge
Priority: high
Severity: major
Reproducibility: always
Status: assigned
Resolution: open
Platform: Linux
OS: CentOS
OS Version: 7
Product Version: 18.2.5
Target Version:
Fixed in Version:
Summary: 0001070: DIR lost connection with SD - Broken pipe
Description:

Hello,
I recently installed Bareos on a single machine (DIR + SD) and installed the droplet plugin to send data to AWS S3.
The configuration looks fine, as bconsole reports that the storage status is OK:
```
Device status:

Device "AWS_S3_Object_Storage" (AWS S3 Storage) is not open.
Backend connection is working.
Inflight chunks: 0
No pending IO flush requests.
==
====
```
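
For reference, status output like the above can be obtained in bconsole with a command along the following lines; the storage name used here is the one defined in the Director configuration below:
```
*status storage=S3_Object
```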

However, when I launch a FULL backup job, it fails after about 15 minutes with the following error:
```
27-Mar 11:03 rabbit-1-fd JobId 3: Error: lib/bsock_tcp.cc:417 Wrote 9611 bytes to Storage daemon:172.17.0.60:9103, but only 0 accepted.
27-Mar 11:03 rabbit-1-fd JobId 3: Fatal error: filed/backup.cc:1033 Network send error to SD. ERR=Connection timed out
27-Mar 11:03 backup-1 JobId 3: Fatal error: Director's comm line to SD dropped.
27-Mar 11:03 backup-1 JobId 3: Error: Bareos backup-1 18.2.5 (30Jan19):
  Build OS: Linux-4.4.92-6.18-default redhat CentOS Linux release 7.6.1810 (Core)
  JobId: 3
  Job: backup-rabbit-1.2019-03-27_10.34.06_14
  Backup Level: Full
  Client: "rabbit-1-fd" 18.2.5 (30Jan19) Linux-4.4.92-6.18-default,redhat,CentOS Linux release 7.6.1810 (Core) ,CentOS_7,x86_64
  FileSet: "RabbitMQFileSet" 2019-03-27 09:42:15
  Pool: "Full" (From command line)
  Catalog: "MyCatalog" (From Client resource)
  Storage: "S3_Object" (From Job resource)
  Scheduled time: 27-Mar-2019 10:34:06
  Start time: 27-Mar-2019 10:48:17
  End time: 27-Mar-2019 11:03:51
  Elapsed time: 15 mins 34 secs
  Priority: 10
  FD Files Written: 24
  SD Files Written: 0
  FD Bytes Written: 95,320 (95.32 KB)
  SD Bytes Written: 691 (691 B)
  Rate: 0.1 KB/s
  Software Compression: 24.3 % (lz4)
  VSS: no
  Encryption: no
  Accurate: no
  Volume name(s):
  Volume Session Id: 2
  Volume Session Time: 1553682684
  Last Volume Bytes: 0 (0 B)
  Non-fatal FD errors: 1
  SD Errors: 0
  FD termination status: Fatal Error
  SD termination status: Error
  Bareos binary info: bareos.org build: Get official binaries and vendor support on bareos.com
  Termination: *** Backup Error ***
```

The connection between the FD on the backup client and the Bareos server looks fine: I can reach both the DIR (port 9101) and the SD (port 9103) via telnet.
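
The connectivity check above amounts to something like the following, run from the backup client, assuming the DIR and SD share the host at 172.17.0.60 as described:
```
telnet 172.17.0.60 9101   # Director
telnet 172.17.0.60 9103   # Storage daemon
```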
Steps To Reproduce: Install the Bareos 18.2.5 stack and try to perform a FULL backup job, sending data to S3.
Additional Information:

DIR config:

```
Client {
  Name = rabbit-1-fd
  Address = 172.17.0.170
  Password = ****
  Heartbeat Interval = 60
  File Retention = 30 days
  Job Retention = 6 months
  AutoPrune = yes
}

Job {
  Name = "backup-rabbit-1"
  Client = rabbit-1-fd
  JobDefs = "RabbitMQJobDef"
}

JobDefs {
  Name = "RabbitMQJobDef"
  Type = Backup
  Level = Incremental

  Client = bareos-fd

  FileSet = "RabbitMQFileSet"
  Schedule = "RabbitMQCycle"
  Storage = S3_Object

  Messages = Standard

  Pool = Incremental

  Priority = 10
  Write Bootstrap = "/var/lib/bareos/%c.bsr"

  Full Backup Pool = Full
  Differential Backup Pool = Differential
  Incremental Backup Pool = Incremental
}

Storage {
  Name = S3_Object
  Address = 172.17.0.60
  Password = ****
  Device = AWS_S3_Object_Storage
  Media Type = S3_Object1
  Heartbeat Interval = 60
}
```

SD config:

```
Device {
  Name = "AWS_S3_Object_Storage"
  Media Type = "S3_Object1"
  Archive Device = "AWS S3 Storage"
  Device Type = droplet
  Device Options = "profile=/etc/bareos/bareos-sd.d/device/droplet/aws.droplet.profile,bucket=bucket,location=eu-central-1,chunksize=100M,iothreads=10,retries=0"
  LabelMedia = yes
  Random Access = yes
  AutomaticMount = yes
  RemovableMedia = no
  AlwaysOpen = no
  Maximum Concurrent Jobs = 10
  Maximum File Size = 500M
  Maximum Spool Size = 15000M
}
```

Droplet profile (aws.droplet.profile, referenced above in Device Options):

```
use_https = true
host = s3.eu-central-1.amazonaws.com
access_key = <access_key>
secret_key = <secret_key>
pricing_dir = ""
backend = s3
aws_region = eu-central-1
aws_auth_sign_version = 4
```

Tags: aws, s3;droplet;aws;storage, storage

Activities

arogge (developer) - 2019-03-27 12:43 - ~0003301

Can you please enable tracing of the SD (maybe at level 200) and reproduce this?
guidoilbaldo (reporter) - 2019-03-27 15:22 - ~0003302

Sure, could you help me with that, or point me to where in the docs I can find out how?
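
Enabling an SD trace at that level can typically be done from bconsole roughly as follows; the storage name is the one from the configuration above, and the available debug/trace options are described in the Bareos documentation:
```
*setdebug level=200 trace=1 storage=S3_Object
```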
guidoilbaldo (reporter) - 2019-03-27 15:26 - ~0003303

OK, I found where to look.

I launched the job again and will keep you posted; in the meantime, this is the device status:

```
Device status:

Device "AWS_S3_Object_Storage" (AWS S3 Storage) is mounted with:
    Volume: Full-0001
    Pool: Full
    Media type: S3_Object1
Backend connection is working.
Inflight chunks: 0
No pending IO flush requests.
Configured device capabilities:
  EOF BSR BSF FSR FSF EOM !REM RACCESS AUTOMOUNT LABEL !ANONVOLS !ALWAYSOPEN
Device state:
  OPENED !TAPE LABEL !MALLOC APPEND !READ EOT !WEOT !EOF !NEXTVOL !SHORT MOUNTED
  num_writers=1 reserves=0 block=0
Attached Jobs: 4
Device parameters:
  Archive name: AWS S3 Storage Device name: AWS_S3_Object_Storage
  File=0 block=64712
  Min block=64512 Max block=64512
    Total Bytes=64,712 Blocks=0 Bytes/block=64,712
    Positioned at File=0 Block=64,712
```

guidoilbaldo (reporter) - 2019-03-27 16:02 - ~0003304

This time the error was "Connection timed out". I have attached the trace file for the SD, and here is the Bareos log for completeness:

```
27-Mar 14:45 backup-1 JobId 6: Using Device "AWS_S3_Object_Storage" to write.
27-Mar 14:45 backup-1 JobId 6: Probing client protocol... (result will be saved until config reload)
27-Mar 14:45 backup-1 JobId 6: Connected Client: rabbit-1-fd at 172.17.0.170:9102, encryption: PSK-AES256-CBC-SHA
27-Mar 14:45 backup-1 JobId 6: Handshake: Immediate TLS 27-Mar 14:45 backup-1 JobId 6: Encryption: PSK-AES256-CBC-SHA
27-Mar 14:45 rabbit-1-fd JobId 6: Connected Storage daemon at 172.17.0.60:9103, encryption: PSK-AES256-CBC-SHA
27-Mar 14:45 backup-1: info: src/droplet.c:127: dpl_init: PRNG has been seeded with enough data
27-Mar 14:45 rabbit-1-fd JobId 6: Extended attribute support is enabled
27-Mar 14:45 rabbit-1-fd JobId 6: ACL support is enabled
27-Mar 14:45 backup-1 JobId 6: Volume "Full-0001" previously written, moving to end of data.
27-Mar 14:45 backup-1 JobId 6: Ready to append to end of Volume "Full-0001" size=64712
27-Mar 15:00 rabbit-1-fd JobId 6: Error: lib/bsock_tcp.cc:417 Wrote 22084 bytes to Storage daemon:172.17.0.60:9103, but only 0 accepted.
27-Mar 15:00 rabbit-1-fd JobId 6: Fatal error: filed/backup.cc:1033 Network send error to SD. ERR=Connection timed out
27-Mar 15:00 backup-1 JobId 6: Fatal error: Director's comm line to SD dropped.
27-Mar 15:00 backup-1 JobId 6: Error: Bareos backup-1 18.2.5 (30Jan19):
  Build OS: Linux-4.4.92-6.18-default redhat CentOS Linux release 7.6.1810 (Core)
  JobId: 6
  Job: backup-rabbit-1.2019-03-27_14.44.59_07
  Backup Level: Full
  Client: "rabbit-1-fd" 18.2.5 (30Jan19) Linux-4.4.92-6.18-default,redhat,CentOS Linux release 7.6.1810 (Core) ,CentOS_7,x86_64
  FileSet: "RabbitMQFileSet" 2019-03-27 09:42:15
  Pool: "Full" (From command line)
  Catalog: "MyCatalog" (From Client resource)
  Storage: "S3_Object" (From Job resource)
  Scheduled time: 27-Mar-2019 14:44:59
  Start time: 27-Mar-2019 14:45:01
  End time: 27-Mar-2019 15:00:29
  Elapsed time: 15 mins 28 secs
  Priority: 10
  FD Files Written: 27
  SD Files Written: 0
  FD Bytes Written: 164,032 (164.0 KB)
  SD Bytes Written: 1,852 (1.852 KB)
  Rate: 0.2 KB/s
  Software Compression: 27.4 % (lz4)
  VSS: no
  Encryption: no
  Accurate: no
  Volume name(s):
  Volume Session Id: 1
  Volume Session Time: 1553697872
  Last Volume Bytes: 0 (0 B)
  Non-fatal FD errors: 1
  SD Errors: 0
  FD termination status: Fatal Error
  SD termination status: Error
  Bareos binary info: bareos.org build: Get official binaries and vendor support on bareos.com
  Termination: *** Backup Error ***
```

backup-1.trace (27,106 bytes)
arogge (developer) - 2019-03-28 07:53 - ~0003305

According to the trace, your Director told the SD to cancel the job:

```
backup-1 (110): stored/socket_server.cc:97-0 Conn: Hello Director backup-1 calling
backup-1 (110): stored/socket_server.cc:115-0 Got a DIR connection at 27-Mar-2019 15:00:29
backup-1 (100): include/jcr.h:320-0 Construct JobControlRecord
backup-1 (50): lib/cram_md5.cc:69-0 send: auth cram-md5 <461601013.1553698829@backup-1> ssl=1
backup-1 (100): lib/cram_md5.cc:116-0 cram-get received: auth cram-md5 <950524310.1553698829@backup-1> ssl=1
backup-1 (99): lib/cram_md5.cc:135-0 sending resp to challenge: XVkIXSZZLjpi2/+sNi/X8A
backup-1 (90): stored/dir_cmd.cc:289-0 Message channel init completed.
backup-1 (199): stored/dir_cmd.cc:300-0 <dird: cancel Job=backup-rabbit-1.2019-03-27_14.44.59_07
backup-1 (200): stored/dir_cmd.cc:318-0 Do command: cancel
```

Do you have any idea why this might have happened?
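
One way to look for the reason behind the cancel is the job log kept by the Director; in bconsole, something along these lines should show it for the failed run above:
```
*list joblog jobid=6
```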
guidoilbaldo (reporter) - 2019-03-29 08:56 - ~0003306

No, unfortunately not. I just launched the job from bareos-webui and left it to complete.
Do you think it might be some misconfiguration in the DIR or SD configs I posted in the "Additional Information" section?
guidoilbaldo (reporter) - 2019-04-01 17:31 - ~0003308

Hi,
if you have any hint as to why our configuration is not working, I would like to know, because I'm in the process of deciding whether to stick with Bareos or look for an alternative.
Thank you very much,
Stefano
guidoilbaldo (reporter) - 2019-04-10 15:23 - ~0003316

Is it possible to get any update on this?
arogge (developer) - 2019-04-10 16:37 - ~0003317

Sorry, I really have no idea.
Maybe you can take this to the mailing list, where somebody may be able to help?

Issue History

Date Modified Username Field Change
2019-03-27 12:31 guidoilbaldo New Issue
2019-03-27 12:31 guidoilbaldo Tag Attached: aws
2019-03-27 12:31 guidoilbaldo Tag Attached: s3;droplet;aws;storage
2019-03-27 12:31 guidoilbaldo Tag Attached: storage
2019-03-27 12:43 arogge Assigned To => arogge
2019-03-27 12:43 arogge Status new => feedback
2019-03-27 12:43 arogge Note Added: 0003301
2019-03-27 15:22 guidoilbaldo Note Added: 0003302
2019-03-27 15:22 guidoilbaldo Status feedback => assigned
2019-03-27 15:26 guidoilbaldo Note Added: 0003303
2019-03-27 16:02 guidoilbaldo File Added: backup-1.trace
2019-03-27 16:02 guidoilbaldo Note Added: 0003304
2019-03-28 07:53 arogge Status assigned => feedback
2019-03-28 07:53 arogge Note Added: 0003305
2019-03-29 08:56 guidoilbaldo Note Added: 0003306
2019-03-29 08:56 guidoilbaldo Status feedback => assigned
2019-04-01 17:31 guidoilbaldo Note Added: 0003308
2019-04-10 15:23 guidoilbaldo Note Added: 0003316
2019-04-10 16:37 arogge Note Added: 0003317