View Issue Details

ID: 0001019
Project: bareos-core
Category: director
View Status: public
Last Update: 2019-12-18 15:24
Reporter: wizhippo
Assigned To:
Priority: normal
Severity: major
Reproducibility: always
Status: closed
Resolution: fixed
Platform: x86
OS: Ubuntu
OS Version: 18.04
Product Version: 18.2.4-rc1
Summary: 0001019: Director hangs waiting for client if not available PSK
Description: With TLS Psk Require = yes, if a client is offline the director hangs, waiting with:

delllt.2018-10-07_22.00.00_24 is waiting for Client to connect (Client Initiated Connection)

The log in bareos gui shows:

2018-10-07 22:06:47 kamino-dir JobId 1337:
Try to establish a secure connection by immediate TLS handshake:
2018-10-07 22:06:47 kamino-dir JobId 1337: Fatal error: Failed to connect to client "delllt-fd".
2018-10-07 22:06:35 kamino-dir JobId 1337: Fatal error: lib/bsock_tcp.cc:139 Unable to connect to Client: delllt-fd on delllt:9102. ERR=No route to host
2018-10-07 22:03:39 kamino-dir JobId 1337: Warning: lib/bsock_tcp.cc:133 Could not connect to Client: delllt-fd on delllt:9102. ERR=No route to host
Retrying ...
2018-10-07 22:03:23 kamino-dir JobId 1337: Using Device "FileDevice-1" to write.
2018-10-07 22:03:22 kamino-dir JobId 1337: Start Backup JobId 1337, Job=delllt.2018-10-07_22.00.00_24
2018-10-07 22:03:22 kamino-dir JobId 1337: Secure connection to Storage daemon at kamino:9103 with cipher ECDHE-PSK-CHACHA20-POLY1305 established

Shouldn't the wait time out and the job simply fail?
Tags: No tags attached.

Activities

wizhippo

2018-10-09 15:12

reporter   ~0003133

Last edited: 2018-10-09 15:32

When I try to cancel the hung job, which the director still shows as running, I get:

*status dir

 JobId Level Name Status
======================================================================
  1337 Full delllt.2018-10-07_22.00.00_24 is waiting for Client to connect (Client Initiated Connection)


*can
Select Job:
     1: JobId=1337 Job=delllt.2018-10-07_22.00.00_24
Choose Job to cancel (1-21): 1
3904 Job delllt.2018-10-07_22.00.00_24 not found.


I had to restart the director.

wizhippo

2018-10-09 16:08

reporter   ~0003134

Just to add: Connection From Client To Director is not set, and I'm not sure why there is a client initiated connection.
wizhippo

2018-10-30 15:28

reporter   ~0003148

I can reproduce this by first running a job with higher priority against a host that is not online and then running the catalog backup afterwards.

The catalog backup never runs: the first job fails because the host is unavailable, but it stays listed as a running job on the director indefinitely even though it has failed. A restart of the director is required to remove the job, because trying to delete it in the console reports that the job does not exist even though it is shown as running.
r0mulux

2018-12-07 11:41

reporter   ~0003158

Last edited: 2018-12-11 11:20

Hello, I have the same issue.
Jobs seem to freeze if the machine to back up is not reachable, and the next scheduled jobs are never executed. Frozen jobs cannot be deleted. I need to restart the director each time.

flo

2018-12-17 10:55

reporter   ~0003160

Hello,

I've got the same issue. The Bareos director gets stuck when a job runs into a timeout. For an example, see the job details below.

The job cannot be cancelled manually because the director can't find the job if it's stuck.


I'm on 18.4.1, since 18.2.4rc1 has the same problem. For me this makes Bareos unusable, because there is a job that has problems nearly every night.

2018-12-17 03:13:16 bareos-dir JobId 340: Start Backup JobId 340, Job=backup-pihole-full.2018-12-17_03.00.01_02
2018-12-17 03:13:16 bareos-dir JobId 340: Connected Storage daemon at bareos:9103, encryption: ECDHE-PSK-CHACHA20-POLY1305
2018-12-17 03:13:16 bareos-dir JobId 340: Using Device "FileStorage" to write.
2018-12-17 03:13:16 bareos-dir JobId 340: Probing... (result will be saved until config reload)
2018-12-17 03:13:16 bareos-dir JobId 340: Connected Client: pihole-fd at pihole:9102, encryption: ECDHE-PSK-CHACHA20-POLY1305
2018-12-17 03:13:16 bareos-dir JobId 340: Handshake: Immediate TLS,
2018-12-17 03:13:16 bareos-dir JobId 340: Encryption: ECDHE-PSK-CHACHA20-POLY1305
2018-12-17 03:13:16 pihole-fd JobId 340: Error: lib/bsock_tcp.cc:192 BnetHost2IpAddrs() for host "bareos" failed: ERR=Temporary failure in name resolution
2018-12-17 03:13:16 pihole-fd JobId 340: Fatal error: Failed to connect to Storage daemon: bareos:9103
2018-12-17 03:13:16 bareos-dir JobId 340: Fatal error: Bad response to Storage command: wanted 2000 OK storage, got 2902 Bad storage
vincebattle

2018-12-28 14:19

reporter   ~0003162

Last edited: 2018-12-28 14:22

Here is a possible workaround to prevent a job from getting stuck when the host is unreachable.
Edit the job definition file (bareos-dir.d/jobdefs/jobName.conf) and add a command that checks the host before the job executes (property "Run Before Job").

JobDefs {
 Name = "jobName"
 Type = [...]
 Pool = [...]
 Messages = [...]
 Run Before Job = "netcat -z -w 2 %h 9102"
 [...]
}

The netcat command returns a non-zero exit code if the client is not reachable, which makes Bareos cancel the current job before it tries to connect to the client and gets stuck (the job status will be "Error").
If the client is reachable, Bareos executes the job normally.

Command arguments:
 -z     to scan port
 -w 2  to wait at most 2 seconds
 %h    is the job client address
 9102  is default port Bareos uses to connect to client
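
For systems where netcat is not installed, or where its flags differ between variants, the same pre-check could be sketched in Python. This is only an illustration of the idea above, not something Bareos ships: the script name check_client.py, the function names and the exit codes 1 and 2 are assumptions made for this example. It attempts a TCP connection with a 2 second timeout and returns a non-zero status when the client cannot be reached.

```python
#!/usr/bin/env python3
"""Pre-job reachability check (hypothetical helper, not part of Bareos)."""
import socket
import sys

DEFAULT_FD_PORT = 9102  # default client port named in the workaround


def client_reachable(host, port=DEFAULT_FD_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the name and connects; the socket is
        # closed again right away because only reachability matters here.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, timeouts and DNS failures.
        return False


def main(argv):
    """Return 0 if reachable, 1 if unreachable, 2 on a usage error."""
    if not argv or len(argv) > 2:
        print("usage: check_client.py HOST [PORT]", file=sys.stderr)
        return 2
    host = argv[0]
    port = int(argv[1]) if len(argv) == 2 else DEFAULT_FD_PORT
    if client_reachable(host, port):
        return 0
    print(f"{host}:{port} is not reachable", file=sys.stderr)
    return 1


# Only run the check when a host argument was actually passed on the command line.
if __name__ == "__main__" and sys.argv[1:]:
    sys.exit(main(sys.argv[1:]))
```

Assuming the script is saved at a path of your choice (for example /usr/local/bin/check_client.py, a made-up location) and is executable by the user the director runs as, it would replace the netcat call as Run Before Job = "/usr/local/bin/check_client.py %h 9102".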

wizhippo

2018-12-28 17:53

reporter   ~0003163

Since rc2 I have not had this issue. I'm not sure whether others have found the same.
joergs

2019-11-11 19:31

developer   ~0003622

This problem has been fixed with 18.2.4-rc2.

Issue History

Date Modified Username Field Change
2018-10-09 15:09 wizhippo New Issue
2018-10-09 15:12 wizhippo Note Added: 0003133
2018-10-09 15:32 wizhippo Note Edited: 0003133
2018-10-09 16:08 wizhippo Note Added: 0003134
2018-10-30 15:28 wizhippo Note Added: 0003148
2018-12-07 11:41 r0mulux Note Added: 0003158
2018-12-11 11:20 r0mulux Note Edited: 0003158
2018-12-17 10:55 flo Note Added: 0003160
2018-12-28 14:19 vincebattle Note Added: 0003162
2018-12-28 14:20 vincebattle Note Edited: 0003162
2018-12-28 14:21 vincebattle Note Edited: 0003162
2018-12-28 14:22 vincebattle Note Edited: 0003162
2018-12-28 14:22 vincebattle Note Edited: 0003162
2018-12-28 14:22 vincebattle Note Edited: 0003162
2018-12-28 14:22 vincebattle Note Edited: 0003162
2018-12-28 17:53 wizhippo Note Added: 0003163
2019-11-11 19:31 joergs Status new => resolved
2019-11-11 19:31 joergs Resolution open => fixed
2019-11-11 19:31 joergs Note Added: 0003622
2019-12-18 15:24 arogge Status resolved => closed