Bareos Bug Tracker - bareos-core
View Issue Details
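ID: 0001019
Project: bareos-core
Category: director
View Status: public
Date Submitted: 2018-10-09 15:09
Last Update: 2018-12-17 10:55
Reporter: wizhippo
Priority: normal
Severity: major
Reproducibility: always
Status: new
Resolution: open
Platform: x86
OS: Ubuntu
OS Version: 18.04
Product Version: 18.2.4-rc1
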
0001019: Director hangs waiting for client if not available PSK
With TLS Psk Require = yes, if a client is offline the director hangs, waiting with:

delllt.2018-10-07_22.00.00_24 is waiting for Client to connect (Client Initiated Connection)

The log in bareos gui shows:

2018-10-07 22:06:47 kamino-dir JobId 1337:
Try to establish a secure connection by immediate TLS handshake:
2018-10-07 22:06:47 kamino-dir JobId 1337: Fatal error: Failed to connect to client "delllt-fd".
2018-10-07 22:06:35 kamino-dir JobId 1337: Fatal error: lib/bsock_tcp.cc:139 Unable to connect to Client: delllt-fd on delllt:9102. ERR=No route to host
2018-10-07 22:03:39 kamino-dir JobId 1337: Warning: lib/bsock_tcp.cc:133 Could not connect to Client: delllt-fd on delllt:9102. ERR=No route to host
Retrying ...
2018-10-07 22:03:23 kamino-dir JobId 1337: Using Device "FileDevice-1" to write.
2018-10-07 22:03:22 kamino-dir JobId 1337: Start Backup JobId 1337, Job=delllt.2018-10-07_22.00.00_24
2018-10-07 22:03:22 kamino-dir JobId 1337: Secure connection to Storage daemon at kamino:9103 with cipher ECDHE-PSK-CHACHA20-POLY1305 established

Shouldn't the wait time out, so that the job simply fails?
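
To make the expected behavior concrete, here is a minimal Python sketch of a connect loop that retries for a bounded time and then fails the job instead of waiting forever. This is not Bareos code: connect_with_deadline, ClientUnreachable, and the 180 s / 5 s values are hypothetical names and numbers chosen for illustration only.

```python
import socket
import time


class ClientUnreachable(Exception):
    """Raised when the client could not be reached before the deadline."""


def connect_with_deadline(host, port, deadline_s=180.0, retry_interval_s=5.0,
                          connect=socket.create_connection,
                          clock=time.monotonic, sleep=time.sleep):
    """Retry a TCP connect until it succeeds or deadline_s has elapsed.

    connect, clock and sleep are injectable so the loop can be tested
    without a network (hypothetical design choice for this sketch).
    """
    start = clock()
    last_error = None
    while True:
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            # Give up: the caller can now mark the job as failed.
            raise ClientUnreachable(
                f"{host}:{port} unreachable after {deadline_s}s: {last_error}")
        try:
            # Never let a single attempt outlive the overall deadline.
            return connect((host, port),
                           timeout=min(remaining, retry_interval_s))
        except OSError as exc:  # e.g. "No route to host"
            last_error = exc
        # Wait before retrying, but not past the deadline.
        sleep(min(retry_interval_s, max(0.0, deadline_s - (clock() - start))))
```

The point is that the loop has exactly two exits, a connected socket or an exception, so the job always ends up in a final state.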
No tags attached.
Issue History
2018-10-09 15:09  wizhippo  New Issue
2018-10-09 15:12  wizhippo  Note Added: 0003133
2018-10-09 15:32  wizhippo  Note Edited: 0003133
2018-10-09 16:08  wizhippo  Note Added: 0003134
2018-10-30 15:28  wizhippo  Note Added: 0003148
2018-12-07 11:41  r0mulux   Note Added: 0003158
2018-12-11 11:20  r0mulux   Note Edited: 0003158
2018-12-17 10:55  flo       Note Added: 0003160

Notes
(0003133)
wizhippo   
2018-10-09 15:12   
(edited on: 2018-10-09 15:32)
When I try to cancel the hung job, which the director still shows as running, I get:

*status dir

 JobId Level Name Status
======================================================================
  1337 Full delllt.2018-10-07_22.00.00_24 is waiting for Client to connect (Client Initiated Connection)


*can
Select Job:
     1: JobId=1337 Job=delllt.2018-10-07_22.00.00_24
Choose Job to cancel (1-21): 1
3904 Job delllt.2018-10-07_22.00.00_24 not found.


Had to restart director.

(0003134)
wizhippo   
2018-10-09 16:08   
Just to add: Connection From Client To Director is not set, so I'm not sure why there is a client-initiated connection.
(0003148)
wizhippo   
2018-10-30 15:28   
I can reproduce this by first running a higher-priority job against a host that is offline and then running the catalog backup.

The catalog backup never runs: the first job fails because the host is unavailable, but it remains listed as a running job on the director indefinitely even though it has failed. The director has to be restarted to remove the job, because trying to delete it in the console reports that the job does not exist even though it is shown as running.
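
As an illustration of the bookkeeping I would expect (a sketch only, not how the director is actually implemented; JobTable and its methods are hypothetical names): a job that fails should always be removed from the running list, so that later jobs such as the catalog backup are not blocked behind it.

```python
class JobTable:
    """Toy model of a director's job bookkeeping (hypothetical)."""

    def __init__(self):
        self.running = {}    # job_id -> job name
        self.finished = {}   # job_id -> final status string

    def run(self, job_id, name, action):
        """Run action() for a job and always clear it from the running list."""
        self.running[job_id] = name
        try:
            action()
            status = "OK"
        except Exception as exc:
            status = f"Fatal error: {exc}"
        finally:
            # Runs on success and on failure, so no job stays "running".
            self.running.pop(job_id, None)
        self.finished[job_id] = status
        return status


if __name__ == "__main__":
    def offline_client():
        raise ConnectionError("No route to host")

    table = JobTable()
    print(table.run(1337, "delllt", offline_client))
    print(table.run(1338, "BackupCatalog", lambda: None))
    print(table.running)
```

With this shape the failed job ends up in the finished list with a fatal status, and the next job in the queue can start.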
(0003158)
r0mulux   
2018-12-07 11:41   
(edited on: 2018-12-11 11:20)
Hello, I have the same issue.
Jobs seem to freeze if the machine to back up is not reachable, and the next scheduled jobs are never executed. Frozen jobs cannot be deleted. I need to restart the director each time.

(0003160)
flo   
2018-12-17 10:55   
Hello,

I've got the same issue. The Bareos director gets stuck when a job runs into a timeout. For example, see the job details below.

The job cannot be cancelled manually because the director can't find the job if it's stuck.


I'm on 18.4.1 since 18.2.4rc1 has the same problem. For me it makes bareos unusable because there is a job that has problems nearly every night.





2018-12-17 03:13:16 bareos-dir JobId 340: Start Backup JobId 340, Job=backup-pihole-full.2018-12-17_03.00.01_02
2018-12-17 03:13:16 bareos-dir JobId 340: Connected Storage daemon at bareos:9103, encryption: ECDHE-PSK-CHACHA20-POLY1305
2018-12-17 03:13:16 bareos-dir JobId 340: Using Device "FileStorage" to write.
2018-12-17 03:13:16 bareos-dir JobId 340: Probing... (result will be saved until config reload)
2018-12-17 03:13:16 bareos-dir JobId 340: Connected Client: pihole-fd at pihole:9102, encryption: ECDHE-PSK-CHACHA20-POLY1305
2018-12-17 03:13:16 bareos-dir JobId 340: Handshake: Immediate TLS,
2018-12-17 03:13:16 bareos-dir JobId 340: Encryption: ECDHE-PSK-CHACHA20-POLY1305
2018-12-17 03:13:16 pihole-fd JobId 340: Error: lib/bsock_tcp.cc:192 BnetHost2IpAddrs() for host "bareos" failed: ERR=Temporary failure in name resolution
2018-12-17 03:13:16 pihole-fd JobId 340: Fatal error: Failed to connect to Storage daemon: bareos:9103
2018-12-17 03:13:16 bareos-dir JobId 340: Fatal error: Bad response to Storage command: wanted 2000 OK storage, got 2902 Bad storage
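
In this log the failing step is name resolution on the client: pihole-fd cannot resolve the storage host "bareos". A small Python sketch for checking that from the client machine follows; resolve_host is a hypothetical helper, and the host and port are simply taken from the log above.

```python
import socket


def resolve_host(host, port, resolver=socket.getaddrinfo):
    """Return the sorted unique IP addresses for host, or [] if lookup fails.

    resolver is injectable so the function can be exercised without DNS.
    """
    try:
        infos = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Covers errors such as "Temporary failure in name resolution".
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    addresses = resolve_host("bareos", 9103)
    if addresses:
        print("bareos resolves to:", ", ".join(addresses))
    else:
        print("bareos does not resolve from this machine")
```

An empty result on the client would match the pihole-fd error in the log. It would not explain why the director stays stuck afterwards.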