View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001019 | bareos-core | director | public | 2018-10-09 15:09 | 2019-12-18 15:24 |
Reporter | wizhippo | Assigned To | |||
Priority | normal | Severity | major | Reproducibility | always |
Status | closed | Resolution | fixed | ||
Platform | x86 | OS | Ubuntu | OS Version | 18.04 |
Product Version | 18.2.4-rc1 | ||||
Summary | 0001019: Director hangs waiting for client if not available PSK | ||||
Description | Using TLS Psk Require = yes, if a client is offline the director hangs, waiting with:
delllt.2018-10-07_22.00.00_24 is waiting for Client to connect (Client Initiated Connection)
The log in the Bareos GUI shows:
2018-10-07 22:06:47 kamino-dir JobId 1337: Try to establish a secure connection by immediate TLS handshake:
2018-10-07 22:06:47 kamino-dir JobId 1337: Fatal error: Failed to connect to client "delllt-fd".
2018-10-07 22:06:35 kamino-dir JobId 1337: Fatal error: lib/bsock_tcp.cc:139 Unable to connect to Client: delllt-fd on delllt:9102. ERR=No route to host
2018-10-07 22:03:39 kamino-dir JobId 1337: Warning: lib/bsock_tcp.cc:133 Could not connect to Client: delllt-fd on delllt:9102. ERR=No route to host Retrying ...
2018-10-07 22:03:23 kamino-dir JobId 1337: Using Device "FileDevice-1" to write.
2018-10-07 22:03:22 kamino-dir JobId 1337: Start Backup JobId 1337, Job=delllt.2018-10-07_22.00.00_24
2018-10-07 22:03:22 kamino-dir JobId 1337: Secure connection to Storage daemon at kamino:9103 with cipher ECDHE-PSK-CHACHA20-POLY1305 established
Shouldn't the wait time out, so that the job simply fails? | ||||
Tags | No tags attached. | ||||
When I try to cancel the hung job, the director reports that it is not found, even though it shows the job as running:
*status dir
JobId Level Name Status
======================================================================
1337 Full delllt.2018-10-07_22.00.00_24 is waiting for Client to connect (Client Initiated Connection)
*can
Select Job:
1: JobId=1337 Job=delllt.2018-10-07_22.00.00_24
Choose Job to cancel (1-21): 1
3904 Job delllt.2018-10-07_22.00.00_24 not found.
I had to restart the director. |
|
Just to add: Connection From Client To Director is not set, and I'm not sure why there is a client initiated connection. | |
I can reproduce this by first running a higher-priority job against a host that is not online and then running the catalog backup afterwards. The catalog backup never runs: the first job fails because the host is unavailable, but it remains listed as a running job on the director indefinitely. A restart of the director is required to remove the job, because trying to delete it in the console reports that the job does not exist even though it is shown as running. |
|
Hello, I have the same issue. Jobs seem to freeze if the machine to back up is not reachable, and the next scheduled jobs are never executed. Frozen jobs cannot be deleted. I need to restart the director each time. |
|
Hello, I've got the same issue. The Bareos director gets stuck when a job runs into a timeout. For an example, see the job details below. The job cannot be cancelled manually because the director can't find the job once it's stuck. I'm on 18.4.1, since 18.2.4rc1 has the same problem. For me this makes Bareos unusable, because there is a job that has problems nearly every night.
2018-12-17 03:13:16 bareos-dir JobId 340: Start Backup JobId 340, Job=backup-pihole-full.2018-12-17_03.00.01_02
2018-12-17 03:13:16 bareos-dir JobId 340: Connected Storage daemon at bareos:9103, encryption: ECDHE-PSK-CHACHA20-POLY1305
2018-12-17 03:13:16 bareos-dir JobId 340: Using Device "FileStorage" to write.
2018-12-17 03:13:16 bareos-dir JobId 340: Probing... (result will be saved until config reload)
2018-12-17 03:13:16 bareos-dir JobId 340: Connected Client: pihole-fd at pihole:9102, encryption: ECDHE-PSK-CHACHA20-POLY1305
2018-12-17 03:13:16 bareos-dir JobId 340: Handshake: Immediate TLS,
2018-12-17 03:13:16 bareos-dir JobId 340: Encryption: ECDHE-PSK-CHACHA20-POLY1305
2018-12-17 03:13:16 pihole-fd JobId 340: Error: lib/bsock_tcp.cc:192 BnetHost2IpAddrs() for host "bareos" failed: ERR=Temporary failure in name resolution
2018-12-17 03:13:16 pihole-fd JobId 340: Fatal error: Failed to connect to Storage daemon: bareos:9103
2018-12-17 03:13:16 bareos-dir JobId 340: Fatal error: Bad response to Storage command: wanted 2000 OK storage, got 2902 Bad storage |
|
Here is a possible workaround to prevent a job from getting stuck when the host is unreachable. Edit the job definition file (bareos-dir.d/jobdefs/jobName.conf) and add a command that checks the host before the job executes (property "Run Before Job").
JobDefs {
  Name = "jobName"
  Type = [...]
  Pool = [...]
  Messages = [...]
  Run Before Job = "netcat -z -w 2 %h 9102"
  [...]
}
The netcat command returns a non-zero exit code if the client is not reachable, which makes Bareos cancel the current job before it tries to connect to the client and gets stuck (the job status will be "Error"). If the client is reachable, Bareos executes the job normally.
Command arguments:
-z to scan the port
-w 2 to wait at most 2 seconds
%h is the job client address
9102 is the default port Bareos uses to connect to the client |
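Where netcat is not installed, the same pre-flight check could be done by a small script. The following is only a minimal sketch, assuming Python 3 is available on the director host; the names `client_reachable` and `main` and the script location are hypothetical and not part of Bareos. It attempts a TCP connection with a 2-second timeout and reports failure through a non-zero exit code, mirroring what `netcat -z -w 2` does in the workaround above.

```python
#!/usr/bin/env python3
"""Exit 0 if a TCP port accepts a connection within a timeout, else exit 1."""
import socket
import sys


def client_reachable(host, port=9102, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts.
        return False


def main(argv):
    """argv is [HOST] or [HOST, PORT]; returns the process exit code."""
    host = argv[0]
    port = int(argv[1]) if len(argv) > 1 else 9102  # 9102: default client port
    return 0 if client_reachable(host, port) else 1


# Only act when a host argument was supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

If the script were saved as, say, /usr/local/bin/check_client.py (an assumed path) and made executable, the directive would become Run Before Job = "/usr/local/bin/check_client.py %h". As with netcat, this only confirms that the port accepts connections at that moment; it does not address the underlying director hang.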
|
Since Rc2 I have not had this issue. Not sure if others have found the same. | |
This problem has been fixed with 18.2.4-rc2. | |
Date Modified | Username | Field | Change |
---|---|---|---|
2018-10-09 15:09 | wizhippo | New Issue | |
2018-10-09 15:12 | wizhippo | Note Added: 0003133 | |
2018-10-09 15:32 | wizhippo | Note Edited: 0003133 | |
2018-10-09 16:08 | wizhippo | Note Added: 0003134 | |
2018-10-30 15:28 | wizhippo | Note Added: 0003148 | |
2018-12-07 11:41 | r0mulux | Note Added: 0003158 | |
2018-12-11 11:20 | r0mulux | Note Edited: 0003158 | |
2018-12-17 10:55 | flo | Note Added: 0003160 | |
2018-12-28 14:19 | vincebattle | Note Added: 0003162 | |
2018-12-28 14:20 | vincebattle | Note Edited: 0003162 | |
2018-12-28 14:21 | vincebattle | Note Edited: 0003162 | |
2018-12-28 14:22 | vincebattle | Note Edited: 0003162 | |
2018-12-28 14:22 | vincebattle | Note Edited: 0003162 | |
2018-12-28 14:22 | vincebattle | Note Edited: 0003162 | |
2018-12-28 14:22 | vincebattle | Note Edited: 0003162 | |
2018-12-28 17:53 | wizhippo | Note Added: 0003163 | |
2019-11-11 19:31 | joergs | Status | new => resolved |
2019-11-11 19:31 | joergs | Resolution | open => fixed |
2019-11-11 19:31 | joergs | Note Added: 0003622 | |
2019-12-18 15:24 | arogge | Status | resolved => closed |