View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0001192||bareos-core||[All Projects] director||public||2020-02-12 22:45||2020-02-27 11:36|
|Fixed in Version|
|Summary||0001192: Authorization key rejected by Storage daemon since upgrading director and storage daemons|
Since we upgraded all servers components to 19.2.6 (coming from 18.2.7) we get Authorization key rejected by Storage daemon again for random hosts.
There really isn't any reason since those hosts never changed since very long time. Also they were working for months with 18.2.x and has 0 issues. Very strange to see this error. It's only for about 5-10% of the hosts.
web.xxxx.com JobId 9: Fatal error: Authorization key rejected by Storage daemon .
Please see http://doc.bareos.org/master/html/bareos-manual-main-reference.html#AuthorizationErrors for help.
2020-02-12 22:27:47 hostedpower-dir JobId 9: TLS negotiation failed (while probing client protocol)
2020-02-12 22:27:47 hostedpower-dir JobId 9: Connected Client: web.xxxx.com at web.xxxx.com:9102, encryption: None
I think it are all some older ones that don't support encryption, but on the other hand we have others not supporting encryption which work 100% fine.
|Tags||No tags attached.|
I get the same error without tls.. i have 10 jobs on the same Windows FD, connecting to the Same Linux SD 2 of the jobs are failed. restarting the jobs will work after.. looks a temporary error on SD, i try with delay between jobs, but not working.. i increase now the timeout between the jobs from 30 secs to 60 secs
Upgraded from 18.2.6 -> 19.2.4 (windows FD)
Upgraded Server 18.2.7 -> 19.2.5 (Linux SD/Director)
compiling now 19.2.6 and install it....
2020-02-12 22:31:47 linux-dir JobId 13446: Start Backup JobId 13446, Job=BackupWindows.2020-02-12_22.31.45_43
2020-02-12 22:31:47 linux-dir JobId 13446: Connected Storage daemon at arena8:9103, encryption: TLS_CHACHA20_POLY1305_SHA256
2020-02-12 22:31:47 linux-dir JobId 13446: Using Device "WindowsRemote" to write.
2020-02-12 22:31:47 linux-dir JobId 13446: Connected Client: windows-fd at windows:9102, encryption: None
2020-02-12 22:31:47 linux-dir JobId 13446: Handshake: Cleartext
2020-02-12 22:31:47 linux-dir JobId 13446: Encryption: None
2020-02-12 22:31:46 windows-fd JobId 13446: Fatal error: Authorization key rejected windows-fd.
2020-02-12 22:31:46 windows-fd JobId 13446: Fatal error: Failed to authenticate Storage daemon.
|You can just install 19.2.6 from the repositories. Binaries have been released to the public.|
Got the same problem on a 19.2.6 as well
running all daemons in debug mode i figured out, that problem relays server side, when director tries to connect storage daemon sometimes it asks for the tls handshake, but there is no one in request, so it fails
server centos7 dir\sd(vers 19.2.6)
clients debian8-9\centos7(vers 17.2-19.2.6)
i complied and installed 19.2.6 and the windows fd was 19.2.4 got the same error.. did not see errors in the SD Version 19.2.6 trace logs.. so i expect it's the windows fd release 19.2.4. will retry the testing with the 18.2.5 windows fd.... there i never got such errors...
will post the results...
with the downgraded fd on windows i get no errors anymore.. i started my jobs approx 25 times...
git 19.2.6 version director/storage ubuntu linux 18lts
windows fd 18.2.5 version (exe) from download
18.2.5 (30Jan19) Microsoft Windows Server 2012 Datacenter Edition (build 9200), 64-bit,Cross-compile,Win64
with the 19.2.4 version (exe) from download i got the error sometimes (in the same job sequence , (10 jobs) i got it always once)...
|test running now with windows fd 19.2.6 from download...|
with the windows 19.2.6 i got the same error...... after several backups.. ... the error occured, next backup works again...
will downgrade now again to windows fd to 18.2.5 and run the tests again....
got it now also with the 18.2.5 fd .....
linux-dir JobId 13553: Error: Bareos linux-dir 19.2.6 (11Feb20):
Build OS: Linux-4.15.0-76-generic ubuntu Ubuntu 18.04.4 LTS
Backup Level: Incremental, since=2020-02-13 14:15:16
Client: "windows-fd" 18.2.5 (30Jan19) Microsoft Windows Server 2012 Datacenter Edition (build 9200), 64-bit,Cross-compile,Win64
We have pretty much the same problem after upgrading to Bareos 19.2.6.
Clients are mostly CentOS 7 using bacula-client, so unencrypted fd<->sd connections are used.
The failing clients seem to be random - rerunning the job eventually succeeds, sometimes after 5 tries.
This definitely looks like an issue in bareos-sd 19.2, maybe handshake/authentication-related.
It is pretty annoying, since every day around half of our scheduled backups fail.
|Could somebody enable tracing on the components (at least debuglevel 100), reproduce the issue and attach the tracefiles here?|
bareos-sd.trace (32,485 bytes)
could reproduce it installing a linux vm and windows vm, set a small job and on client
tls enable = no
tls require = no
i could reproduce it ever 5 or 6 run of the backup...
bareos-dir.trace (120,750 bytes)
backup_log.txt (2,461 bytes)
18-Feb 16:44 bareos-dir JobId 92: Start Backup JobId 92, Job=backup-win-fd.2020-02-18_16.44.07_30 18-Feb 16:44 bareos-dir JobId 92: Connected Storage daemon at bareos:9103, encryption: TLS_CHACHA20_POLY1305_SHA256 18-Feb 16:44 bareos-dir JobId 92: Using Device "FileStorage" to write. 18-Feb 16:44 bareos-dir JobId 92: Connected Client: win10_1910-fd at 192.168.56.102:9102, encryption: None 18-Feb 16:44 bareos-dir JobId 92: Handshake: Cleartext 18-Feb 16:44 bareos-dir JobId 92: Encryption: None 18-Feb 16:44 win10_1910-fd JobId 92: Created 27 wildcard excludes from FilesNotToBackup Registry key 18-Feb 16:44 win10_1910-fd JobId 92: Fatal error: Authorization key rejected win10_1910-fd. 18-Feb 16:44 win10_1910-fd JobId 92: Fatal error: Failed to authenticate Storage daemon. 18-Feb 16:44 bareos-dir JobId 92: Fatal error: Bad response to Storage command: wanted 2000 OK storage , got 2902 Bad storage 18-Feb 16:44 bareos-dir JobId 92: Error: Bareos bareos-dir 19.2.6 (11Feb20): Build OS: Linux-4.15.0-76-generic ubuntu Ubuntu 18.04.4 LTS JobId: 92 Job: backup-win-fd.2020-02-18_16.44.07_30 Backup Level: Full (upgraded from Incremental) Client: "win10_1910-fd" 19.2.6 (11Feb20) Microsoft Windows 8 Professional (build 9200), 64-bit,Cross-compile,Win64 FileSet: "Win" 2020-02-18 16:07:08 Pool: "Full" (From Job FullPool override) Catalog: "MyCatalog" (From Client resource) Storage: "File" (From Job resource) Scheduled time: 18-Feb-2020 16:44:07 Start time: 18-Feb-2020 16:44:09 End time: 18-Feb-2020 16:44:16 Elapsed time: 7 secs Priority: 10 FD Files Written: 0 SD Files Written: 0 FD Bytes Written: 0 (0 B) SD Bytes Written: 0 (0 B) Rate: 0.0 KB/s Software Compression: None VSS: yes Encryption: no Accurate: no Volume name(s): Volume Session Id: 12 Volume Session Time: 1582040123 Last Volume Bytes: 0 (0 B) Non-fatal FD errors: 2 SD Errors: 0 FD termination status: Fatal Error SD termination status: Waiting on FD Bareos binary info: self-compiled: Get official binaries and vendor support on bareos.com Termination: *** Backup Error ***
backup_log.txt (2,461 bytes)win10_1910-fd.trace (238,161 bytes)
it looks it's the
tls enable = no
tls require = no
after i removed this from client and director.. i could start 40 backups against the file daemon without any backup error. if i switch back to
tls enable = no
tls require = no
after the 5 backup, the error occurs...
|Just to make sure i can reproduce this correctly: where do you set these? In the Client resource on the director and in the Director resource on the filedaemon?|
|yes, i did it in the client resource of the director and on the client in the director and myself.conf|
|In my minimal test-case I cannot make it fail (yet). Can you reproduce the problem when dir, fd and sd are on the same machine, or do you need networking in between?|
i had linux vm (DIR,SD) and a windows vm (FD)
yesterday i run my test "simple backup" until error. for approx 200 successfull backups with 19.2.6 on my test setup.
After the successfull test, i switched our productive system again from 18.6 to 19.2.6 without the tls_enable = no everywhere and its working since then without any error.
|It looks like sometimes when a non-TLS client connects to the SD the authentication fails. I'm just not sure how this can fail only sometimes.|
it's really like this.. if you set "tls enable=no" after starting 5-6 times the same job you get the error. The Same 2 VM (virtual box), same resources, only thing running on my laptop.... .not touching anything...
timing issue? if you cannot get it with localhost?
Seems like a restart of the SD between backups works around the issue. I made the test not do this anymore and now I can reproduce the issue.
Thanks for taking the time to reproduce this. I'll keep you updated as soon as I know anything new.
When debugging the SD with debuglevel 200 or above, you should see "lib/try_tls_handshake_as_a_server.cc:77-0 Connection to %s will be denied due to configuration mismatch". This message occurs when the connection does not satisfy the TlsPolicy.
The TlsPolicy that is checked should be set by ConfiguredTlsPolicyGetter::GetConfiguredTlsPolicyFromCleartextHello(). Seems like there are cases where it isn't set correctly or maybe the value gets corrupted.
|Seems like the value was never initialized in the first place. We now have a proposed fix for this.|
We have packages with the proposed fix available at https://download.bareos.org/bareos/experimental/CD/PR-431/
Would anyone be willing to test with these? The change is only in the SD, so upgrading that is sufficient.
|will test it tomorrow with the packages. is the fix already in the master branch of git?|
Tests run successfull
run 80 jobs tls enable = no -> no errros
run 40 jobs tls enabled -> no errors
thanks for fixing....
|Thanks for the feedback. I'll merge the patch into master and backport it to 19.2 and 18.2.|
|2020-02-12 22:45||hostedpower||New Issue|
|2020-02-13 07:15||ojehle||Note Added: 0003803|
|2020-02-13 07:17||ojehle||Note Edited: 0003803||View Revisions|
|2020-02-13 07:29||ojehle||Note Edited: 0003803||View Revisions|
|2020-02-13 10:04||arogge||Note Added: 0003806|
|2020-02-13 11:24||buloosh||Note Added: 0003807|
|2020-02-13 11:29||ojehle||Note Added: 0003808|
|2020-02-13 11:40||buloosh||Note Edited: 0003807||View Revisions|
|2020-02-13 13:01||ojehle||Note Added: 0003810|
|2020-02-13 13:02||ojehle||Note Edited: 0003810||View Revisions|
|2020-02-13 13:13||ojehle||Note Added: 0003811|
|2020-02-13 14:10||ojehle||Note Added: 0003812|
|2020-02-13 14:28||ojehle||Note Added: 0003813|
|2020-02-18 14:54||jkk||Note Added: 0003826|
|2020-02-18 15:16||arogge||Note Added: 0003827|
|2020-02-18 16:51||ojehle||File Added: bareos-sd.trace|
|2020-02-18 16:51||ojehle||File Added: bareos-dir.trace|
|2020-02-18 16:51||ojehle||File Added: backup_log.txt|
|2020-02-18 16:51||ojehle||File Added: win10_1910-fd.trace|
|2020-02-18 16:51||ojehle||Note Added: 0003830|
|2020-02-18 17:29||ojehle||Note Added: 0003831|
|2020-02-18 17:36||arogge||Note Added: 0003832|
|2020-02-18 17:42||ojehle||Note Added: 0003833|
|2020-02-18 20:57||ojehle||Note Edited: 0003830||View Revisions|
|2020-02-19 09:19||arogge||Note Added: 0003836|
|2020-02-19 09:30||ojehle||Note Added: 0003837|
|2020-02-19 09:44||arogge||Note Added: 0003838|
|2020-02-19 09:48||ojehle||Note Added: 0003839|
|2020-02-19 11:41||arogge||Note Added: 0003841|
|2020-02-19 11:41||arogge||Assigned To||=> arogge|
|2020-02-19 11:41||arogge||Status||new => confirmed|
|2020-02-20 14:00||arogge||Relationship added||has duplicate 0001201|
|2020-02-21 20:14||arogge||Note Added: 0003850|
|2020-02-25 13:38||arogge||Note Added: 0003855|
|2020-02-26 18:23||arogge||Note Added: 0003866|
|2020-02-26 18:24||arogge||Status||confirmed => feedback|
|2020-02-26 18:48||ojehle||Note Added: 0003868|
|2020-02-27 08:26||ojehle||Note Added: 0003872|
|2020-02-27 11:36||arogge||Note Added: 0003873|