0000937: estimate causes error if time on filedaemon and director are out of sync

ID	Project	Category	View Status	Date Submitted	Last Update

0000937	bareos-core	file daemon	public	2018-04-08 02:50	2023-08-02 10:26

Reporter	bozonius	Assigned To	bruno-at-bareos
Priority	low	Severity	minor	Reproducibility	always
Status	closed	Resolution	unable to reproduce
Platform	Linux	OS	Ubuntu	OS Version	14.04
Product Version	16.2.4

Summary	0000937: estimate causes error if time on filedaemon and director are out of sync
Description	If the bareos-dir system and the bareos-fd system are not in time sync (by more than hardcoded 3 second threshold), the file daemon will generate an error message like "JobId 0: Error: getmsg.c:212 Malformed message: DIR and FD clocks differ by -35 seconds, FD automatically compensating" where the call is Jmsg(jcr, type, 0, _("DIR and FD clocks differ by %lld seconds, FD automatically compensating.\n"), adj); in the file daemon from about line 1565 in dir_cmd.cc. The problem is that the director code handles this message as an exception case but cannot parse it at about line 212 in bget_dirmsg() in the director code. In my configuration, it generates an email and a log entry. In some ways, it is helpful to get these alerts in emails, but the message could be handled more consistently like others the director expects, and it would be nice if the client were identified. Otherwise, it is guesswork figuring out which client is out of time sync. For me, there are only a few clients, but a data center might have a greater challenge. Of course, I'd expect a data center to have this sort of business much better under control... This is low priority and mainly a nuisance. It should not be too difficult to either 1) change the calling code (level_cmd in dir_cmd.cc) to issue an error message in a format that is parsable by the catchall case in the director, or 2) add a specific case handler for time sync problem in the director code (bget_dirmsg in getmsg.cc). It would be very helpful to include the client name or backup name in the message to aid in resolving such an issue.
Steps To Reproduce	In bconsole, run echo 'estimate accurate=yes level=incremental job=backup_name' \| bconsole where client "backup_name" is a system whose time is out of sync with the director by more than 3 seconds. See resulting bareos.log for the message (assuming Daemon Messages are configured to log errors).
Tags	No tags attached.

~~aron_s~~ 2018-04-27 13:26 reporter ~0002975	I could not reproduce this problem. In my test environment, Director and storage daemon both run virtually and the filedaemon runs on a OS X system. Time was set 30 seconds off, yet the estimation did not return any errors.

joergs 2018-04-27 14:56 developer ~0002979	A quick look at the code indicates, that this warning is only produced if level=since, not level=incremental. Are you sure it happens on your system with level=incremental?

bozonius 2018-04-27 18:28 reporter ~0002981	@joergs: level can be incremental or full. I've not tried other variations. ~~@aron_s~~: maybe the code has been corrected since the release I am using (you did not mention yours)? I did not do version comparisons of source code.

joergs 2018-04-27 18:47 developer ~0002982	I checked against 16.2.7 and now also verified that the code has not changed between 16.2.4 and 16.2.7. Setting the filed to debug level 10 (very much output!) will contain the line "level_cmd: ...". It must be since_utime in your case, maybe because of accurate. Debug level can be changed during runtime using setdebug client=<NAME> level=10 trace=1 Result will be written to a trace file.

bozonius 2018-04-28 02:20 reporter ~0002983	@joergs: I just set one of the clients to a time about 10 mins earlier than the director's (and storage's, since they are on the same system). I re-ran my experiment, just as before, and I get the same result as I reported originally. I tried the setdebug as you described. The audit log indicates that the command was received; no error there. So I am assuming that debugging is set, yet the main bareos log does not show any new messages other than the usual ones. The client I tested against is running bareos-fd 14.2.1, which is a bit older, and may explain why you don't see the same problem. And also maybe why the debugging doesn't kick in. I have upgraded the client as much as I can using the repos, short of building the client myself. I'll see if anti-X might have newer package updates for the client. The other clients are also at various levels of the client (filedaemon).

bozonius 2018-04-28 02:26 reporter ~0002984	Would running bareos in compatibility mode help to eliminate these messages, or would that just introduce more problems?

joergs 2018-04-28 12:40 developer ~0002985	Debug is not written to the main bareos.log file, but to a local trace file. Issuing the setdebug command tells you where to find this file. Compatibility mode should have no impact on this. So the error did only occur with old bareos filedaemons? Sorry, but the version should be at least 16.2, better 7.2 or master (experimental, nightly) to be relevant for us. What platform are the clients running on?

bozonius 2018-04-28 13:41 reporter ~0002986	@joergs: "Local?" to what? Local to the system where the setdebug was run, or local to where the filedaemon is running? Or local to where the director is running? I just ran setdebug again, and it did not indicate where the output was going. Could not parse: at least 16.2, better 7.2 -- what does this mean? Anti-X repo version is 14.2.1. There is no update available. However, keep in mind that other clients have seen this error, and they are running other versions of the file daemon software (depending on the distro). There's a couple of debian variants, and an arch-based system.

joergs 2018-04-28 15:24 developer ~0002987	Local: local on the system that runs the daemon. In case of "setdebug client=>NAME>" it is on the client. at least 16.2, better 7.2 => at least 16.2, better 17.2

bozonius 2018-04-28 18:56 reporter ~0002988 Last edited: 2018-04-28 19:07	OK, so I have setdebug as requested, found the trace file (it was not obvious to me where "current directory" might be), and yes, found since_utime= entries in the output. But I assure you I am running these backups ONLY as full or incremental. I do not use any others (even differential). Here is an extract from the trace file: $ grep level_cmd /shared/antix-fd.trace antix-fd: dir_cmd.c:1228-0 level_cmd: level = accurate_incremental mtime_only=0 antix-fd: dir_cmd.c:1228-0 level_cmd: level = since_utime 1524830415 mtime_only=0 prev_job=antix-sys.2018-04-27_05.00.00_10 antix-fd: dir_cmd.c:1228-0 level_cmd: level = accurate_incremental mtime_only=0 antix-fd: dir_cmd.c:1228-0 level_cmd: level = since_utime 1524830415 mtime_only=0 prev_job=antix-sys.2018-04-27_05.00.00_10 antix-fd: dir_cmd.c:1228-0 level_cmd: level = incremental mtime_only=0 antix-fd: dir_cmd.c:1228-0 level_cmd: level = since_utime 1524830415 mtime_only=0 prev_job=antix-sys.2018-04-27_05.00.00_10 antix-fd: dir_cmd.c:1228-0 level_cmd: level = accurate_incremental mtime_only=0 antix-fd: dir_cmd.c:1228-0 level_cmd: level = since_utime 1524830438 mtime_only=0 prev_job=antix-user.2018-04-27_05.00.01_11 antix-fd: dir_cmd.c:1228-0 level_cmd: level = incremental mtime_only=0 antix-fd: dir_cmd.c:1228-0 level_cmd: level = since_utime 1524830438 mtime_only=0 prev_job=antix-user.2018-04-27_05.00.01_11

bozonius 2018-05-22 00:39 reporter ~0003009	This error has surfaced again, today. I have searched the logs in /var/log/bareos, /var/log/syslog, and my own logging and cannot force an explanation out of any of them. The strangest part of this is that the first of two jobs failed on just two Vbox VM filedaemons, the other of the two ran fine on each of the two systems. These run smoothly every day, nearly forever (I'm not complaining, believe me). It's just that every so often, I see this crop up. I might not see it again for a long time. I note that bareos tools do more audit logging in some tools than others. For instance, the audit log for today is relatively "quiet" for the jobs I run overnight, but the log gets noisy, logging every SQL query when I run BAT. I will see if I can get more "noise" from bconsole. The director (dir 16.2.4) and storage (sd 16.2.4) daemons are still running on Ubuntu 14.04. The clients affected today were Devuan Jessie (fd 14.2.1) and Anti-X (fd 14.2.1). As I mentioned, these backups run without serious* issues over 99% of the time, which makes this an intermittent problem, and thus one that is hard to nail down the source of. Increasing the verbosity of the tools to the logs will be helpful, and I will look into this. (* sometimes clocks are off, but this presents no serious problem; the jobs do not fail in those instances)

bozonius 2018-05-30 09:07 reporter ~0003021	I have debug logging running on the director (I may add logging for other daemons later if needed). Log level is set to 200. It would be helpful to have a listing of debug levels. I want to be sure to gather enough debugging info to pin down problems when they come up, but not so much that it will eat disk space unnecessarily. Now I wait for the next time there is a problem and see if there is adequate log output to analyze it.

bozonius 2018-06-03 04:25 reporter ~0003023	I have now enabled debugging on calls to bconsole for one backup which fails as described. This might be helpful, too, since the failure occurs while bconsole is running.

bozonius 2018-06-04 01:31 reporter ~0003024	I have now enabled debugging on all jobs.

bozonius 2018-06-04 01:34 reporter ~0003025	I just realized this wasn't the bug I was trying to address with the debug statements (though they will help perhaps). The other problem I have is with backups that fail in bconsole (not just issue warning) when requesting an estimate. At any rate, maybe this additional logging will help going forward for any sort of issue that arises. The logging does not add a significant amount of disk usage at level 200.

~~aron_s~~ 2018-06-04 13:41 reporter ~0003026	So I reproduced this again and got the same warning, but with the name of the filedaemon as source of the warning. Identifying the out-of-sync client is doable. The job itself also runs perfectly fine. I get the following output when running a job on a client with time difference > 60s: 04-Jun 13:35 ubuntu1604-VirtualBox-dir JobId 10: Start Backup JobId 10, Job=macbook-log-backup.2018-06-04_13.35.08_12 04-Jun 13:35 ubuntu1604-VirtualBox-dir JobId 10: Created new Volume "Incremental-0002" in catalog. 04-Jun 13:35 ubuntu1604-VirtualBox-dir JobId 10: Using Device "FileStorage" to write. 04-Jun 13:32 macbook-fd JobId 10: DIR and FD clocks differ by -180 seconds, FD automatically compensating. 04-Jun 13:32 macbook-fd JobId 10: Warning: LZ4 compression support requested in fileset but not available on this platform. Disabling ... 04-Jun 13:35 ubuntu1604-VirtualBox-sd JobId 10: Labeled new Volume "Incremental-0002" on device "FileStorage" (/var/lib/bareos/storage). 04-Jun 13:35 ubuntu1604-VirtualBox-sd JobId 10: Wrote label to prelabeled Volume "Incremental-0002" on device "FileStorage" (/var/lib/bareos/storage) 04-Jun 13:35 ubuntu1604-VirtualBox-sd JobId 10: Elapsed time=00:00:01, Transfer rate=19.44 K Bytes/second 04-Jun 13:35 ubuntu1604-VirtualBox-dir JobId 10: sql_create.c:872 Insert of attributes batch table done 04-Jun 13:35 ubuntu1604-VirtualBox-dir JobId 10: Bareos ubuntu1604-VirtualBox-dir 17.2.4 (21Sep17): Build OS: i686-pc-linux-gnu ubuntu Ubuntu 16.04 LTS JobId: 10 Job: macbook-log-backup.2018-06-04_13.35.08_12 Backup Level: Incremental, since=2018-05-17 13:11:57 Client: "macbook-fd" 17.2.4 (21Sep17) x86_64-apple-darwin17.4.0,osx,17.4.0 FileSet: "MacbookLogFileset" 2018-05-17 13:00:50 Pool: "Incremental" (From Job IncPool override) Catalog: "MyCatalog" (From Client resource) Storage: "ubuntu1604-VirtualBox-sd" (From Job resource) Scheduled time: 04-Jun-2018 13:35:06 Start time: 04-Jun-2018 13:35:14 End time: 04-Jun-2018 13:35:14 Elapsed time: 0 secs Priority: 10 FD Files Written: 7 SD Files Written: 7 FD Bytes Written: 18,515 (18.51 KB) SD Bytes Written: 19,448 (19.44 KB) Rate: 0.0 KB/s Software Compression: None VSS: no Encryption: no Accurate: no Volume name(s): Incremental-0002 Volume Session Id: 1 Volume Session Time: 1528110773 Last Volume Bytes: 20,363 (20.36 KB) Non-fatal FD errors: 0 SD Errors: 0 FD termination status: OK SD termination status: OK Termination: Backup OK

bozonius 2018-06-06 23:04 reporter ~0003033 Last edited: 2018-06-06 23:13	~~@aron_s~~: Thanks for trying this again. BTW, there are actually two problems here (may be related, maybe not). The original issue was that I got email message indicating times were out of sync between client and director systems, but that message did not name the client. The actual run of the backup worked fine, just as you confirm. This problem is, ultimately, minor though annoying. The more vexing problem is the other one. Many times, a call to bconsole for an estimate of backup size fails. I am now able to compare debug traces of normal (non-failing) job and failing job (debug level is 200). So far, my investigation reveals "bconsole (100): console.c:346-0 Got poll BNET_EOD" appears in the trace only once for a normal job, but twice for the failing job. This "once versus twice" behavior seems to be consistent in comparing failed to normal jobs. I note it is possible the 2nd message, if it is indeed actually sent by the director, might be getting swallowed by my logging as the result, perhaps, of my script closing the pipe before some final flush of messages to my script, or something like that. I'll double-check the perl code I wrote, but I have been running this script for years now and has always had this intermittent results from the 'estimate' command. bconsole estimate is called from perl as a "Command Before Job". In both scenarios, bconsole connects successfully with the director. I can see the 'estimate' request in the director's trace in both instances. Is there a much better way to obtain the estimate? If I end up with some run-away job (meaning huge amounts of data have appeared since the last run), I don't want to end up eating gobs of backup disk space (and time!). One example of this is backing up the home directory where, say, a program has experienced extremely increased amounts of data in its files. I have excluded a long list of files and directories in all backups, but every so often another one would come along that I had overlooked that just happened to bollix the works. This call to bconsole for an estimate prior to the run is just an additional prevention.

bozonius 2019-12-03 17:24 reporter ~0003645 Last edited: 2019-12-05 19:05	I have (finally) gotten around to upgrading my backups. I am using bareos 18.2.5 on the host, and that comes from the bareos repo, not my distro's (Devuan). The clients are running bareos 16.2.4, 16.2.6, or 17.2.5; these come from the respective distros, but they and the director are all running on Linux. This morning, I was greeted with messages that the file daemons were out of sync on 3 of the clients. Again, I am not certain which ones, because the messages indicate neither the client involved nor even the job id (which is always zero in these messages). I think I'd like to change the title of this issue to something like "phantom error messages due to client clocks out of sync with director" which would be much clearer. I did not know, when I first filed this bug, what the nature of the problem was in the first place. These messages do not represent true errors as such since the system adjusts itself anyway. But the messages are unhelpful as they stand; even if I were to investigate the time disparities, I won't get too far since I can't tell which one of the clients are out of sync. Is it possible to either eliminate these messages altogether (since the situations they represent seem to be self-healing), or to include either the job number or the client name (or both) in these messages? I'd be happy to go around fixing my time problems, but generally I do not witness significant time differences by the time I go to investigate. As far as I know, all my clients are sync'ing to one time server on my network.

bruno-at-bareos 2023-07-18 14:15 manager ~0005205	Is this reproducible with supported OS and recent Bareos 22.1.0 code ? Did you ever tried to use a runscript command (one on the FD, one on the dir to show date/time when job start?)

bruno-at-bareos 2023-08-02 10:26 manager ~0005298	closing no feedback, nor being still a problem with recent code.

Date Modified	Username	Field	Change
2018-04-08 02:50	bozonius	New Issue
2018-04-27 13:26	~~aron_s~~	Note Added: 0002975
2018-04-27 14:56	joergs	Note Added: 0002979
2018-04-27 14:56	joergs	Status	new => feedback
2018-04-27 18:28	bozonius	Note Added: 0002981
2018-04-27 18:28	bozonius	Status	feedback => new
2018-04-27 18:47	joergs	Note Added: 0002982
2018-04-28 02:20	bozonius	Note Added: 0002983
2018-04-28 02:26	bozonius	Note Added: 0002984
2018-04-28 12:40	joergs	Note Added: 0002985
2018-04-28 12:40	joergs	Status	new => feedback
2018-04-28 13:41	bozonius	Note Added: 0002986
2018-04-28 13:41	bozonius	Status	feedback => new
2018-04-28 15:24	joergs	Note Added: 0002987
2018-04-28 18:56	bozonius	Note Added: 0002988
2018-04-28 19:07	bozonius	Note Edited: 0002988
2018-05-22 00:39	bozonius	Note Added: 0003009
2018-05-30 09:07	bozonius	Note Added: 0003021
2018-06-03 04:25	bozonius	Note Added: 0003023
2018-06-04 01:31	bozonius	Note Added: 0003024
2018-06-04 01:34	bozonius	Note Added: 0003025
2018-06-04 13:41	~~aron_s~~	Note Added: 0003026
2018-06-06 23:04	bozonius	Note Added: 0003033
2018-06-06 23:13	bozonius	Note Edited: 0003033
2019-12-03 17:24	bozonius	Note Added: 0003645
2019-12-05 19:05	bozonius	Note Edited: 0003645
2023-07-18 14:15	bruno-at-bareos	Assigned To	=> bruno-at-bareos
2023-07-18 14:15	bruno-at-bareos	Status	new => feedback
2023-07-18 14:15	bruno-at-bareos	Note Added: 0005205
2023-08-02 10:26	bruno-at-bareos	Status	feedback => closed
2023-08-02 10:26	bruno-at-bareos	Resolution	open => unable to reproduce
2023-08-02 10:26	bruno-at-bareos	Note Added: 0005298

Reporting new Issues is disabled, please Report new Issues at https://github.com/bareos/bareos/issues

View Issue Details

Activities

Issue History