0000578: Director becomes unresponsible when doing big backups - Bareos Bug Tracker

ID	Project	Category	View Status	Date Submitted	Last Update

0000578	bareos-core	director	public	2015-12-04 10:10	2019-07-04 15:54

Reporter	s2Xk4G	Assigned To	arogge
Priority	high	Severity	crash	Reproducibility	always
Status	closed	Resolution	suspended
Platform	Linux	OS	Debian	OS Version	8
Product Version	15.2.2

Summary	0000578: Director becomes unresponsible when doing big backups
Description	Hello, the situation is as follows: bareos-director and bareos-sd 15.2.2-35.1 (Debian jessie) are installed on the backup server. PostgreSQL is the director's backend database. About 110 jobs per day (mixed Full / Incremental) run fine. The clients have mixed filedaemon versions: 15.2.2-XX (Debian jessie) and 14.X (Debian wheezy). Absolutely no issues there. Job sizes are up to 800 GB and run fine.
Now I have added a job with 6.5 TB. The client's filedaemon is also 15.2.2-35.1, identical to the director and storage daemon.
What happens: the job starts and runs well for some hours, 5 to 20 (20 hours was the latest attempt; it had never run that long before), and then, at some point in time X, the director becomes unresponsive. It is unresponsive in a weird way: I cannot connect with bconsole anymore, but the log keeps being written. All jobs that start AFTER time X fail (does the scheduler fail?). Long-running jobs that were started BEFORE time X run properly (Job 2997 in my "jobs.list" listing). According to the director logs, however, the big job is still running. I also see that the filedaemon on the client is consuming CPU.
Here is an example from the logs that I am attaching: Job 2949 was started at 11:07 am and should back up these 6.5 TB. It ran until the next morning, when I killed the director at about 9:10. Another job, Job 2997, was started at 22:43, ran until 8:31 the next morning, and completed successfully (see log).
Steps To Reproduce	Launch this job and wait a few hours.
Tags	No tags attached.
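
Since the description hinges on an unknown "time X" at which bconsole stops connecting, a periodic probe could help record when that happens. The following is only a minimal sketch and not part of the original report: `probe` and `watch` are hypothetical helper names, and it assumes that `bconsole` is on the PATH and accepts commands such as `status director` on standard input. Whether bconsole returns a non-zero exit code on a failed connection is not verified here, so the timeout is the main signal.

```python
import subprocess
import time
from datetime import datetime


def probe(cmd, stdin_text, timeout_s):
    """Run cmd once, feed it stdin_text, and report whether it finished in time.

    Returns (ok, elapsed_seconds). ok is False on timeout, on a non-zero
    exit code, or if the command cannot be started at all.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    return ok, time.monotonic() - start


def watch(cmd, stdin_text, timeout_s=30, interval_s=60, max_probes=None):
    """Probe repeatedly and print one timestamped line per attempt."""
    count = 0
    while max_probes is None or count < max_probes:
        ok, elapsed = probe(cmd, stdin_text, timeout_s)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {'OK' if ok else 'FAILED'} {elapsed:.1f}s", flush=True)
        count += 1
        if max_probes is None or count < max_probes:
            time.sleep(interval_s)
```

A possible invocation, under the assumptions above, would be `watch(["bconsole"], "status director\nquit\n")`, left running alongside the big job. The first `FAILED` line then gives an upper bound for time X that can be compared with the director log.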

s2Xk4G 2015-12-04 10:10 reporter	jobs_listing_and_director_log.tgz (25,018 bytes)

s2Xk4G 2015-12-04 10:12 reporter ~0002027	Job size is "6.5TB". Mantis autocorrection has done something weird with my text.

s2Xk4G 2015-12-04 10:14 reporter ~0002028	And, of course, the Subject should be "Director becomes unresponsive when doing big backups" and not "Director becomes unresponsible when doing big backups"

mvwieringen 2016-01-11 15:48 developer ~0002088	The logs don't show anything special. You probably want to make a debug log file using bareos-dir -f -d 100 or 200, but for this particular problem the log is going to be so big that we cannot really analyze it as part of a simple unsupported environment. So either make your job smaller, e.g. by splitting it, or analyze the debug log yourself and pinpoint a somewhat more concrete problem that might be solvable.
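
One modest way to start narrowing down a very large log is to look for the longest silences between consecutive timestamped lines. The sketch below is an illustration only, not a Bareos tool: `parse_stamp` and `largest_gaps` are hypothetical names, and it assumes that log lines begin with a timestamp of the form `04-Dec 11:07` (day, abbreviated English month, hour and minute, no year). If the log at hand uses another format, `STAMP_RE` and the `strptime` pattern need to be adapted. Lines without a leading timestamp are skipped.

```python
import re
from datetime import datetime

# Assumed line prefix, e.g. "04-Dec 11:07 ..."; adjust for other log formats.
STAMP_RE = re.compile(r"^(\d{2}-[A-Za-z]{3} \d{2}:\d{2}) ")


def parse_stamp(line, year):
    """Return the datetime at the start of line, or None if there is none.

    The assumed format carries no year, so the caller supplies one.
    """
    match = STAMP_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %d-%b %H:%M")


def largest_gaps(lines, year, top=3):
    """Return up to `top` tuples (gap, line_before, line_after), longest gap first."""
    stamped = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is not None:
            stamped.append((stamp, line.rstrip("\n")))
    gaps = [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]
```

Called as `largest_gaps(open(path), 2015)` with `path` pointing at the log file, it returns the lines surrounding the longest quiet periods. A long gap is only a hint about where to read more closely, not proof of a hang, since a director that is merely waiting on a long-running job also logs little.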

arogge 2019-01-16 13:13 manager ~0003188	Can you still reproduce this issue with 17.2 or 18.2.4rc2?

Date Modified	Username	Field	Change
2015-12-04 10:10	s2Xk4G	New Issue
2015-12-04 10:10	s2Xk4G	File Added: jobs_listing_and_director_log.tgz
2015-12-04 10:12	s2Xk4G	Note Added: 0002027
2015-12-04 10:14	s2Xk4G	Note Added: 0002028
2016-01-11 15:48	mvwieringen	Note Added: 0002088
2016-01-11 15:48	mvwieringen	Status	new => feedback
2019-01-16 13:13	arogge	Note Added: 0003188
2019-07-04 15:54	arogge	Assigned To	=> arogge
2019-07-04 15:54	arogge	Status	feedback => closed
2019-07-04 15:54	arogge	Resolution	open => suspended