View Issue Details

ID: 0000578
Project: bareos-core
Category: [All Projects] director
View Status: public
Last Update: 2019-01-16 13:13
Reporter: s2Xk4G
Assigned To:
Priority: high
Severity: crash
Reproducibility: always
Status: feedback
Resolution: open
Platform: Linux
OS: Debian
OS Version: 8
Product Version: 15.2.2
Target Version:
Fixed in Version:
Summary: 0000578: Director becomes unresponsible when doing big backups
Description: Hello,

the situation is as follows:

bareos-director and bareos-sd 15.2.2-35.1 (Debian jessie) are installed on the backup server.
PostgreSQL is the director's backend database.

About 110 jobs per day (a mix of Full and Incremental) run fine.
The clients run mixed file daemon versions: 15.2.2-XX (Debian jessie) and 14.X (Debian wheezy). There are absolutely no issues with them. Job sizes of up to 800 GB run fine.

Now I've added a job with 6.5 TB.
The client's file daemon is also 15.2.2-35.1, identical to the director and storage daemon.

Here is what happens:

The job starts and runs well for some hours, anywhere from 5 to 20 (the 20 hours are from the latest attempt; it had never run that long before), and then, at some point in time X, the director becomes unresponsive.

It is unresponsive in a weird way: I cannot connect using bconsole anymore,
but the log continues to be written.

All jobs that start AFTER time X fail (does the scheduler fail?).
Long-running jobs that were started BEFORE time X run properly (job 2997 in my "jobs.list" listing).


But according to the director logs, the big job is still running.
I also see on the client side that the file daemon is consuming CPU.


Here is an example from the logs that I'm attaching:
Job 2949 was started at 11:07 am. It should back up these 6.5 TB.
It ran until the next morning, when I killed the director at about 9:10.
Another job, job 2997, which was started at 22:43 and ran until 8:31 the next morning, completed successfully (see log).
Steps To Reproduce: Launch this job and wait some hours.
Tags: No tags attached.

Activities

s2Xk4G

2015-12-04 10:10

reporter  

jobs_listing_and_director_log.tgz (25,018 bytes)
s2Xk4G

2015-12-04 10:12

reporter   ~0002027

Job size is "6.5TB". Mantis autocorrection has done something weird with my text.
s2Xk4G

2015-12-04 10:14

reporter   ~0002028

And, of course, the Subject should be

"Director becomes unresponsive when doing big backups"

and not

"Director becomes unresponsible when doing big backups"
mvwieringen

2016-01-11 15:48

developer   ~0002088

The logs don't show anything special. You probably want to create a
debug log file using bareos-dir -f -d 100 or 200, but for this particular
problem the log is going to be so big that we cannot really analyze it
for a simple unsupported environment. So either make your job smaller,
e.g. by splitting it, or analyze the debug log yourself and narrow it down
to a more concrete problem that might be solvable.
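
For the "analyze the debug log yourself" route, one way to keep a very large debug log manageable is to stream it and keep only the lines around a keyword of interest, such as a job number. The following is a minimal sketch in Python, assuming the debug output of bareos-dir -f -d 100 was captured to a plain text file. The function name `filter_log`, the file path, and the search patterns are hypothetical, and nothing here assumes a particular wording of Bareos debug lines.

```python
import re
from collections import deque
from typing import Iterable, Iterator, List


def filter_log(lines: Iterable[str], patterns: List[str], context: int = 2) -> Iterator[str]:
    """Yield lines matching any pattern, plus `context` lines before and after.

    Works on any iterable of lines, so a very large file can be streamed
    without loading it into memory.
    """
    # Combine all patterns into one regular expression (hypothetical patterns).
    regex = re.compile("|".join(f"(?:{p})" for p in patterns))
    before = deque(maxlen=context)  # recent lines that were not emitted yet
    after = 0                       # trailing context lines still to emit
    for line in lines:
        if regex.search(line):
            yield from before       # leading context for this match
            before.clear()
            yield line
            after = context
        elif after > 0:
            yield line              # trailing context of the previous match
            after -= 1
        else:
            before.append(line)


# Example usage (path and pattern are assumptions, not taken from the report):
# with open("/var/tmp/bareos-dir.debug.log", errors="replace") as fh:
#     for line in filter_log(fh, [r"2949"], context=5):
#         print(line, end="")
```

Narrowing the output to the minutes around time X, or to the lines mentioning the failing job, may make it easier to spot where the director stops accepting connections.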
arogge

2019-01-16 13:13

developer   ~0003188

Can you still reproduce this issue with 17.2 or 18.2.4rc2?

Issue History

Date Modified Username Field Change
2015-12-04 10:10 s2Xk4G New Issue
2015-12-04 10:10 s2Xk4G File Added: jobs_listing_and_director_log.tgz
2015-12-04 10:12 s2Xk4G Note Added: 0002027
2015-12-04 10:14 s2Xk4G Note Added: 0002028
2016-01-11 15:48 mvwieringen Note Added: 0002088
2016-01-11 15:48 mvwieringen Status new => feedback
2019-01-16 13:13 arogge Note Added: 0003188