View Issue Details

ID: 0000259
Project: bareos-core
Category: General
View Status: public
Last Update: 2014-01-23 09:20
Reporter: TomWork
Assigned To: (none)
Priority: normal
Severity: minor
Reproducibility: sometimes
Status: closed
Resolution: no change required
Platform: Linux
OS: CentOS
OS Version: 6
Product Version: 13.2.2
Summary: 0000259: bconsole memory usage and slowness
Description

*Short version*
When we execute many bconsole processes in parallel, from time to time they become very slow and start to chew up a lot of memory. We only execute bconsole commands like the following: sh -c echo 'list jobname=$FQDN' | /usr/sbin/bconsole

*Long version*
- Context:
We use nagios. NRPE is a nagios client that executes bconsole to get the status of completed jobs for each monitored host. Nagios checks each of the hosts/jobs we have (173 IIRC) every 5 min, and nrpe forks a checker script for each host. We use check_bacula_lastbackup.pl v1.0, which mainly executes: sh -c echo 'list jobname=$FQDN' | /usr/sbin/bconsole (the effective pipeline is reconstructed below). Lately I added 2 more nodes to back up in Bareos. I don't know if it's related, but it looks like it was the straw that broke the camel's back. My colleagues had the issue once or twice a few months ago, but it resolved itself and no effort was spent finding out what was going on; killall -9 bconsole was the answer. Now it has been raised to me ;)
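
The quoting above is as ps showed it; the effective pipeline each check runs boils down to something like this, where $FQDN is the monitored host's job name:

echo "list jobname=$FQDN" | /usr/sbin/bconsole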

- Analysis
If you stop nrpe and kill all remaining check_bacula and associated bconsole processes, the box recovers after a while. The load obviously decreases, but mainly bconsole works again; otherwise it hangs at the prompt. Indeed it does NOT display the '*', therefore you cannot execute anything. It works again after a while. If you wait a certain amount of time, it becomes quicker and quicker to get the '*' and to work from there. This makes me think it is related to a timeout or to socket overuse (bconsole-to-director sockets in TIME_WAIT). But I don't know where the bottleneck is or what is causing it, except the fact that we have too many bconsole processes. BTW, I agree that this monitoring design is terrible and I am happy to move to something else; if you have any recommendations feel free to share. But in the meantime, I think there is still a bug to solve regarding the crazy memory usage. Back to the issue: at first I thought it could be a postgresql issue because it was still running with the default el6 setup (32MB of shared memory). I increased the parameters to something bigger (see additional info) and even reindexed all tables. No go, same problem. A select * from pg_stat_activity did not display any running queries, nothing. So I believe the problem is between bconsole and the director, but I cannot confirm.

- Questions

1. Could the slowness be related to the number of concurrent connections?
2. Could the slowness be related to the postgresql backend?
3. Why would it use so much memory?
4. Why does bconsole take a long time to (progressively) recover once the flood is over?

- TODO
I am happy to follow any directions to help debug the issue, because I just don't know where to start.
If useful, next time I can run strace -tt and save the netstat output (illustrative commands below).
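
For example (<bconsole_pid> is a placeholder; 9101 is the default director port):

# follow one hanging bconsole with timestamps, across all threads
strace -tt -f -p <bconsole_pid> -o /tmp/bconsole.strace
# count director connections per TCP state
netstat -tan | awk '$4 ~ /:9101$/ || $5 ~ /:9101$/ {print $6}' | sort | uniq -c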


Steps To Reproduce

I can reproduce it by starting my nagios client and waiting a certain amount of time.

I am not sure how you can reproduce the issue because I don't know where the problem is yet (bareos or postgresql setup?). Maybe you could reproduce it by running a lot of sh -c echo 'list jobname=$FQDN' | /usr/sbin/bconsole invocations for a few hours on your test box.

Additional Information

* check_bacula
http://exchange.nagios.org/directory/Plugins/Backup-and-Recovery/Bacula/check_bacula_lastbackup-2Epl/details

* Version
$ rpm -qa |grep bareos
bareos-common-13.2.1-81.1.el6.x86_64
bareos-filedaemon-13.2.1-81.1.el6.x86_64
bareos-tools-13.2.1-81.1.el6.x86_64
bareos-database-common-13.2.1-81.1.el6.x86_64
bareos-database-tools-13.2.1-81.1.el6.x86_64
bareos-bconsole-13.2.1-81.1.el6.x86_64
bareos-director-13.2.1-81.1.el6.x86_64
bareos-storage-13.2.1-81.1.el6.x86_64
bareos-database-postgresql-13.2.1-81.1.el6.x86_64
bareos-client-13.2.1-81.1.el6.x86_64
$ cat /etc/redhat-release
CentOS release 6.4 (Final)


* Director config
- Password lines have been removed
- Servername has been replaced by bareos.server.fqdn
- Emails have been tweaked

# Bacula Director Master Configuration
# for bareos.server.fqdn

# Define the name of this director so other clients can
# connect to it and work with our system
Director {
  Name = "bareos.server.fqdn:director"
  Query File = "/usr/lib/bareos/scripts/query.sql"
  Working Directory = "/var/lib/bareos"
  PID Directory = "/var/run/bareos"
  Maximum Concurrent Jobs = 20
  Messages = "bareos.server.fqdn:messages:daemon"
}

# This is where the catalog information will be stored (basically
# this should be how to connect to whatever database we're using)
Catalog {
  Name = "bareos.server.fqdn:postgresql"
  dbname = "bareos"; dbdriver = postgresql
      user = bareos; password = "b4cul4p4ssw0rd"
  }

# Configure how the director will log and/or send messages. This
# should be for just about everything.
Messages {
  Name = "bareos.server.fqdn:messages:standard"
  Mail Command = "/usr/sbin/bsmtp -h localhost -f devnull@domain.tld -s \"Bacula %t %e (for %c)\" %r"
  Operator Command = "/usr/sbin/bsmtp -h localhost -f devnull@domain.tld -s \"Bacula Intervention Required (for %c)\" %r"
  Mail = devnull@domain.tld = all, !skipped
  Mail On Error = sysadmin@domain.tld = all, !skipped
  Operator = sysadmin@domain.tld = mount
  Console = all, !skipped, !saved
  # WARNING! the following will create a file that you must cycle from
  # time to time as it will grow indefinitely. However, it will
  # also keep all your messages if they scroll off the console.
  Append = "/var/log/bareos/bareos.server.fqdn:director.log" = all, !skipped
  Catalog = all
}

# These are messages directly from the various daemons themselves.
Messages {
  Name = "bareos.server.fqdn:messages:daemon"
  Mail Command = "/usr/sbin/bsmtp -h localhost -f devnull@domain.tld -s \"Bacula Notice (from Director %d)\" %r"
  Mail = sysadmin@domain.tld = all, !skipped
  Console = all, !skipped, !saved
  Append = "/var/log/bareos/bareos.server.fqdn:director.log" = all, !skipped
}


# DEFAULT STORAGE SERVER ------------------------------------------------------
# All the clients will define their own Storage Daemon configuration as they
# will connect to a dedicated File device on that director (to aid Pool & Volume
# management along with concurrent access). This section will define a default
# Storage Daemon to connect to (using the standard FileStorage device) and a
# Pool which will be used with that as well.
Storage {
  Name = "bareos.server.fqdn:storage:default"
  Address = bareos.server.fqdn
  Device = "DefaultFileStorage"
  Media Type = File
  Maximum Concurrent Jobs = 20
}
Storage {
  Name = "bareos.server.fqdn:storage:BackupCatalog"
  Address = bareos.server.fqdn
  Device = "FileStorage:BackupCatalog"
  Media Type = File-BackupCatalog
  Maximum Concurrent Jobs = 3
}

Pool {
  Name = "bareos.server.fqdn:pool:default"
  # All Volumes will have the format standard.date.time to ensure they
  # are kept unique throughout the operation and also aid quick analysis
  # We won't use a counter format for this at the moment.
  Label Format = "${Job}.${Year}${Month:p/2/0/r}${Day:p/2/0/r}.${Hour:p/2/0/r}${Minute:p/2/0/r}"
  Pool Type = Backup
  # Clean up any we don't need, and keep them for the Volume Retention period
  # set below (one year).
  # Note the files for the old volumes will still remain on the disk but will
  # be truncated to a zero size.
  Recycle = No
  Auto Prune = Yes
  Action On Purge = Truncate
  Volume Retention = 1 Year
  # Don't allow re-use of volumes; one volume per job only
  Maximum Volume Jobs = 1
}
Pool {
  Name = "bareos.server.fqdn:pool:default.full"
  # All Volumes will have the format standard.date.time to ensure they
  # are kept unique throughout the operation and also aid quick analysis
  # We won't use a counter format for this at the moment.
  Label Format = "${Job}.full.${Year}${Month:p/2/0/r}${Day:p/2/0/r}.${Hour:p/2/0/r}${Minute:p/2/0/r}"
  Pool Type = Backup
  # Clean up any we don't need, and keep them for a maximum of a year
  # Note the files for the old volumes will still remain on the disk but will
  # be truncated to a zero size.
  Recycle = No
  Auto Prune = Yes
  Action On Purge = Truncate
  Volume Retention = 1 Year
  # Don't allow re-use of volumes; one volume per job only
  Maximum Volume Jobs = 1
}

Pool {
  Name = "bareos.server.fqdn:pool:default.differential"
  # All Volumes will have the format standard.date.time to ensure they
  # are kept unique throughout the operation and also aid quick analysis
  # We won't use a counter format for this at the moment.
  Label Format = "${Job}.diff.${Year}${Month:p/2/0/r}${Day:p/2/0/r}.${Hour:p/2/0/r}${Minute:p/2/0/r}"
  Pool Type = Backup
  # Clean up any we don't need, and keep them for a maximum of three months
  # Note the files for the old volumes will still remain on the disk but will
  # be truncated to a zero size.
  Recycle = No
  Auto Prune = Yes
  Action On Purge = Truncate
  Volume Retention = 3 Months
  # Don't allow re-use of volumes; one volume per job only
  Maximum Volume Jobs = 1
}

Pool {
  Name = "bareos.server.fqdn:pool:default.incremental"
  # All Volumes will have the format standard.date.time to ensure they
  # are kept unique throughout the operation and also aid quick analysis
  # We won't use a counter format for this at the moment.
  Label Format = "${Job}.incr.${Year}${Month:p/2/0/r}${Day:p/2/0/r}.${Hour:p/2/0/r}${Minute:p/2/0/r}"
  Pool Type = Backup
  # Clean up any we don't need, and keep them for a maximum of 40 days
  # Note the files for the old volumes will still remain on the disk but will
  # be truncated to a zero size.
  Recycle = No
  Auto Prune = Yes
  Action On Purge = Truncate
  Volume Retention = 40 Days
  # Don't allow re-use of volumes; one volume per job only
  Maximum Volume Jobs = 1
}

Pool {
  Name = "bareos.server.fqdn:pool:catalog"
  # All Volumes will have the format director.catalog.date.time to ensure they
  # are kept unique throughout the operation and also aid quick analysis
  # Label Format = "${Job}.bareos.server.fqdn.${CounterLdr-bacula1Catalog+:p/3/0/r}"
  Label Format = "${Job}.bareos.server.fqdn.${Year}${Month:p/2/0/r}${Day:p/2/0/r}.${Hour:p/2/0/r}${Minute:p/2/0/r}"
  Pool Type = Backup
  # Clean up any we don't need, and keep them for the Volume Retention period
  # set below (one week).
  Recycle = No
  Auto Prune = Yes
  Action On Purge = Truncate
  # We have no limit on the number of volumes, but we will simply ensure that
  # we keep at least one week's worth of backups of the database
  Volume Retention = 1 Week
  # Don't allow re-use of volumes; one volume per job only
  Maximum Volume Jobs = 1
}

# Create a Counter which will be used to label the catalog volumes on the system
Counter {
  Name = "CounterLdr-bacula1Catalog"
  Minimum = 1
  Catalog = "bareos.server.fqdn:postgresql"
}

# FILE SETS -------------------------------------------------------------------
# Define the standard set of locations which will be backed up (along with
# what within those should not be). In general, we have two types:
#
# Basic:noHome This doesn't back up the /home directory, as it's mounted
# from an NFS server on the network (this is the default).
# Basic:withHome This one does, for servers where we don't mount NFS.

FileSet {
  Name = "Basic:noHome"
  Include {
    Options {
      Signature = SHA1
      Compression = lz4hc
      Shadowing = localwarn
    }

    # Don't worry about most of the system, as Puppet manages the
    # configuration. Ensure that per-machine state files or settings
    # are backed up, along with stuff from /var or /srv, which should be
    # most service-related files
    File = /boot
    File = /etc
    File = /usr/local
    File = /var
    File = /opt
    File = /srv
    # /home will not be backed up on any normal server, as it's managed from
    # a central file-server for most servers.
  }

  Exclude {
    # Ignore stuff that can be ignored
    File = /var/cache
    File = /var/tmp
    # The state of the packages installed, or their files, etc.
    # can be ignored as we use puppet to rebuild much of the server
    File = /var/lib/apt
    File = /var/lib/dpkg
    File = /var/lib/puppet
    File = /var/lib/yum
    # Ignore database stuff; this will need to be handled
    # using some sort of a dump script
    File = /var/lib/mysql
    File = /var/lib/postgresql
    File = /var/lib/ldap
    # Bacula's state files are no use to us on restore
    File = /var/lib/bareos
  }
}

FileSet {
  Name = "Basic:withHome"
  Include {
    Options {
      Signature = SHA1
      Compression = lz4hc
      Shadowing = localwarn
    }

    File = /boot
    File = /etc
    File = /usr/local
    File = /var
    File = /opt
    File = /srv
    # This set does include /home
    File = /home
  }

  Exclude {
    File = /var/cache
    File = /var/tmp
    File = /var/lib/apt
    File = /var/lib/dpkg
    File = /var/lib/puppet
    File = /var/lib/mysql
    File = /var/lib/postgresql
    File = /var/lib/ldap
    File = /var/lib/bareos
    File = /var/lib/yum
  }
}

FileSet {
  Name = "LinuxAll"
  Include {
    Options {
      Signature = SHA1
      Compression = lz4hc
      Shadowing = localwarn
      One FS = No # change into other filesystems
      FS Type = ext2 # filesystems of given types will be backed up
      FS Type = ext3 # others will be ignored
      FS Type = ext4
      FS Type = xfs
      FS Type = reiserfs
      FS Type = jfs
      FS Type = btrfs
      FS Type = vzfs
    }
    File = /
  }
  Exclude {
    File = /proc
    File = /tmp
    File = /.journal
    File = /.fsck
    File = /sys
    File = /dev
    File = /var/cache
    File = /var/cfengine/config
    File = /var/tmp
    File = /var/lib/apt
    File = /var/lib/dpkg
    File = /var/lib/puppet
    File = /var/lib/mysql
    # Nagios check results
    File = /var/nagios/spool/checkresults
    # postgresql data directories
    File = /var/lib/postgresql
    File = /var/lib/pgsql
    File = /var/lib/ldap
    # Backup Servers
    File = /var/lib/bareos
    File = /var/lib/bacula
    # Yum temp files
    File = /var/lib/yum
    # Cobbler Servers repo mirror
    File = /var/www/cobbler/repo_mirror
    # Virtuozzo/openvz containers
    File = /vz
    # Devsunserver localhomes
    File = /localhome
    # Spacewalk repositories
    File = /var/satellite
    # Dbbackup local dump dir
    File = /opt/dbbackup
    # Special filesystem to ignore
    File = /var/lib/nfs/rpc_pipefs
  }

}

# This set is specifically for Bacula to allow it to back up its own internal
# catalog as part of the normal process.
FileSet {
  Name = "Catalog"
  Include {
    Options {
      Signature = SHA1
      Compression = lz4hc
    }
    File = "/var/lib/bareos/bareos.sql"
  }
}


# SCHEDULE --------------------------------------------------------------------
# Define when jobs should be run, and what Levels of backups they will be when
# they are run.

# These two are the default backup schedule; don't change them
Schedule {
  Name = "WeeklyCycle"
  Run = Level=Full First Sun at 23:05
  Run = Level=Differential Second-Fifth Sun at 23:05
  Run = Level=Incremental Mon-Sat at 23:05
}

Schedule {
  Name = "WeeklyCycleAfterBackup"
  Run = Level=Full Mon-Sun at 05:10
}

# These cycles are set up so that we can spread out the full backups of our
# servers across the week. Some at the weekend, some mid-week.
Schedule {
  Name = "Weekly:onFriday"
  Run = Level=Full First Fri at 22:00
  Run = Level=Differential Second-Fifth Fri at 22:00
  Run = Level=Incremental Sat-Thu at 22:00
}

Schedule {
  Name = "Weekly:onSaturday"
  # Because this is a weekend job, we'll start the full runs earlier
  Run = Level=Full First Sat at 22:00
  Run = Level=Differential Second-Fifth Sat at 22:00
  Run = Level=Incremental Sun-Fri at 22:00
}

Schedule {
  Name = "Weekly:onSunday"
  # Because this is a weekend job, we'll start the full runs earlier
  Run = Level=Full First Sun at 22:00
  Run = Level=Differential Second-Fifth Sun at 22:00
  Run = Level=Incremental Mon-Sat at 22:00
}

Schedule {
  Name = "Weekly:onMonday"
  Run = Level=Full First Mon at 22:00
  Run = Level=Differential Second-Fifth Mon at 22:00
  Run = Level=Incremental Tue-Sun at 22:00
}

Schedule {
  Name = "Weekly:onTuesday"
  Run = Level=Full First Tue at 22:00
  Run = Level=Differential Second-Fifth Tue at 22:00
  Run = Level=Incremental Wed-Mon at 22:00
}

Schedule {
  Name = "Weekly:onWednesday"
  Run = Level=Full First Wed at 22:00
  Run = Level=Differential Second-Fifth Wed at 22:00
  Run = Level=Incremental Thu-Tue at 22:00
}

Schedule {
  Name = "Weekly:onThursday"
  Run = Level=Full First Thu at 22:00
  Run = Level=Differential Second-Fifth Thu at 22:00
  Run = Level=Incremental Fri-Wed at 22:00
}

Schedule {
  Name = "Hourly"
  Run = Level=Incremental hourly at 0:30
}

# JOB DEFINITIONS -------------------------------------------------------------
# Create the types of jobs we need to run.

# Backup the catalog database (after the nightly save)
Job {
  Name = "BackupCatalog"
  Type = Backup
  Client = bareos.server.fqdn
  FileSet="Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  Storage = "bareos.server.fqdn:storage:BackupCatalog"
  Messages = "bareos.server.fqdn:messages:standard"
  Pool = "bareos.server.fqdn:pool:catalog"
  # This creates an ASCII copy of the catalog
  RunBeforeJob = "/usr/lib/bareos/scripts//make_catalog_backup.pl bareos.server.fqdn:postgresql"
  # This deletes the copy of the catalog
  RunAfterJob = "/usr/lib/bareos/scripts//delete_catalog_backup"
  Write Bootstrap = "/mnt/bareos/bootstraps/BackupCatalog.bsr"
  # Run after main backup
  Priority = 50
# This doesn't seem to be working correctly, so it has been removed.
# RunScript {
# RunsWhen=After
# RunsOnClient=No
# Console = "purge volume action=all allpools storage=File"
# }
}

# Create a standard profile for all normal servers
JobDefs {
  Name = "Basic:noHome:onMonday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:noHome"
  Schedule = "Weekly:onMonday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:withHome:onMonday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:withHome"
  Schedule = "Weekly:onMonday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:noHome:onTuesday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:noHome"
  Schedule = "Weekly:onTuesday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:withHome:onTuesday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:withHome"
  Schedule = "Weekly:onTuesday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:noHome:onWednesday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:noHome"
  Schedule = "Weekly:onWednesday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:withHome:onWednesday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:withHome"
  Schedule = "Weekly:onWednesday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:noHome:onThursday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:noHome"
  Schedule = "Weekly:onThursday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:withHome:onThursday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:withHome"
  Schedule = "Weekly:onThursday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:noHome:onFriday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:noHome"
  Schedule = "Weekly:onFriday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:withHome:onFriday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:withHome"
  Schedule = "Weekly:onFriday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:noHome:onSaturday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:noHome"
  Schedule = "Weekly:onSaturday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:withHome:onSaturday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:withHome"
  Schedule = "Weekly:onSaturday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:noHome:onSunday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:noHome"
  Schedule = "Weekly:onSunday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}
JobDefs {
  Name = "Basic:withHome:onSunday"
  Type = Backup
  Level = Incremental
  FileSet = "Basic:withHome"
  Schedule = "Weekly:onSunday"
  Messages = "bareos.server.fqdn:messages:standard"
  # Set the job to work as standard with the default Pool & Storage
  # (this will be overridden by the Job configuration for each Client)
  Storage = "bareos.server.fqdn:storage:default"
  Pool = "bareos.server.fqdn:pool:default"
  Write Bootstrap = "/mnt/bareos/bootstraps/%c.bsr"
  Priority = 15
  # Define how long any of these jobs are allowed to run for before we should
  # kill them. Note that this is the run time (how long the actual backup is
  # running for after starting, and not a maximum time after it was scheduled)
  Full Max Run Time = 36 Hours
  Differential Max Run Time = 6 Hours
  Incremental Max Run Time = 6 Hours
}

# Finally, bring in all the additional pieces of configuration from the
# different servers for which this Director was configured to manage
@|"sh -c 'for f in /etc/bareos/bareos-dir.d/*.conf ; do echo @${f} ; done'"

* Logs/Dumps

top - 13:36:09 up 48 days, 5:30, 3 users, load average: 75.72, 55.57, 41.56
Tasks: 994 total, 1 running, 992 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.3%us, 0.2%sy, 0.0%ni, 99.1%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16315864k total, 16191248k used, 124616k free, 2524k buffers
Swap: 4194296k total, 4165692k used, 28604k free, 118164k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19298 root 20 0 26568 2028 984 R 0.7 0.0 0:00.05 top
  107 root 20 0 0 0 0 S 0.2 0.0 2:23.14 kblockd/1
  179 root 20 0 0 0 0 D 0.2 0.0 16:49.87 kswapd1
  234 root 39 19 0 0 0 S 0.2 0.0 70:41.91 kipmi0
 6549 root 20 0 3222m 1.6g 212 D 0.2 10.5 0:13.18 bconsole <<< Look at RES
 6638 root 20 0 3226m 1.6g 128 D 0.2 10.2 0:12.90 bconsole <<<
15873 root 20 0 2529m 568m 128 D 0.2 3.6 0:02.99 bconsole <<<
15879 root 20 0 2529m 579m 128 D 0.2 3.6 0:03.06 bconsole ..
15967 root 20 0 2529m 560m 128 D 0.2 3.5 0:02.88 bconsole .
15981 root 20 0 2529m 558m 128 D 0.2 3.5 0:02.69 bconsole
15990 root 20 0 2529m 570m 128 D 0.2 3.6 0:02.79 bconsole
16050 root 20 0 2529m 572m 128 D 0.2 3.6 0:02.73 bconsole
19374 root 20 0 151m 2204 1716 D 0.2 0.0 0:00.01 sudo
    1 root 20 0 21304 228 88 S 0.0 0.0 2:50.25 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
[..]


Process 26880 attached - interrupt to quit
read(3, ^C <unfinished ...>
Process 26880 detached
[root@ldr-bacula1.begen.iseek.com.au:~]# strace -ff -p 26880
Process 26880 attached with 3 threads - interrupt to quit
[pid 26882] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 26880] read(3, <unfinished ...>
[pid 26881] restart_syscall(<... resuming interrupted call ...>^C <unfinished ...>
Process 26880 detached
Process 26881 detached
Process 26882 detached
[root@ldr-bacula1.begen.iseek.com.au:~]# strace -ff -p 26880
Process 26880 attached with 3 threads - interrupt to quit
[pid 26882] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 26881] restart_syscall(<... resuming interrupted call ...> <unfinished ...> <<<<<<<<<<<<<<<<< UNTIL here it was slow, futex ? timeout ??
[pid 26880] read(3,
 <unfinished ...>
[pid 26881] <... restart_syscall resumed> ) = 0
[pid 26881] nanosleep({30, 0}, <unfinished ...>
[pid 26882] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 26882] futex(0x7f5746f56d80, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 26882] futex(0x7f5746f56dc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 7, {1385943234, 724898000}, ffffffff <unfinished ...>
[pid 26881] <... nanosleep resumed> NULL) = 0
[pid 26881] nanosleep({30, 0}, NULL) = 0
[pid 26881] nanosleep({30, 0}, <unfinished ...>
[pid 26882] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 26882] futex(0x7f5746f56d80, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 26882] futex(0x7f5746f56dc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 9, {1385943294, 725191000}, ffffffff <unfinished ...>
[pid 26880] <... read resumed> "\0\0\0S", 4) = 4
[pid 26880] read(3, "auth cram-md5 <551356239.1385943"..., 83) = 83
[pid 26880] write(3, "\0\0\0\0279+/kl9/uOk/0Mx/AO0REED\0", 27) = 27
[pid 26880] poll([{fd=3, events=POLLIN}], 1, 180000) = 1 ([{fd=3, revents=POLLIN}])
[pid 26880] read(3, "\0\0\0\r", 4) = 4
[pid 26880] read(3, "1000 OK auth\n", 13) = 13
[pid 26880] uname({sys="Linux", node="ldr-bacula1.begen.iseek.com.au", ...}) = 0
[pid 26880] write(3, "\0\0\0004auth cram-md5 <327659900.138"..., 56) = 56
[pid 26880] poll([{fd=3, events=POLLIN}], 1, 180000) = 1 ([{fd=3, revents=POLLIN}])
[pid 26880] read(3, "\0\0\0\27", 4) = 4
[pid 26880] read(3, "F749NyhRlQ/sFi+2eA+tlC\0", 23) = 23
[pid 26880] write(3, "\0\0\0\r1000 OK auth\n", 17) = 17
[pid 26880] read(3, "\0\0\0T", 4) = 4
[pid 26880] read(3, "1000 OK: ldr-bacula1.begen.iseek"..., 84) = 84
[pid 26880] futex(0x7f5746f56dc4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5746f56dc0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 26880] nanosleep({0, 100000}, <unfinished ...>
[pid 26882] <... futex resumed> ) = 0
[pid 26882] futex(0x7f5746f56d80, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 26880] <... nanosleep resumed> NULL) = 0
[pid 26882] futex(0x7f5746f56dc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 11, {1385943298, 390011000}, ffffffff <unfinished ...>
[pid 26880] write(1, "Enter a period to cancel a comma"..., 36) = 36
Tags: No tags attached.

Activities

TomWork

2013-12-05 02:20

reporter   ~0000745

I forgot to say that when the system is stable it takes no time to display a job list. Memory footprint is probably very low as well.

real 0m0.062s
user 0m0.010s
sys 0m0.010s
mvwieringen

2013-12-05 13:35

developer   ~0000746

You have the following in your director config:

Maximum Concurrent Jobs = 20

That means you cannot run more than 20 jobs concurrently; keep in mind that
every bconsole session is a new Job too. There is a work queue in the director
that is sized by the above setting, which means that once it has reached
20 jobs it will only accept a new connection after another one has finished
(the directive is shown again below for reference).
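
For reference, this is the directive in the posted Director resource; the
larger value is purely illustrative, not a recommendation:

Director {
  Name = "bareos.server.fqdn:director"
  # ... other directives as posted ...
  Maximum Concurrent Jobs = 60   # sized for scheduled jobs plus console sessions
}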

First of all, I think it's kind of useless to query your system like this,
and especially by using so many bconsole sessions. Keep in mind that bconsole
is a hollow program: it does nothing more than display the user agent that the
director runs. So you are probably just blowing up the director, and as it
stops accepting connections that seems to trigger something in bconsole that
starts leaking memory in a frantic way.

I think it makes much more sense to use NSCA probes in Nagios for this
kind of monitoring, which turns the whole thing into something passive:
when the NSCA probe doesn't get the confirmation that the backup ran, it
will trigger an error in your monitoring.

As to why bconsole blows up: no idea, but I think the code is not very
robust when it comes to handling these resource-starvation errors. You
could do a kill -SEGV on one of the processes so it invokes the traceback
handler, but for that to have any info you need to install the debug rpms,
and I don't know exactly how that works on centos.
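
Roughly (<bconsole_pid> is a placeholder; this assumes core dumps are enabled
for the process, and where the core lands depends on kernel.core_pattern):

# force the traceback handler / a core dump of one bloated bconsole
kill -SEGV <bconsole_pid>
# with bareos-debuginfo installed, inspect the result
gdb /usr/sbin/bconsole /path/to/core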
TomWork

2013-12-06 09:03

reporter   ~0000747

1. I didn't know bconsole was a Job.

2. Yes, I agree that the current monitoring system is brain dead. It has to be redone. However, since you seem to know this area: what kind of passive check would you run?

3. For the kill -SEGV, are you talking about debug symbols for bareos? I can install bareos-debuginfo.x86_64. Just to be sure, I assume I will have to kill -SEGV `pidof bconsole` and not the director, right?
mvwieringen

2013-12-06 10:02

developer   ~0000748

Yes, you need to install the debug symbols, otherwise there is not much to
debug with in the created core. It should also create a proper core file
so we can do some postmortem debugging. And yes, it has to be the bconsole,
as you showed that to be what uses the large amount of memory. I think the dir
is fine as it won't run more than 20 jobs concurrently anyway.

As to the passive check: how about creating one in Nagios and then, in the
post-backup script, posting via NSCA to Nagios that the Job succeeded?
When you put a freshness threshold on such a passive check, it will
automatically trigger if your backups don't work for a longer time; e.g. if
you expect a backup every day, a freshness of 25 hours or so should work
(a sketch of such a hook follows below). It's been quite some time since I
used Nagios myself, but ages ago that was a nice and elegant solution.
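
A minimal sketch of such a hook, assuming send_nsca from the NSCA package is
installed and configured, and with hypothetical host/service names; it could
be wired in via a Run After Job directive on each backup job:

#!/bin/sh
# submit a passive check result to Nagios after a successful Bareos job;
# send_nsca expects: host<TAB>service<TAB>return-code<TAB>plugin-output
printf '%s\t%s\t%s\t%s\n' "$(hostname -f)" "bareos_backup" 0 "OK: backup job finished" \
  | /usr/sbin/send_nsca -H nagios.server.fqdn -c /etc/nagios/send_nsca.cfg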
mvwieringen

2013-12-06 10:05

developer   ~0000749

If you can capture a core file we can see how to analyze it and get an
idea of what is eating all the memory.
TomWork

2013-12-09 03:20

reporter   ~0000751

Hi,

I may be wrong, but I believe bconsole.debug is trying to execute something it cannot find, because everything is deployed under a /usr/lib/debug prefix. Is that something you intended? I am wondering because I believe you wanted to deploy it from /, hence the .debug suffix on all the binaries. If I have more time I will try to see what it is trying to start. Obviously, for the moment I cannot get a coredump :)

# rpm -q bareos-debuginfo
bareos-debuginfo-13.2.1-81.1.el6.x86_64

# rpm -ql bareos-debuginfo |grep bconsole
/usr/lib/debug/usr/sbin/bconsole.debug

# ls -l /usr/lib/debug/usr/sbin/bconsole.debug
-r-xr-xr-x 1 root root 86072 Sep 9 22:56 /usr/lib/debug/usr/sbin/bconsole.debug

[/usr/lib/debug/usr/sbin]# /usr/lib/debug/usr/sbin/bconsole.debug
-bash: /usr/lib/debug/usr/sbin/bconsole.debug: bad ELF interpreter: No such file or directory

# strace /usr/lib/debug/usr/sbin/bconsole.debug
execve("/usr/lib/debug/usr/sbin/bconsole.debug", ["/usr/lib/debug/usr/sbin/bconsole"...], [/* 21 vars */]) = -1 ENOENT (No such file or directory)
dup(2) = 3
fcntl(3, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
fstat(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0408a01000
lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
write(3, "strace: exec: No such file or di"..., 40strace: exec: No such file or directory
) = 40
close(3) = 0
munmap(0x7f0408a01000, 4096) = 0
exit_group(1) = ?

# ls -l /usr/lib/debug/usr/sbin/bconsole.debug
-r-xr-xr-x 1 root root 86072 Sep 9 22:56 bconsole.debug

# ls -l /usr/sbin/bconsole
-rwxr-xr-x 1 root root 37216 Sep 9 22:56 /usr/sbin/bconsole

# uname -r
2.6.32-358.23.2.el6.x86_64
TomWork

2013-12-11 08:05

reporter   ~0000758

Re,

I don't know why bconsole.debug is not working :/

# ls -l /usr/sbin/bconsole.debug
-r-xr-xr-x 1 root root 86072 Dec 11 16:57 /usr/sbin/bconsole.debug

# file /usr/sbin/bconsole*
/usr/sbin/bconsole: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped
/usr/sbin/bconsole.debug: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, not stripped

# bconsole.debug
-bash: /usr/sbin/bconsole.debug: bad ELF interpreter: No such file or directory

# strace -s 1024 bconsole.debug
execve("/usr/sbin/bconsole.debug", ["bconsole.debug"], [/* 21 vars */]) = -1 ENOENT (No such file or directory) <<<<<<<<<<<<<<<<< WHY ??
dup(2) = 3
fcntl(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
fstat(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f95647bb000
lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
write(3, "strace: exec: No such file or directory\n", 40strace: exec: No such file or directory
) = 40
close(3) = 0
munmap(0x7f95647bb000, 4096) = 0
exit_group(1) = ?

# strace -s 1024 bconsole 2>&1|head
execve("/usr/sbin/bconsole", ["bconsole"], [/* 21 vars */]) = 0 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< no problem here.
brk(0) = 0x1c89000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa934130000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/x86_64/libreadline.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/tls/x86_64", 0x7ffff42565c0) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/libreadline.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/tls", {st_mode=S_IFDIR|0555, st_size=4096, ...}) = 0
open("/usr/lib64/x86_64/libreadline.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/x86_64", 0x7ffff42565c0) = -1 ENOENT (No such file or directory)
mvwieringen

2013-12-12 12:16

developer   ~0000762

Maybe you can look at this:

http://le-huy.blogspot.de/2011/01/using-debuginfo-packages-in-redhat.html

but it seems the debug file is nothing more than a so-called symbol table,
i.e. normally the binaries are stripped, so when you start gdb on them it
cannot really show any symbols in your binary, only some decoded assembly.

So you should just start the normal binary and get it to crash when it is so
big. It should then create a corefile; put gdb on that and it will show the
stacktrace using the symbol table in the .debug file (see below).
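
Roughly (the core path is a placeholder; on el6, gdb normally picks up
/usr/lib/debug/usr/sbin/bconsole.debug automatically once bareos-debuginfo
is installed, but it can also be loaded by hand):

gdb /usr/sbin/bconsole /path/to/core
(gdb) symbol-file /usr/lib/debug/usr/sbin/bconsole.debug
(gdb) thread apply all bt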
TomWork

2013-12-24 03:01

reporter   ~0000769

Hi,

I finally have time to look into it again. I have restarted nrpe, so a lot of bconsole processes should be launched again. However it has not crashed or increased the load yet. That could be related to the fact that I deleted a lot of volumes in the Catalog and on disk: I have 9228 media left and I removed 6659 media today. We will see how it goes, but could this be related to the number of rows in the media table, maybe? Does that sound possible?

I will keep you posted if I can reproduce the issue again.
TomWork

2013-12-26 13:08

reporter   ~0000770

Last edited: 2013-12-26 13:09

It finally happened again. I have a core file for:
 
 5557 root 20 0 1716m 1.1g 140 D 0.3 7.1 0:08.91 bconsole

You can download a 1.5GB core file via http://bogan.dmw.fr/~tom/bconsole.core.5557

Also note that my bareos-dir is currently using 4GB and I cannot connect with bconsole, but that could be because it's running our backup jobs at the moment.

14765 bareos 20 0 4368m 8468 776 S 0.0 0.1 13:20.46 bareos-dir

mvwieringen

2013-12-26 17:13

developer   ~0000771

Some first impressions:

- Compress the core file next time with bzip2 ==> 1.5 Gb => 83 Kb.
- It seems from the above that there is really a lot of data that compresses
  like hell. Doing a fast strings -a on the core already reveals one
  possible problem ==> /root/.bconsole_history, i.e. the readline history
  file is going out of control; that might account for quite some of the
  1.5 Gb of ram. Just remove it, as I guess it's not usable anyway (a
  one-liner for that is given below).
- You seem to use the world-famous one-volume-per-job scheme, which is known
  to scale poorly with both Bacula and Bareos.
- The last thing you say is that your dir does not respond; that is
  more or less the same problem as why all the NRPE-spawned bconsoles
  hang, namely that the director doesn't allow enough concurrent jobs
  (as every console is a job in the director).

So first, for analysis, drop the history file, and then start a complete
redesign of your backup environment; this is never going to work, you are
just way outside the design envelope of both Bacula and Bareos.

Maybe there is also a bug somewhere, but with these vast amounts of data it's
unlikely we will ever find most of it. I will see if I can get any info from
the core, but I would be interested to know whether, once you remove the
history file, a single bconsole uses a somewhat reasonable amount of memory.
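
The cleanup itself is a one-liner (path as seen in the strings output above):

# empty the runaway readline history that bconsole keeps appending to
: > /root/.bconsole_history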
TomWork

2013-12-27 03:07

reporter   ~0000772

Hi Marco,

My apologies for the big corefile; it didn't cross my mind that this thing would compress that much, or compress at all. Sorry.

About the .bconsole_history: I emptied it. I will check whether the problem is still as bad. Thank you for looking into the corefile. I quickly ran a sort | uniq on my bconsole history and indeed there is a lot of crap in there, probably due to check_bacula, but a lot of it didn't make much sense. Anyway, now it's empty and I will keep an eye on it.

About the redesign, that's rather unfortunate. Could you explain why this is not going to work?

We do not use tapes at all. I find the idea of one volume per job very simple to understand and very simple to manage. I can understand that Bareos/bacula is not designed for that, but I didn't know it; maybe you should state that somewhere. It is even explained in the Bacula documentation how to do it, IIRC. How could we guess as users that Bacula/Bareos is not designed for that? I am sure a lot of people are using a volume per job because it's simple and because it's now published everywhere on the Internet (blogs, articles, howtos, etc). If you believe this is a wrong design, you should write about it - if time allows - to explain why, and what the other solutions are. Otherwise it will get worse and worse. Tapes are not dead, but a lot of people back up D2D(2D), so the concepts of volume recycling, fixed sizes, etc. are maybe obsolete in that context, IMO.

At the moment, I am disappointed that Bareos won't meet my requirements. We moved from our old Bacula to Bareos with this design in mind, and now we have migrated more than 200 servers to it. I am not going to roll back now, but I will plan another migration based on your advice, or with another software (I do NOT mean to offend here). At the moment it works OK if I remove our silly monitoring, which I agree is badly designed. RunJob scripts + NSCA are perfect for that, thanks for the hint.

Please keep me posted about the design: what's wrong and how you would do it. Thank you.
mvwieringen

2013-12-27 10:13

developer   ~0000773

The console history had a million entries about deleting volumes,
which triggered me to think you were using one volume per job.
Could it be that there is a permission problem on the bconsole history?
Normally the history should have a maximum of 200 entries and it's truncated
by the code, but that doesn't seem to work.

The problem with one volume per job is that it doesn't scale. The queries
for volumes are sized for a tape-based system, i.e. a limited number of
volumes.

With one volume per job the number of volumes explodes (it works nicely
for small setups, but I wonder how well it will work for larger setups).
Also, you create new volumes all the time, and that means the mediaid will
increase all the time; that will end eventually (maybe not soon, but that
depends a bit on how many volumes you create a day).

About the documentation, that is work in progress; as you know we got it
by forking bacula, and there is quite some work to be done there.

Also, bacula and bareos both have the same problem with a large number of
volumes, so in that sense bacula would run into the same problems eventually.

My preferred setup is using fixed-size virtual tapes, keeping them in a pool
with automatic recycling, and then using copy jobs to tape. We also changed
the default config to use some saner defaults. The problem with fixed-size
volumes is however the chance of things becoming corrupt, and that is
something I have seen as a serious problem on Linux. I have been spoiled in
that sense on Solaris with ZFS, where that never happens anymore. Maybe in
the future, when BTRFS stabilizes or ZFS becomes a real option, that will
also be solved.

My first change would be to use one volume per day; that limits the risk
of corruption to one day, and it keeps you from creating a million volumes
a day (a sketch of such a pool is below).
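
A minimal sketch of what a daily-volume pool could look like, reusing the
label-format idiom from the config above (the name, retention, and recycling
policy are illustrative, not a recommendation):

Pool {
  Name = "bareos.server.fqdn:pool:daily"
  Pool Type = Backup
  # One volume per calendar day
  Label Format = "daily.${Year}${Month:p/2/0/r}${Day:p/2/0/r}"
  # Stop writing to a volume roughly a day after first use
  Volume Use Duration = 23 Hours
  Recycle = Yes
  Auto Prune = Yes
  Volume Retention = 40 Days
}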

About the real problem in the core: I guess it just crashes in the memory
allocator as it ran out of memory. The smartalloc stuff from bacula will
do a null pointer dereference as a way to force a core dump (a bit like an
assert).

Other than that, if the design you now have makes you happy you may want
to keep it, but I think it will break eventually.
TomWork

2013-12-30 07:10

reporter   ~0000775

What's the biggest number for a media/volume in the code?
With PostgreSQL, I can see it's an integer, so 2147483647 [1].

Ideally, if we can, it would be good to change it to a bigserial or a bigint, and in the code to an unsigned long ;) However, at the moment I am creating a bit fewer than 300 volumes a day. If the maximum is only the pgsql int type, i.e. 2147483647, and not something in the bareos code, then I still have about 19611 years in front of me at a rate of 300 volumes a day, minus the mediaIDs I have already used (quick check below).
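
Back-of-the-envelope, with 365-day years:

$ echo $(( 2147483647 / 300 / 365 ))
19611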

About the doco, I fully understand. I was not whinging about it.

About limitations: I know that Bareos's limitations come from Bacula. My point about changing software would be to find something else that allows me to have one volume per job. What everybody wants is backup concurrency (many SDs, since otherwise you cannot write to the same volume, IIRC) and a code design not tied to the tape model. I also understand that there is a lot of work involved if you guys want to move in that direction. IMO, keeping a design based on tapes would be a mistake. Tapes will still exist but are less important these days. Backups have to be fast and reliable, and should auto-clean on disk once the retention makes them obsolete, to free space for the new backups coming in.

I will think about your 1 volume per day. I am not sure it will suit our needs but it's a compromise.

About the issue: I don't know why .bconsole_history was clobbered with a lot of deletes. What I can tell you is that the .bconsole_history file was probably much longer than 200 lines: I stopped my sort | uniq -c after 5 min, and I believe sorting 200 lines should take less than 1 sec on my server. However, interestingly enough, at the moment .bconsole_history is exactly 200 lines long.

# wc -l .bconsole_history
200 .bconsole_history

Weird...

Anyway, if you don't hear from me within 3 weeks you can resolve this ticket: either the problem stopped, or I finally moved to nagios passive checks. I will do my best to either ask you to resolve the ticket or give you new info about the issue.

Thanks for your support.

[1] : http://www.postgresql.org/docs/9.2/static/datatype-numeric.html

Issue History

Date Modified Username Field Change
2013-12-05 02:15 TomWork New Issue
2013-12-05 02:20 TomWork Note Added: 0000745
2013-12-05 13:35 mvwieringen Note Added: 0000746
2013-12-06 09:03 TomWork Note Added: 0000747
2013-12-06 10:02 mvwieringen Note Added: 0000748
2013-12-06 10:05 mvwieringen Note Added: 0000749
2013-12-06 10:05 mvwieringen Assigned To => mvwieringen
2013-12-06 10:05 mvwieringen Status new => feedback
2013-12-09 03:20 TomWork Note Added: 0000751
2013-12-09 03:20 TomWork Status feedback => assigned
2013-12-11 08:05 TomWork Note Added: 0000758
2013-12-12 12:16 mvwieringen Note Added: 0000762
2013-12-24 03:01 TomWork Note Added: 0000769
2013-12-26 13:08 TomWork Note Added: 0000770
2013-12-26 13:09 TomWork Note Edited: 0000770
2013-12-26 17:13 mvwieringen Note Added: 0000771
2013-12-26 17:14 mvwieringen Status assigned => feedback
2013-12-27 03:07 TomWork Note Added: 0000772
2013-12-27 03:07 TomWork Status feedback => assigned
2013-12-27 10:13 mvwieringen Note Added: 0000773
2013-12-27 10:15 mvwieringen Status assigned => feedback
2013-12-30 07:10 TomWork Note Added: 0000775
2013-12-30 07:10 TomWork Status feedback => assigned
2014-01-23 09:20 mvwieringen Assigned To mvwieringen =>
2014-01-23 09:20 mvwieringen Status assigned => closed
2014-01-23 09:20 mvwieringen Resolution open => no change required