View Issue Details

ID: 0000600
Project: bareos-core
Category: webui
View Status: public
Last Update: 2016-04-25 17:15
Reporter: xyros
Assigned To:
Priority: normal
Severity: major
Reproducibility: always
Status: closed
Resolution: no change required
Platform: Linux
OS: Ubuntu
OS Version: 14.04
Product Version: 15.2.2
Summary: 0000600: Webui unresponsive indefinitely, and PostgreSQL at 100% CPU for 20+ minutes, on Restore page with 589K+ files
Description: When I select the Restore page, the first client loaded (based on alphabetic order) contains an Ubuntu user home directory backup with the following characteristics:

59 GB backup (20 GB LZ4 compressed)
65846 folders
589227 files

Bareos webui immediately becomes unresponsive (indefinitely) and PostgreSQL consumes 100% CPU for about 20 minutes. By the time PostgreSQL has settled down again, the webui has timed out, and therefore a new login is required. If one allows the process to complete, the Restore page is subsequently usable and the client's files are shown -- including after a system reboot.

Performing another full backup on the same client causes the time PostgreSQL remains at 100% CPU to increase to about 40 minutes. Another full backup with the same client increases this to about 60 minutes. Note: purging that client's files (using bconsole) prior to the next full backup makes no difference to the time it takes for the processing to complete and the high CPU load to die down. However, if the purge is performed before the Restore page triggers parsing, the page loads immediately and the client's folder structure is browsable (actual file content obviously missing). This high CPU issue happens both with PostgreSQL using default settings in postgresql.conf and with PostgreSQL tuned using pgtune.

I then performed a final test by rerunning a full backup for a client that has a very small number of files. Restore page responsiveness was immediate. Therefore, this issue appears to happen only for backups that have a large number of files; and the effect is multiplied by the number of full backups that exist for that client.

I tried looking at the /usr/share/bareos-webui/public/js/jquery.dataTables.js file to see if there was something obvious I could change, but I don't understand the code. It does, however, seem (if I'm reading the code right) as though server-side processing is not being used. The DataTables documentation, at the following URL, recommends enabling server-side processing mode for processing large data sets: https://datatables.net/reference/option/serverSide Given how unresponsive the browser becomes, I'm assuming client-side processing is what's currently implemented.

If you need additional information, please let me know. I can't provide access to the actual systems; however, if you have difficulty reproducing, I may be able to write a bash script that analyzes the directory and file structure and creates a script you can run to build an approximate representation of the directory tree -- with random data and filenames.

Thanks
Steps To Reproduce: 1. Perform a full backup of a Linux directory with the following approximate characteristics:

59 GB backup (20 GB LZ4 compressed)
65846 folders
589227 files

2. Once the backup completes, log in to the webui, select the Restore page/tab, and select the client you backed up in Step 1.
Tags: No tags attached.

Activities

reddrop

2016-01-18 02:09

reporter   ~0002099

I can confirm I am also seeing the same issue with MariaDB 5.5 and the following
Bareos versions on CentOS 7:

bareos.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-bat.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-bconsole.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-client.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-common.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-database-common.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-database-mysql.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-database-tools.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-director.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-filedaemon.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-storage.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-storage-tape.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2
bareos-webui.noarch 15.2.2-41.1.el7 @bareos_bareos-15.2

hege

2016-01-18 11:45

reporter   ~0002100

Last edited: 2016-01-18 11:45

Run this from cron every morning, or script it to run after backups, etc.:

echo '.bvfs_update' | bconsole

If you never manually run an update on the path hierarchy cache, it will take forever the first time you run a large restore job, since the restore runs that exact same command first.

Google more for bvfs_update..
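
hege's suggestion could look like this as a system crontab entry (a sketch only; the bconsole path and the 04:00 schedule are illustrative assumptions, pick a time after your nightly backups finish):

```
# /etc/cron.d/bareos-bvfs-update -- refresh the bvfs path hierarchy
# cache every morning, once the nightly backups have completed
0 4 * * * root echo '.bvfs_update' | /usr/sbin/bconsole > /dev/null 2>&1
```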

reddrop

2016-01-19 00:48

reporter   ~0002102

Hi I ran,

echo '.bvfs_update' | bconsole

It made no difference to the timeout in the restore pane of the webui.

Additionally, this is the error message I get emailed once the restore does time out.

19-Jan 10:43 sydplbak001-dir: ERROR in output_formatter.c:633 Failed to send result as json. Maybe result message to long?
Message length = 323

xyros

2016-01-19 02:22

reporter   ~0002103

@reddrop: hege is not suggesting that it would be faster. Rather, they are suggesting automating that command to run after backups complete, so that the time-consuming work is done before you need to attempt a restore.

It is an interesting workaround but does not resolve the actual issue.

hege

2016-01-19 08:12

reporter   ~0002104

Last edited: 2016-01-19 08:15

You can easily verify from the database that it's doing bvfs_update if you look at the active SQL queries: it's doing inserts into the PathVisibility table.

I don't know why this small speedup patch has not yet been applied from master to the 15.2 branch; it helps with bvfs_update speed. I've had it patched into my 15.2 for a long time:
https://github.com/bareos/bareos/commit/df52b6735cd39acf01b13a22f52e0287a76bb926

My suggestion resolves the problems you have if PathVisibility is the culprit, which is almost certainly the case. I don't consider it a "workaround": it does exactly the same thing as the webui, updating the cache beforehand so restores are possible, but by running it ahead of time you don't need to wait long for the first directory listing. Bareos should document this better and update the cache automatically, or optimize the whole thing another way.

reddrop

2016-01-19 09:03

reporter   ~0002105

I don't think I was clear enough in my previous post:

I ran echo '.bvfs_update' | bconsole several times, and my webui still times out when trying to list the files for selection.

The query being run while the webui spins forever and says loading is:
SELECT 'D', tmp.PathId, 0, tmp.Path, JobId, LStat, FileId FROM (SELECT PPathId AS PathId, '..' AS Pa...

frank

2016-01-19 14:58

developer   ~0002106

Last edited: 2016-01-19 15:03

So, let's break it down step by step to track down the culprit.


What the webui basically does first when calling the restore module is the following:


1. Switch into api 2 mode (json)

.api 2

2. Get all needed job ids for a normal complete restore. With the "all" option,
all defined Filesets will be used.

.bvfs_get_jobids jobid=10 all

3. Compute the directory cache for the jobs you got from previous command

.bvfs_update jobid=1,2,3,10

4. Get the root or Windows drives for representing our first level of the tree.

.bvfs_lsdirs jobid=1,2,3,10 path=


Every subtree/node underneath toplevel is lazy loaded by the use of .bvfs_lsdirs and .bvfs_lsfiles.

Please execute the commands above in bconsole, verify whether you encounter any problems during the initial
construction of the tree in your environment, and let us know. You can also run those commands in api mode 0 to
look for any differences.

In addition you can try to list directories and files with lots of content in api mode 2 to look for any anomalies.

This way we can better determine where to look further: director/db or webui.


More details of the bvfs api are documented here:

http://doc.bareos.org/master/html/bareos-developer-guide.html#sec:bvfs
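
The four steps above lend themselves to a stopwatch test from the shell. A minimal sketch (assumes a configured bconsole in the PATH; the jobid values are the placeholders from the steps, not real job ids):

```shell
#!/bin/sh
# Rough per-step timing for the bvfs commands listed above.
# BCONSOLE can be overridden, e.g. to point at a non-default binary.
BCONSOLE="${BCONSOLE:-bconsole}"

# Pipe a single bvfs command into bconsole and report how long it took.
run_step() {
    desc=$1
    cmd=$2
    start=$(date +%s)
    printf '%s\n' "$cmd" | $BCONSOLE > /dev/null
    end=$(date +%s)
    echo "$desc took $((end - start))s"
}

# Usage, mirroring steps 2-4 (step 1, ".api 2", only changes the output format):
#   run_step "bvfs_get_jobids" ".bvfs_get_jobids jobid=10 all"
#   run_step "bvfs_update"     ".bvfs_update jobid=1,2,3,10"
#   run_step "bvfs_lsdirs"     ".bvfs_lsdirs jobid=1,2,3,10 path="
```

Comparing the per-step times shows directly which of the bvfs calls dominates.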

xyros

2016-01-19 17:08

reporter   ~0002107

@frank: The patch hege linked to would fix the primary cause of this issue. The person who wrote that patch says it took his update from 17-18 hours down to only 17 minutes, for a dataset of more than 1 million files. For convenience, the related post is here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=737821

The remaining issue would then be related to the webui hanging during processing.

As for your instructions above, I'll give it a try if I can figure out how to use the api 2 mode you mentioned.

Thanks

mvwieringen

2016-01-19 17:24

developer   ~0002108

The linked patch was for MySQL, so I'm not sure such drastic performance
improvements will be seen with PostgreSQL, which the original reporter of the bug
says he is using. The reason it's only in master is that we haven't had a
chance to fully test what side effects it has on other databases and workloads.
Most people also need to tune their database, both PostgreSQL and any MySQL
derivative, as some problems occur simply because the query analyzer picks a
non-optimal query plan.

xyros

2016-01-19 18:29

reporter   ~0002109

OK. I've run the commands frank indicated; however, I have no idea what I'm looking at or how to tell whether the output is correct. Do you want me to post the output of a run against the Linux client with a large number of files (assuming it contains no sensitive information)?

mvwieringen

2016-01-19 18:38

developer   ~0002110

We are not so interested in the actual output, more in the response to certain
commands and how fast they respond, because I get the feeling it's more
a tuning problem than anything else. So it's probably more a stopwatch test
to see how long the different steps take and where the bottleneck is, if
there is one. If we don't find any, we can always zoom in on particular data,
but for a first test that is not strictly needed. It would be interesting,
however, if some error is posted, to have that data. By the way, for everything,
only attach data to the bug so we can also remove it after processing.

xyros

2016-01-19 19:45

reporter   ~0002111

Here are my results (based on the steps frank posted):

Step 1: immediate
Step 2: approx 1/4 seconds
Step 3: 29:29 minutes
Step 4: immediate

Note: My postgresql install is already tuned, as I noted in the root post.

mvwieringen

2016-01-19 20:08

developer   ~0002112

OK, so it spends its time in the update_path_hierarchy_cache() function,
so this has little to do with the webui. You could try the experimental
master director with the changed query, or run the current director at
debug level 500, which prints the queries it fires against the
database, so we can see exactly what query it is. Another option would be to
look in PostgreSQL and see what query it's executing.
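
Both suggestions can be sketched as one-liners (assuming bconsole's standard setdebug command and PostgreSQL 9.2+, where pg_stat_activity exposes query and state columns; "bareos" as the catalog database name is an assumption):

```
# raise the director debug level so it logs the queries it fires:
echo 'setdebug level=500 dir' | bconsole

# or ask PostgreSQL directly which statement is currently running:
psql bareos -c "SELECT query FROM pg_stat_activity WHERE state = 'active';"
```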

xyros

2016-01-19 20:10

reporter   ~0002113

Minor typo. Step 3 should have read: 23:29 minutes

xyros

2016-01-19 20:25

reporter   ~0002114

Is there a way to delete that path hierarchy cache so I can perform the additional tests you've requested, without having to perform a new full backup and compound the processing time?

One other question: where do I get the experimental master director? I installed Bareos from the Ubuntu PPA distro.

mvwieringen

2016-01-19 20:37

developer   ~0002115

The cache is just a cache, so you don't have to do a new backup. It only
exists to speed things up, which in your situation seems instead to slow
the process down.

.bvfs_clear_cache clears the cache, i.e. truncates the meta table.

regarding experimental packages we publish those on

http://download.bareos.org/bareos/experimental/nightly/

There are packages there for most, if not all, Linux distributions.
They are built from the current master branch and, after an internal function
test, published during the night around 04:30 CET.

As the database schema is still the same, it should just work. However, if you
have the storage and file daemons on the same machine, they also need to be upgraded,
as the shared libs are numbered by version.

xyros

2016-01-19 21:24

reporter   ~0002116

Thanks. I'll test and post my results sometime between now and tomorrow morning.

xyros

2016-01-20 18:29

reporter   ~0002130

Some good news:

It was my fault. I ran pgtune and edited my postgresql.conf, but forgot the "MB" after "wal_buffers = 16" and also forgot to apply the "effective_cache_size" setting. I discovered my error while building a VM to run the experimental build and conduct the tests.

With those fixed, the .bvfs_update step completes in 30 seconds on both the production server and the VM I had set up for testing. After clearing the cache and choosing the client from the Restore page of the webui, it completes in 35 seconds -- so only marginally slower from the webui.

The net result is that on a properly tuned PostgreSQL, performance is reasonable, and there is no perceptible performance difference between the current release and the experimental builds.
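
For reference, the corrected lines look roughly like this (the values are illustrative for one machine, not recommendations; note that without a unit suffix, PostgreSQL reads wal_buffers in WAL pages, typically 8 kB each):

```
# postgresql.conf excerpt (illustrative values from pgtune-style output)
wal_buffers = 16MB            # was "16", i.e. only 16 WAL pages
effective_cache_size = 4GB    # was missing entirely
```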

A RECOMMENDATION:
With all that said: on restart, bareos-dir outputs a warning if the "max_connections" setting for PostgreSQL is below the "MaxConcurrentJobs" for bareos-dir.

Given how critical tuning is to bareos-dir performance, it would be great if bareos-dir would also warn when the database has not been tuned.

mvwieringen

2016-01-20 18:51

developer   ~0002131

Honestly, I think that is something for an external tool. There are quite
a few tuning tools for different databases which take into account the filling
of the different tables etc. That is not something for a backup tool to try
to do better. By the way, if you set it to 16, do you then get 16 x 4 kB or 8 kB?
That would mean 64 or 128 kB max.

The 30 vs 35 seconds is not too bad, as it also has to go through PHP into
the JavaScript applet and then be rendered by your browser.

xyros

2016-01-20 19:56

reporter   ~0002132

"b.t.w. if you set it to 16 do you then get 16 x 4 Kb or 8 Kb ?
As that would mean 64 or 128 Kb max."

I'm not sure what you're referring to with that statement. Are you referring to the wal_buffers? If so, I have that set at 16MB.

xyros

2016-01-20 20:01

reporter   ~0002133

If my post wasn't supposed to change the bug status, that was not intentional; all I did was refresh the page and post my response.

mvwieringen

2016-01-21 16:14

developer   ~0002140

I was more interested in what tiny size you end up with when setting it to
16 without any multiplier like MB/GB. But I think we can close this bug,
as it was obviously mostly a tuning problem on the backend.

xyros

2016-01-21 20:38

reporter   ~0002142

According to the Postgresql documentation, "any positive value less than 32kB will be treated as 32kB." http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html

Yes, the bug can be closed. I have no other issues related to this. Any remaining issues seem particular to MySQL queries, as noted earlier, and are likely best dealt with via a separate bug report.

Thank you for all your help.

frank

2016-01-22 11:55

developer   ~0002146

Closed as stated above.

Issue History

Date Modified Username Field Change
2016-01-17 05:01 xyros New Issue
2016-01-18 02:09 reddrop Note Added: 0002099
2016-01-18 11:45 hege Note Added: 0002100
2016-01-18 11:45 hege Note Edited: 0002100
2016-01-18 17:46 frank Assigned To => frank
2016-01-18 17:46 frank Status new => assigned
2016-01-19 00:48 reddrop Note Added: 0002102
2016-01-19 02:22 xyros Note Added: 0002103
2016-01-19 08:12 hege Note Added: 0002104
2016-01-19 08:15 hege Note Edited: 0002104
2016-01-19 08:15 hege Note Edited: 0002104
2016-01-19 09:03 reddrop Note Added: 0002105
2016-01-19 14:58 frank Note Added: 0002106
2016-01-19 15:03 frank Note Edited: 0002106
2016-01-19 15:04 frank Status assigned => acknowledged
2016-01-19 17:08 xyros Note Added: 0002107
2016-01-19 17:24 mvwieringen Note Added: 0002108
2016-01-19 18:29 xyros Note Added: 0002109
2016-01-19 18:38 mvwieringen Note Added: 0002110
2016-01-19 19:45 xyros Note Added: 0002111
2016-01-19 20:08 mvwieringen Note Added: 0002112
2016-01-19 20:10 xyros Note Added: 0002113
2016-01-19 20:25 xyros Note Added: 0002114
2016-01-19 20:37 mvwieringen Note Added: 0002115
2016-01-19 21:24 xyros Note Added: 0002116
2016-01-20 18:29 xyros Note Added: 0002130
2016-01-20 18:51 mvwieringen Note Added: 0002131
2016-01-20 18:52 mvwieringen Status acknowledged => feedback
2016-01-20 19:56 xyros Note Added: 0002132
2016-01-20 19:56 xyros Status feedback => assigned
2016-01-20 20:01 xyros Note Added: 0002133
2016-01-21 12:04 frank Status assigned => feedback
2016-01-21 16:14 mvwieringen Note Added: 0002140
2016-01-21 20:38 xyros Note Added: 0002142
2016-01-21 20:38 xyros Status feedback => assigned
2016-01-22 11:55 frank Note Added: 0002146
2016-01-22 11:55 frank Status assigned => closed
2016-01-22 11:55 frank Assigned To frank =>
2016-01-22 11:55 frank Resolution open => no change required