View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000600 | bareos-core | webui | public | 2016-01-17 05:01 | 2016-04-25 17:15 |
Reporter | xyros | Assigned To | | | |
Priority | normal | Severity | major | Reproducibility | always |
Status | closed | Resolution | no change required | | |
Platform | Linux | OS | Ubuntu | OS Version | 14.04 |
Product Version | 15.2.2 | | | | |
Summary | 0000600: Webui unresponsive indefinitely, and PostgreSQL at 100% CPU for 20+ minutes, on Restore page with 589K+ files | | | | |
Description | When I select the Restore page, the first client loaded (based on alphabetic order) contains an Ubuntu user home directory backup with the following characteristics: 59 GB backup (20 GB LZ4 compressed), 65846 folders, 589227 files. Bareos webui immediately becomes unresponsive (indefinitely) and PostgreSQL consumes 100% CPU for about 20 minutes. By the time PostgreSQL has settled down again, the webui has timed out and therefore a new login is required. If one allows the process to complete, the Restore page is subsequently usable and the client's files are shown -- including after a system reboot. Performing another full backup on the same client increases the time PostgreSQL spends at 100% CPU to about 40 minutes. Another full backup for the same client increases this to about 60 minutes. Note: purging that client's files (using bconsole) prior to the next full backup makes no difference to the time it takes for the processing to complete and the high CPU load to die down. However, if the purge is performed before the Restore page has triggered parsing, the page loads immediately and the client's folder structure is browsable (actual file content obviously missing). This high CPU issue happens both with PostgreSQL using default settings in postgresql.conf and with PostgreSQL tuned using pgtune. I then performed a final test by rerunning a full backup for a client that has a very small number of files. Restore page responsiveness was immediate. Therefore, this issue appears to happen only for backups that have a large number of files, and the effect is multiplied by the number of full backups that exist for that client. I tried looking at the /usr/share/bareos-webui/public/js/jquery.dataTables.js file to see if there was something obvious I could change, but I don't understand the code. It does, however, seem (if I'm reading the code right) as though server-side processing is not being used. The DataTables documentation, at the following URL, recommends enabling server-side processing mode for large data sets: https://datatables.net/reference/option/serverSide Given how unresponsive the browser becomes, I'm assuming client-side processing is what's currently implemented. If you need additional information, please let me know. I can't provide access to the actual systems; however, if you have difficulty reproducing, I may be able to write a bash script to analyze the directory and file structure and create a script you can run to build an approximate representation of the directory tree -- with random data and filenames. Thanks | | | | |
Steps To Reproduce | 1. Perform a full backup of a Linux directory with the following approximate characteristics: 59 GB backup (20 GB LZ4 compressed), 65846 folders, 589227 files (a sketch for generating a comparable directory tree follows this table). 2. Once the backup completes, log in to the webui, select the Restore page/tab and select the client you backed up in Step 1. | | | | |
Tags | No tags attached. | ||||
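For reproducing at scale without access to the original data, here is a minimal, hypothetical bash sketch (all paths and counts are assumptions chosen to approximate the numbers in the report) that builds a directory tree of roughly 65,000 folders and 585,000 small files:

```bash
#!/bin/bash
# Hypothetical sketch (not from the report): generate a directory tree of a
# similar scale to the one described above, to reproduce the slow
# path-hierarchy cache update.  Adjust DIRS / FILES_PER_DIR as needed.
set -eu

ROOT=/srv/bareos-restore-test   # assumed test location
DIRS=65000                      # roughly the folder count from the report
FILES_PER_DIR=9                 # ~585,000 files in total

mkdir -p "$ROOT"
for d in $(seq 1 "$DIRS"); do
    dir="$ROOT/dir$(( d / 1000 ))/sub$d"
    mkdir -p "$dir"
    for f in $(seq 1 "$FILES_PER_DIR"); do
        # small random files; the size is irrelevant, only the path count matters
        head -c 1024 /dev/urandom > "$dir/file$f.bin"
    done
done
```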
I can confirm I am also seeing the same issue with MariaDB 5.5 and the following Bareos versions on CentOS 7: bareos.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-bat.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-bconsole.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-client.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-common.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-database-common.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-database-mysql.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-database-tools.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-director.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-filedaemon.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-storage.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-storage-tape.x86_64 15.2.2-37.1.el7 @bareos_bareos-15.2 bareos-webui.noarch 15.2.2-41.1.el7 @bareos_bareos-15.2 |
|
Crontab this every morning, or script it to run after your backups: echo '.bvfs_update' | bconsole. If you never manually update the path hierarchy cache, it will take forever the first time you browse a large restore job, since the webui runs that exact same command first. Google bvfs_update for more details. |
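For illustration, a hedged sketch of what such a scheduled run could look like as an /etc/cron.d entry (the schedule and the bconsole path are assumptions, not taken from this report):

```bash
# Hypothetical /etc/cron.d/bareos-bvfs-update entry: refresh the bvfs
# path-hierarchy cache every morning at 06:00, after the nightly backups.
# Adjust the time and the bconsole path to match your installation.
0 6 * * * root echo '.bvfs_update' | /usr/sbin/bconsole >/dev/null 2>&1
```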
|
Hi, I ran echo '.bvfs_update' | bconsole. It made no difference to the timeout in the restore pane of the webui. Additionally, this is the error message I get emailed once the restore does time out: 19-Jan 10:43 sydplbak001-dir: ERROR in output_formatter.c:633 Failed to send result as json. Maybe result message to long? Message length = 323 |
|
@reddrop: hege is not suggesting that it would be faster. Rather, he/she is suggesting automating that command to run after backups complete, so that the time-consuming work is done before you need to attempt a restore. It is an interesting workaround but does not resolve the actual issue. |
|
You can easily verify that it's doing bvfs_update by looking at the active SQL queries in the database: it's doing inserts into the PathVisibility table. I don't know why this small speedup patch has not yet been applied from master to the 15.2 branch; it helps with bvfs_update speed, and I've had it patched into my 15.2 for a long time: https://github.com/bareos/bareos/commit/df52b6735cd39acf01b13a22f52e0287a76bb926 My suggestion resolves the problems you have if PathVisibility is the culprit, which is most certainly the case. I don't consider it a "workaround": it does exactly the same thing as the webui, updating the cache beforehand so restores are possible, but by running it in advance you don't need to wait long for the first directory listing. Bareos should document this better and update the cache automatically, or optimize the whole thing another way. |
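As an aside, one hedged way to watch those statements on a PostgreSQL backend (the database name "bareos" and the default "postgres" superuser are assumptions) is via pg_stat_activity:

```bash
# Show what the director is currently running against the catalog database.
# On PostgreSQL 9.2+ the active statement is in pg_stat_activity.query.
sudo -u postgres psql bareos -c \
  "SELECT pid, now() - query_start AS runtime, query
     FROM pg_stat_activity
    WHERE state = 'active';"
```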
|
I don't think I was clear enough in my previous post: I ran echo '.bvfs_update' | bconsole several times and my webui is still timing out when trying to list the files for selection. The query being run while the webui spins forever and says "Loading" is: SELECT 'D', tmp.PathId, 0, tmp.Path, JobId, LStat, FileId FROM (SELECT PPathId AS PathId, '..' AS Pa... |
|
So, let's break it down step by step to track down the culprit. What the webui basically does first when calling the restore module is the following: 1. Switch into api 2 mode (JSON): .api 2 2. Get all needed job ids for a normal complete restore; with the "all" option, all defined Filesets will be used: .bvfs_get_jobids jobid=10 all 3. Compute the directory cache for the jobs you got from the previous command: .bvfs_update jobid=1,2,3,10 4. Get the root (or Windows drives) representing the first level of the tree: .bvfs_lsdirs jobid=1,2,3,10 path= Every subtree/node underneath the top level is lazy-loaded by the use of .bvfs_lsdirs and .bvfs_lsfiles. Please execute those commands in bconsole, verify whether you encounter any problems during the initial construction of the tree in your environment, and let us know. You can also test those commands in api mode 0 to look for any differences. In addition, you can try to list directories and files with lots of content in api mode 2 to look for any anomalies. This way we can better determine where to look further: director/db or webui. More details of the bvfs api are documented here: http://doc.bareos.org/master/html/bareos-developer-guide.html#sec:bvfs |
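For convenience, the four steps above can also be piped into bconsole non-interactively; a minimal sketch, where the jobid values are the placeholders from this note and must be replaced with real job ids from your catalog:

```bash
# Run frank's steps in one non-interactive bconsole session.
# The jobid values below are placeholders -- substitute your own.
printf '%s\n' \
  '.api 2' \
  '.bvfs_get_jobids jobid=10 all' \
  '.bvfs_update jobid=1,2,3,10' \
  '.bvfs_lsdirs jobid=1,2,3,10 path=' \
  | bconsole
```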
|
@frank: The patch hege linked to would fix the primary cause of this issue. The person who wrote that patch says it took his update from 17-18 hours down to only 17 minutes, for a dataset of greater than 1 million files. For convenience, the related post is here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=737821 The remaining issue would then be the webui hanging during processing. As for your instructions above, I'll give them a try, if I can figure out how to use the api 2 mode you mentioned. Thanks |
|
The linked patch was for MySQL, so I'm not sure such drastic performance improvements will be seen with PostgreSQL, which the original reporter of this bug says he is using. The reason it's only in master is that we haven't had a chance to fully test what side effects it has on other databases and workloads. Most people also need to tune their database, both PostgreSQL and any MySQL derivative, as some problems occur simply because the query planner picks a non-optimal query plan. |
|
OK. I've run the commands frank indicated; however, I have no idea what I'm looking at or how to tell whether the output is correct. Do you want me to post the output of a run against the Linux client with a large number of files (assuming it will contain no sensitive information)? |
We are not so interested in the actual output, more in the response to certain commands and how fast they respond, because I get the feeling it's more a tuning problem than anything else. So it's probably more a stopwatch test, to see how long the different steps take and where the bottleneck is, if there is one. If we don't find any, we can always zoom in on particular data, but for a first test that is not strictly needed. It would be interesting, however, to have the data if some error is posted. B.t.w., only attach data to the bug itself, so we can also remove it after processing. |
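One hedged way to do that stopwatch test from the shell is to run each bvfs command in its own timed bconsole invocation (the jobid values are the placeholders from frank's note):

```bash
#!/bin/bash
# Time each bvfs step separately; replace the jobid placeholders with your own.
for cmd in '.bvfs_get_jobids jobid=10 all' \
           '.bvfs_update jobid=1,2,3,10' \
           '.bvfs_lsdirs jobid=1,2,3,10 path='; do
    echo "== $cmd"
    # discard the listing output; `time` reports the duration on stderr
    time ( printf '.api 2\n%s\n' "$cmd" | bconsole > /dev/null )
done
```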
|
Here are my results (based on the steps frank posted): Step 1: immediate. Step 2: approx. 1/4 seconds. Step 3: 29:29 minutes. Step 4: immediate. Note: my PostgreSQL install is already tuned, as I noted in the root post. |
|
OK, so it spends its time in the update_path_hierarchy_cache() function, so this has little to do with the webui. You could try the experimental master director with the changed query, or run the current director with debug level 500, which prints the queries it fires against the database so we can see exactly what query it is. Another option would be to look in PostgreSQL and see what query it is executing. |
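A hedged sketch of the debug-level approach (the trace file location depends on the director's working directory; /var/lib/bareos is an assumption):

```bash
# Raise the director debug level and write its output to a trace file, so
# the SQL it fires against the database is visible while the cache builds.
echo 'setdebug level=500 dir trace=1' | bconsole

# Watch the trace while reproducing the slow step; press Ctrl-C when done.
tail -f /var/lib/bareos/*.trace

# Afterwards, turn the debugging back off.
echo 'setdebug level=0 dir trace=0' | bconsole
```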
|
Minor typo. Step 3 should have read: 23:29 minutes | |
Is there a way to delete that path hierarchy cache so I can perform the additional tests you've requested, without having to perform a new full backup and compound the processing time? Other questions: Where do I get the experimental master director? I installed Bareos from the Ubuntu PPA distro. |
|
The cache is a cache, so you don't have to do a new backup. It only builds a cache to speed things up, which in your situation only seems to slow down the process. .bvfs_clear_cache clears the cache, i.e. it truncates the meta table. Regarding experimental packages, we publish those on http://download.bareos.org/bareos/experimental/nightly/ ; there are packages there for most if not all Linux distributions. It's built from the current master branch and, after an internal function test, published during the night around 04:30 CET. As the database schema is still the same, it should just work. However, if you have the storage daemon and file daemon on the same machine, they also need to be upgraded, as the shared libs are numbered by version. |
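For example, a minimal sketch of clearing the cache and re-timing the update step without taking another full backup (the jobid list is a placeholder):

```bash
# Clear the bvfs path-hierarchy cache, then rebuild it under `time`.
echo '.bvfs_clear_cache yes' | bconsole
time ( echo '.bvfs_update jobid=1,2,3,10' | bconsole >/dev/null )
```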
|
Thanks. I'll test and post my results sometime between now and tomorrow morning. | |
Some good news: it was my fault. I ran pgtune and edited my postgresql.conf, but forgot the "MB" after "wal_buffers = 16" and also forgot to apply the "effective_cache_size" setting. I discovered my error when building a VM to run the experimental build and conduct the tests. With those fixed, the .bvfs_update step completes in 30 seconds on both the production server and the VM I had set up for testing. Clearing the cache and choosing the client from the Restore page of the webui, it completes in 35 seconds -- so only marginally slower from the webui. The net result is that with a properly tuned PostgreSQL, performance is reasonable, and there is no perceptible performance difference between the current release and experimental builds. A RECOMMENDATION: With all that said, on restart bareos-dir outputs a warning if the "max_connections" setting for PostgreSQL is below the "MaxConcurrentJobs" for bareos-dir. Given how critical the tuning is to bareos-dir performance, it would be great if bareos-dir would also warn if the database has not been tuned. |
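As a quick sanity check after editing postgresql.conf and reloading, the effective values can be read back from the server (assuming the default "postgres" superuser; these are the settings mentioned above plus the max_connections value that bareos-dir already warns about):

```bash
# Confirm the tuned settings actually took effect after the restart/reload.
sudo -u postgres psql -c "SHOW wal_buffers;"
sudo -u postgres psql -c "SHOW effective_cache_size;"
sudo -u postgres psql -c "SHOW max_connections;"
```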
|
Honestly, I do think that is something for an external tool. There are quite a few tuning tools for different databases which take into account the filling of the different tables etc. That is not something for a backup tool to try to do better. B.t.w., if you set it to 16, do you then get 16 x 4 kB or 8 kB? That would mean 64 or 128 kB max. The 30 vs 35 seconds is not too bad, as it also has to go through PHP into the JavaScript and then be rendered by your browser. |
|
"b.t.w. if you set it to 16 do you then get 16 x 4 Kb or 8 Kb ? As that would mean 64 or 128 Kb max." I'm not sure what you're referring to with that statement. Are you referring to the wal_buffers? If so, I have that set at 16MB. |
|
If my post wasn't supposed to change the bug status, the change was not intentional; all I did was refresh the page and post my response. |
I was more interested in what tiny size you end up with when setting it to 16 without any multiplier like MB/GB. But I think we can close this bug, as it was obviously mostly a tuning problem on the backend. |
|
According to the PostgreSQL documentation, "any positive value less than 32kB will be treated as 32kB." http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html Yes, the bug can be closed. I have no other issues related to this. Any other issues seem particular to MySQL queries, as noted earlier, and are likely best dealt with via a different bug report. Thank you for all your help. |
|
Closed as stated above. | |
Date Modified | Username | Field | Change |
---|---|---|---|
2016-01-17 05:01 | xyros | New Issue | |
2016-01-18 02:09 | reddrop | Note Added: 0002099 | |
2016-01-18 11:45 | hege | Note Added: 0002100 | |
2016-01-18 11:45 | hege | Note Edited: 0002100 | |
2016-01-18 17:46 | frank | Assigned To | => frank |
2016-01-18 17:46 | frank | Status | new => assigned |
2016-01-19 00:48 | reddrop | Note Added: 0002102 | |
2016-01-19 02:22 | xyros | Note Added: 0002103 | |
2016-01-19 08:12 | hege | Note Added: 0002104 | |
2016-01-19 08:15 | hege | Note Edited: 0002104 | |
2016-01-19 08:15 | hege | Note Edited: 0002104 | |
2016-01-19 09:03 | reddrop | Note Added: 0002105 | |
2016-01-19 14:58 | frank | Note Added: 0002106 | |
2016-01-19 15:03 | frank | Note Edited: 0002106 | |
2016-01-19 15:04 | frank | Status | assigned => acknowledged |
2016-01-19 17:08 | xyros | Note Added: 0002107 | |
2016-01-19 17:24 | mvwieringen | Note Added: 0002108 | |
2016-01-19 18:29 | xyros | Note Added: 0002109 | |
2016-01-19 18:38 | mvwieringen | Note Added: 0002110 | |
2016-01-19 19:45 | xyros | Note Added: 0002111 | |
2016-01-19 20:08 | mvwieringen | Note Added: 0002112 | |
2016-01-19 20:10 | xyros | Note Added: 0002113 | |
2016-01-19 20:25 | xyros | Note Added: 0002114 | |
2016-01-19 20:37 | mvwieringen | Note Added: 0002115 | |
2016-01-19 21:24 | xyros | Note Added: 0002116 | |
2016-01-20 18:29 | xyros | Note Added: 0002130 | |
2016-01-20 18:51 | mvwieringen | Note Added: 0002131 | |
2016-01-20 18:52 | mvwieringen | Status | acknowledged => feedback |
2016-01-20 19:56 | xyros | Note Added: 0002132 | |
2016-01-20 19:56 | xyros | Status | feedback => assigned |
2016-01-20 20:01 | xyros | Note Added: 0002133 | |
2016-01-21 12:04 | frank | Status | assigned => feedback |
2016-01-21 16:14 | mvwieringen | Note Added: 0002140 | |
2016-01-21 20:38 | xyros | Note Added: 0002142 | |
2016-01-21 20:38 | xyros | Status | feedback => assigned |
2016-01-22 11:55 | frank | Note Added: 0002146 | |
2016-01-22 11:55 | frank | Status | assigned => closed |
2016-01-22 11:55 | frank | Assigned To | frank => |
2016-01-22 11:55 | frank | Resolution | open => no change required |