0000189: verify linux magazin article about unused indexes - Bareos Bug Tracker

ID	Project	Category	View Status	Date Submitted	Last Update

0000189	bareos-core	director	public	2013-06-13 08:23	2015-10-22 13:27

Reporter	pstorz	Assigned To	stephand
Priority	normal	Severity	minor	Reproducibility	have not tried
Status	closed	Resolution	no change required
Platform	Linux	OS	any	OS Version	3
Product Version	12.4.4

Summary	0000189: verify linux magazin article about unused indexes
Description	linux mag article about postgres optimize says bacula has unneeded idxs on file table
Tags	No tags attached.

stephand 2013-07-05 14:35 developer ~0000497	The mentioned article from the german Linux-Magazine 07/13 regarding PostgreSQL Performance tuning is this one: http://www.linux-magazin.de/Ausgaben/2013/07/PostgreSQL-Tuning It is not available for free online. On page 25 it shows the output of SELECT relname,indexrelname,idx_scan,idx_tup_read,idx_tup_fetch FROM pg_stat_user_indexes; In the example in the article the indices file_jpfid_idx and file_pathid_idx have never been used. On page 26 the article explains that unused indeces are a pure wast of resources. In bareos the index file_pathid_idx is not enabled by default. /usr/lib/bareos/scripts/make_postgresql_tables has the following comments: -- Possibly add one or more of the following indexes -- if your Verifies are too slow, but they can slow down -- backups. -- -- CREATE INDEX file_pathid_idx ON file(pathid); -- CREATE INDEX file_filenameid_idx ON file(filenameid); So the article is right, if verify jobs are not beeing used, the file_pathid_idx index may be useless. Regarding file_jpfid_idx, which is defined as CREATE INDEX file_jpfid_idx ON File (JobId, PathId, FilenameId); This index will be used as soon as using restore with menu point 7 or 8: 7: Enter a list of files to restore 8: Enter a list of files to restore before a specified time In these modes, absolute paths must be entered or can be read from a filelist file by entering </path/to/restore_filelist which is the same like using restore client=bareost01-fd file=</var/tmp/restore.filelist The index matters because then no directory tree is built, but one SELECT statement is executed for each file entered or listed. The backup impact of this index is low. Compared the duration of repeated full backups of 1.000.000 Files that have already been backed up before, so that inserts go mainly to the File table. Results ith 4 backup runs each without file_jpfid_idx: Avg. Elapsed time: 5 mins 12 secs with file_jpfid_idx: Avg. Elapsed time: 5 mins 21 secs Other things the article does not mention: - The planner does sequential scans on small tables because it is faster, nevertheless, when tables grow, the index will be used. - does not mention EXPLAIN ANALYZE, may be due to limitation of word count. Although this topic is worth a complete article, also it shows if and which indices are used in a given query - that resetting the statistics counters is possible with SELECT pg_stat_reset();

stephand 2013-07-05 18:55 developer ~0000509	Done further testing, it turned out that the type of restore mentioned above is dramatically faster, when replacing the index file_jpfid_idx on File(Jobid, Pathid, Filenameid) by an index only on Pathid and Filenameid like this: DROP INDEX file_jpfid_idx; CREATE INDEX file_pfid_idx ON File(Pathid, Filenameid);

AndiH 2013-08-08 12:08 reporter ~0000567	The problem is related to the fact that a multi column index is only useful if the left-most column is actually needed for the query (http://www.postgresql.org/docs/9.2/interactive/indexes-multicolumn.html). As at least recent PostgreSQL is able to use more than one index per column and if a single column index on JobId is already there, further index should not have JobId as left-most IMHO. So the soulution could be to only use single column index or at least a left-most column which is not already covered by another index, no? Not sure about MySQL though but there are rumors that it does not support multiple index per query... Taken this all maybe single column index on all three is the way to go, but the overhead on inserts have to be taken into account. BTW, that's our numbers without verify jobs: Scans Tuple-Read Tuple-Fetched Size file_jobid_idx 19103 2133101746 142814099 1955 MB file_jpfid_idx 9798 5721 2996 2769 MB Database size is ~20G and file table contain around 75 mio rows.

stephand 2013-08-08 16:17 developer ~0000568	Thanks for your interesting comment. So you example also shows that the file_jpfid_idx index was used. I'v done my testing using PostgreSQL 8.4 on CentOS 6. The description of multicolumn indexes is same for 8.4, see http://www.postgresql.org/docs/8.4/interactive/indexes-multicolumn.html Using a filelist based restore like this echo 'restore client=bareost01-fd before="2013-06-11 16:30:00" file=</var/tmp/restore.filelist yes' \| bconsole results in queries like the following for each file in the list: EXPLAIN ANALYZE SELECT Job.JobId,File.FileIndex FROM Job,File,Path,Filename,Client WHERE Job.JobId=File.JobId AND Job.StartTime<='2013-06-11 16:30:00' AND Path.Path='/data/test03/4-KrZryr6RzfXrl7C/4-afiFEmWFiFCGjXDQioJKEOuXjwZZc2/40-USyTUGqUuswuk06zw3Frcf5yQMkRP2N/' AND Filename.Name='YIbQbHayxOZ9TRd5d16' AND Client.Name='bareost01-fd' AND Job.ClientId=Client.ClientId AND Path.PathId=File.PathId AND Filename.FilenameId=File.FilenameId AND JobStatus IN ('T','W') AND Type='B' ORDER BY Job.StartTime DESC LIMIT 1; In my tests it turned out this type of query is much faster when only using Pathid, Filenameid for the index. I wasn't aware how much the column order in multicolumn indexes matters. As far as I understand the explanation, when using the default index on the 3 columns file_jpfid_idx ON File (JobId, PathId, FilenameId) that could explain why it is slower. Although there is a constraint on JobId in the query (Job.JobId=File.JobId), that would mean that the index portion to be scanned is much bigger than when using the index on only PathId, FilenameId. So if the column order in the index works like this, then it should be even faster when defining the index with the Filenameid first like this: CREATE INDEX file_fpid_idx ON File(Filenameid, Pathid); I also tested with separate indexes on Pathid and Filenameid, but it turned out to be a bit slower than using the Multicolumn index on Pathid, Filenameid I did not look into MySQL yet, but it seems that it uses "index merge" to combine separate indexes. Anyway, referring to http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/ it looks like multicolumn indexes are faster if properly set up. Referring to http://www.postgresql.org/docs/8.4/static/indexes-bitmap-scans.html PostgreSQL combines indexes in this way: "To combine multiple indexes, the system scans each needed index and prepares a bitmap in memory giving the locations of table rows that are reported as matching that index's conditions. The bitmaps are then ANDed and ORed together as needed by the query. " This would explain a bit of overhead compared to a multicolumn index. While it looks like multiple single column indexes can be used in a more flexible way by the query optimizer, I would opt for properly designed multicolumn indexes. As we are only talking about integer columns here, I assume that the insert performance impact is not very big in that case.

Date Modified	Username	Field	Change
2013-06-13 08:23	pstorz	New Issue
2013-06-21 16:47	joergs	Assigned To	=> stephand
2013-06-21 16:47	joergs	Status	new => assigned
2013-07-05 14:35	stephand	Note Added: 0000497
2013-07-05 18:55	stephand	Note Added: 0000509
2013-08-08 12:08	AndiH	Note Added: 0000567
2013-08-08 16:17	stephand	Note Added: 0000568
2015-10-22 13:27	stephand	Status	assigned => closed
2015-10-22 13:27	stephand	Resolution	open => no change required