View Issue Details

IDProjectCategoryView StatusLast Update
0000189bareos-coredirectorpublic2015-10-22 13:27
Reporterpstorz Assigned Tostephand  
PrioritynormalSeverityminorReproducibilityhave not tried
Status closedResolutionno change required 
PlatformLinuxOSanyOS Version3
Product Version12.4.4 
Summary0000189: verify linux magazin article about unused indexes
Descriptionlinux mag article about postgres optimize says bacula has unneeded idxs on file table
TagsNo tags attached.

Activities

stephand

stephand

2013-07-05 14:35

developer   ~0000497

The mentioned article from the german Linux-Magazine 07/13 regarding PostgreSQL Performance tuning is this one:
http://www.linux-magazin.de/Ausgaben/2013/07/PostgreSQL-Tuning
It is not available for free online.

On page 25 it shows the output of

SELECT relname,indexrelname,idx_scan,idx_tup_read,idx_tup_fetch FROM pg_stat_user_indexes;

In the example in the article the indices file_jpfid_idx and file_pathid_idx have never been used. On page 26 the article explains that unused indeces are a pure wast of resources.

In bareos the index file_pathid_idx is not enabled by default.
/usr/lib/bareos/scripts/make_postgresql_tables has the following comments:

-- Possibly add one or more of the following indexes
-- if your Verifies are too slow, but they can slow down
-- backups.
--
-- CREATE INDEX file_pathid_idx ON file(pathid);
-- CREATE INDEX file_filenameid_idx ON file(filenameid);

So the article is right, if verify jobs are not beeing used, the file_pathid_idx index may be useless.

Regarding file_jpfid_idx, which is defined as
CREATE INDEX file_jpfid_idx ON File (JobId, PathId, FilenameId);
This index will be used as soon as using restore with menu point 7 or 8:
     7: Enter a list of files to restore
     8: Enter a list of files to restore before a specified time

In these modes, absolute paths must be entered or can be read from a filelist file
by entering </path/to/restore_filelist
which is the same like using
restore client=bareost01-fd file=</var/tmp/restore.filelist
The index matters because then no directory tree is built, but one
SELECT statement is executed for each file entered or listed.

The backup impact of this index is low. Compared the duration of repeated
full backups of 1.000.000 Files that have already been backed up before, so that
inserts go mainly to the File table. Results ith 4 backup runs each
  without file_jpfid_idx: Avg. Elapsed time: 5 mins 12 secs
  with file_jpfid_idx: Avg. Elapsed time: 5 mins 21 secs

Other things the article does not mention:
- The planner does sequential scans on small tables because it is faster,
  nevertheless, when tables grow, the index will be used.
- does not mention EXPLAIN ANALYZE, may be due to limitation of word count.
  Although this topic is worth a complete article, also it shows if and which
  indices are used in a given query
- that resetting the statistics counters is possible with
  SELECT pg_stat_reset();
stephand

stephand

2013-07-05 18:55

developer   ~0000509

Done further testing, it turned out that the type of restore mentioned above is dramatically faster, when replacing the index file_jpfid_idx on File(Jobid, Pathid, Filenameid) by an index only on Pathid and Filenameid like this:

DROP INDEX file_jpfid_idx;
CREATE INDEX file_pfid_idx ON File(Pathid, Filenameid);
AndiH

AndiH

2013-08-08 12:08

reporter   ~0000567

The problem is related to the fact that a multi column index is only useful if the left-most column is actually needed for the query (http://www.postgresql.org/docs/9.2/interactive/indexes-multicolumn.html). As at least recent PostgreSQL is able to use more than one index per column and if a single column index on JobId is already there, further index should not have JobId as left-most IMHO. So the soulution could be to only use single column index or at least a left-most column which is not already covered by another index, no?
Not sure about MySQL though but there are rumors that it does not support multiple index per query...
Taken this all maybe single column index on all three is the way to go, but the overhead on inserts have to be taken into account.

BTW, that's our numbers without verify jobs:
                Scans Tuple-Read Tuple-Fetched Size
file_jobid_idx 19103 2133101746 142814099 1955 MB
file_jpfid_idx 9798 5721 2996 2769 MB

Database size is ~20G and file table contain around 75 mio rows.
stephand

stephand

2013-08-08 16:17

developer   ~0000568

Thanks for your interesting comment. So you example also shows that the file_jpfid_idx index was used.

I'v done my testing using PostgreSQL 8.4 on CentOS 6. The description of multicolumn indexes is same for 8.4, see
http://www.postgresql.org/docs/8.4/interactive/indexes-multicolumn.html

Using a filelist based restore like this

echo 'restore client=bareost01-fd before="2013-06-11 16:30:00" file=</var/tmp/restore.filelist yes' | bconsole

results in queries like the following for each file in the list:

EXPLAIN ANALYZE SELECT Job.JobId,File.FileIndex FROM Job,File,Path,Filename,Client WHERE Job.JobId=File.JobId AND Job.StartTime<='2013-06-11 16:30:00' AND Path.Path='/data/test03/4-KrZryr6RzfXrl7C/4-afiFEmWFiFCGjXDQioJKEOuXjwZZc2/40-USyTUGqUuswuk06zw3Frcf5yQMkRP2N/' AND Filename.Name='YIbQbHayxOZ9TRd5d16' AND Client.Name='bareost01-fd' AND Job.ClientId=Client.ClientId AND Path.PathId=File.PathId AND Filename.FilenameId=File.FilenameId AND JobStatus IN ('T','W') AND Type='B' ORDER BY Job.StartTime DESC LIMIT 1;

In my tests it turned out this type of query is much faster when only using Pathid, Filenameid for the index.

I wasn't aware how much the column order in multicolumn indexes matters. As far as I understand the explanation, when using the default index on the 3 columns file_jpfid_idx ON File (JobId, PathId, FilenameId) that could explain why it is slower. Although there is a constraint on JobId in the query (Job.JobId=File.JobId), that would mean that the index portion to be scanned is much bigger than when using the index on only PathId, FilenameId.

So if the column order in the index works like this, then it should be even faster when defining the index with the Filenameid first like this:
CREATE INDEX file_fpid_idx ON File(Filenameid, Pathid);

I also tested with separate indexes on Pathid and Filenameid, but it turned out to be a bit slower than using the Multicolumn index on Pathid, Filenameid

I did not look into MySQL yet, but it seems that it uses "index merge" to combine separate indexes. Anyway, referring to http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/ it looks like multicolumn indexes are faster if properly set up.

Referring to http://www.postgresql.org/docs/8.4/static/indexes-bitmap-scans.html PostgreSQL combines indexes in this way:

"To combine multiple indexes, the system scans each needed index and prepares a bitmap in memory giving the locations of table rows that are reported as matching that index's conditions. The bitmaps are then ANDed and ORed together as needed by the query. "

This would explain a bit of overhead compared to a multicolumn index.

While it looks like multiple single column indexes can be used in a more flexible way by the query optimizer, I would opt for properly designed multicolumn indexes.

As we are only talking about integer columns here, I assume that the insert performance impact is not very big in that case.

Issue History

Date Modified Username Field Change
2013-06-13 08:23 pstorz New Issue
2013-06-21 16:47 joergs Assigned To => stephand
2013-06-21 16:47 joergs Status new => assigned
2013-07-05 14:35 stephand Note Added: 0000497
2013-07-05 18:55 stephand Note Added: 0000509
2013-08-08 12:08 AndiH Note Added: 0000567
2013-08-08 16:17 stephand Note Added: 0000568
2015-10-22 13:27 stephand Status assigned => closed
2015-10-22 13:27 stephand Resolution open => no change required