View Issue Details

ID: 0000268    Project: bareos-core    Category: storage daemon    View Status: public    Last Update: 2023-08-02 17:22
Reporter: tonyalbers    Assigned To: bruno-at-bareos
Priority: normal    Severity: feature    Reproducibility: always
Status: closed    Resolution: suspended
Platform: All    OS: All    OS Version: All
Product Version: 13.2.2
Summary: 0000268: RFE for disabling inline compression routine in sd to take advantage of deduplication filesystems
Description: To the best of my knowledge, bareos-sd applies some sort of compression or whitespace-removal routine to achieve the highest possible write speed on tape drives.
The same feature is active when writing to file storage, and it prevents dedupe filesystems such as ZFS, Opendedup and SNFS from working effectively.
Now, when using dedupe, you can usually get a reduction ratio of as much as 20:1, which can be even higher if compression is also used.

Steps To Reproduce: Try to use a dedupe filesystem as storage for the SD, and you will see that the data reduction ratio is only a few percent.

Additional Information: See the attached screenshot, which is from a Quantum DXi using SNFS with dedupe and compression enabled. The total data reduction ratio is 30:1, which means this specific box holds 291 TB of backup data on 11 TB of raw disk.

Tags: No tags attached.

Activities

tonyalbers

2014-01-10 09:06

reporter  

Screenshot-19.png (35,704 bytes)   
mvwieringen

2014-01-10 15:12

developer   ~0000785

You seem to know more than I do about the storage daemon, as I don't recall ever seeing code that does any compression in the SD or whitespace removal.

With version 13.4 we did add a plugin that allows you to do on-the-fly compression and decompression of data in the SD, but that is optional.

The only compression that was performed up until now was in the filed (the file daemon). There is also a sparse-file option there that does whitespace removal when it detects a block containing only nulls.
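
As an illustration only, here is a minimal sketch of that null-block check in plain Python (this is not the actual bareos-fd code, and the 64 KiB block size is just an assumed value):

    # Sketch: skip blocks that contain nothing but null bytes, the way a
    # sparse-file option avoids storing "whitespace". The block size is an
    # assumption for illustration, not what bareos-fd really uses.
    BLOCK_SIZE = 64 * 1024

    def is_null_block(block: bytes) -> bool:
        return block.count(0) == len(block)

    def scan(path: str) -> None:
        with open(path, "rb") as f:
            offset = 0
            while block := f.read(BLOCK_SIZE):
                action = "skip (all nulls)" if is_null_block(block) else "store"
                print(f"offset {offset:>12}: {action}")
                offset += len(block)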

The autodeflate SD module is exactly what should keep you from storing compressed data on your filesystem (or on your hardware-compression tape drive), as it allows you to decompress data before writing it to disk or tape.

I think the most important problem is that the current on-disk format is the same as the tape format, which means the metadata is written into the same file as the actual file data, and that mixing makes deduping hard.

There are ways to split out the metadata, and you would probably have to block-align the real data too. The problem, however, is that this is a solution patented by Bacula Systems, so it would be hard to implement anything without touching that patent.
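
To illustrate the alignment problem, here is a toy sketch in plain Python with made-up header and block sizes (real Bareos volumes and filesystem dedup engines are more involved): a block-based dedup engine hashes fixed-size blocks, so interleaving even a small amount of metadata shifts every following block and nothing matches, while splitting the metadata out and block-aligning the payload lets every data block dedup.

    import hashlib
    import os

    BLOCK = 128 * 1024  # hypothetical dedup block size

    def block_hashes(data: bytes) -> set[str]:
        """Hash fixed-size blocks the way a block-based dedup engine might."""
        return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
                for i in range(0, len(data), BLOCK)}

    payload = os.urandom(1024 * 1024)  # 1 MiB of file data shared by both copies

    # Same payload, preceded by differently sized metadata, so the file data
    # no longer starts on the same block boundary in the two "volumes".
    volume_a = b"A" * 100 + payload
    volume_b = b"B" * 300 + payload
    print(len(block_hashes(volume_a) & block_hashes(volume_b)))   # 0 shared blocks

    # If the metadata is padded out to a block boundary (or kept in a separate
    # file), the payload blocks line up and dedup perfectly.
    def pad(b: bytes) -> bytes:
        return b + b"\0" * ((-len(b)) % BLOCK)

    aligned_a = pad(b"A" * 100) + payload
    aligned_b = pad(b"B" * 300) + payload
    print(len(block_hashes(aligned_a) & block_hashes(aligned_b)))  # 8 shared blocks
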
tonyalbers

2014-01-10 16:36

reporter   ~0000786

"You seem to know more then I do about the storage daemon as I don't recall
ever seeing code that does any compression in the SD or white space removal."

Heh.. Yeah right ;)

I was referring to what you mentioned as the merging of metadata and file data.

I'm not a developer, so I have no idea what it takes to, say, build a storage daemon that does not merge the data, but IMO it would be very beneficial.

Yes, I remember you mentioned the patent earlier, but I'm talking about saving the backup data and the metadata to two (or more) separate files and then letting the underlying filesystem do the deduplication. I honestly don't think Bacula can patent that, but again, I don't know the details. I'm just used to supporting EMC NetWorker, Symantec NetBackup, DataDomain and Quantum DXi, where dedupe is more the rule than the exception.
jvaughn

2014-01-23 02:06

reporter   ~0000793

Last edited: 2014-01-23 02:52

Is splitting the metadata and block-aligning the files to the underlying filesystem for dedup the method that Bacula patented?

Could the MD5/SHA1 signature system already in place for file verification be used to perform at least file-level dedup during backup? I.e., if we reach the end of a file and find that its signature (and possibly its partial, path-less filename) matches a file we have already backed up, we just write out some metadata saying "go look at this earlier file for this file's data" (sort of like a hardlink, except it exists only in the metadata and would be restored as a separate file) instead of writing the file data itself.

This would give file-level dedup regardless of the backup storage medium or the OS involved... or is all of this (seemingly obvious to this armchair architect) already patented eight ways to Sunday?

We'd love to get file-level dedup ourselves for our backup usage scenario, as we have quite a few developers who have more-or-less identical copies of git repos checked out in their respective directories, with the only changes being the portions they are actively working on; there could be some massive space savings if we could get dedup'd backups. Each developer has on average about 70GB of checked-out repos, but most of that is duplicated with the others; in total it is currently 1.2TB, uncompressed, as-is on the filesystem.
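
As a rough illustration of that file-level idea (plain Python; the "catalog" here is just an in-memory dict, and none of this reflects how Bareos's catalog or FD/SD protocol actually work):

    import hashlib
    from pathlib import Path

    # Map (size, SHA-1) of already stored content to the path of its first copy.
    seen: dict[tuple[int, str], str] = {}

    def backup(path: Path) -> str:
        data = path.read_bytes()
        key = (len(data), hashlib.sha1(data).hexdigest())
        if key in seen:
            # Store only a reference, like the metadata-only "hardlink" above.
            return f"dedup: {path} -> same content as {seen[key]}"
        seen[key] = str(path)
        return f"stored: {path} ({len(data)} bytes)"

    for p in sorted(Path(".").rglob("*.txt")):
        print(backup(p))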

jvaughn

2014-01-27 22:00

reporter   ~0000799

Having now looked in more detail at how the file daemon is architected and how it communicates with the director and the SD, I understand that the above suggestion is pretty useless as-is. The FD would have to keep its own records of the file information used to determine duplicates (i.e. file size and SHA1), or have access to the catalog, and it would have to read each file fully to calculate the SHA1 before deciding whether to send it to the SD (reading it twice whenever it isn't a dupe, though for non-large files the second read would often be served from cache).
tonyalbers

2014-01-28 15:30

reporter   ~0000800

Don't get sidetracked here, guys. Although I'd love to see Bareos implement some sort of dedupe feature of its own, this RFE is not about that.
What I want to do is to be able to write data from the SD to a filesystem in a format that allows the filesystem itself to do the deduplication. This is how EMC NetWorker does it, and it allows you to use any supported filesystem on your backend. Symantec NetBackup does it exactly the same way, unless you use PureDisk, which in essence is a fileserver that does the dedupe.
Is the autodeflate module available? I'd like to try it out on e.g. ZFS.
mvwieringen

2014-01-28 16:16

developer   ~0000801

Both options are interesting, but at the moment most of these things have some serious patent issues, so we need to be very careful: the whole dedup landscape is kind of seeded with patents, so we need to check that out first.

As for the autodeflate plugin: yes, it is available in master, but I don't see the relationship to ZFS. You want to inflate the data and then let ZFS compress it again? The idea of the plugin is to be used when writing to tape drives with hardware compression (i.e. first inflate the data so that the hardware compression can do its work), or when using a filesystem that doesn't do compression (i.e. first compress the data before writing it out to disk). This way we can do things NetWorker and NetBackup cannot do, e.g. compress data on the fly.
tonyalbers

2014-01-29 10:42

reporter   ~0000802

"Both options are interesting but at this moment most of these things have
some serious patent issues so we need to be very careful as the whole dedup
landscape is kind of seed with patents so we need to check that out first."
Well, AFAIK deduplication as a technology is not globally patented, there are several ways to do it. And FS's like zfs and SDFS(OpenDeDup) are open-source.
And again: I don't want Bareos to do any dedupe, I want the SD's underlying storage filesystem to do that.

"you want to inflate the data and then let zfs
compress it again ?"
Yes, to take advantage of the fact that ZFS can do variable-length block-based deduplication followed by compression. This is way more effective than compression alone.

What I'd like to do is something similar to running dump on a standard filesystem a number of times, writing to separate dumpfiles on a ZFS filesystem with dedupe and compression turned on. Every single block is copied every time I do it, but ZFS discards all the blocks it already knows about and replaces them with internal pointers. So if I dump a single 100MB filesystem to seven different dumpfiles, these dumpfiles take up 700MB on a standard filesystem, but only a little more than 100MB on ZFS with dedupe, and even less when compression is enabled.
jvaughn

2014-02-17 19:59

reporter   ~0000817

I apologize for having derailed your original feature request - the two methods (letting the filesystem dedup vs. having Bareos do the dedup) are entirely separate, and letting the filesystem do it will be much simpler to implement on Bareos' side. It just doesn't work for us, since we're using consumer / "prosumer" NAS devices for storing the backups, we can only access them via NFS, and they don't support running ZFS natively. We might be able to run ZFS via loopback (haven't looked into that), though...

The request for dedup in Bareos itself should be split off into another issue. I may do that once I have a few free minutes.
DavidHache

2014-11-06 22:35

reporter   ~0001042

I just stumbled on this topic and it's exactly what I am trying to do.
I am currently running a 1TB backup on my systems and need to send it over the WAN.
Currently, for testing, I am using local backups.

I run a full with 1TB of data, and then incrementals. Then I do virtual fulls to avoid pushing the data over the WAN. This works quite well. The problem is that, to do so, I need an extra 1TB of disk space. In my small-scale environment this is fine, but if I want to scale it up, it won't be practical or economical.

I am currently backing up to disk on a ZFS file system with deduplication enabled.

I would be interested in seeing the patents they hold for the "deduplication". My large-scale clients use Commvault, NetBackup, Nexenta, and NetApp. All of these products use some variant of deduplication, but in reality they all do the same thing: hash a block of data, put it in a table, and compare.

It would be great to push more on this issue, as I would then be able to recommend this product easily. Who would say no to a 90:1 dedupe ratio on a 12-week cycle to disk with a mostly Windows/RHEL environment?
tonyalbers

2014-11-07 10:19

reporter   ~0001044

David, that's my point exactly. And I've just added a screenshot from a DataDomain box which is a dedupe appliance. The backup software used is EMC NetWorker. Clients are a mix of Linux and Windows machines. Note the data reduction ratio. The Bareos developers should _seriously_ consider making it possible to use dedupe filesystems/appliances for storage.
tonyalbers

2014-11-07 10:21

reporter  

reducratio.png (7,350 bytes)   
mvwieringen

2014-11-07 11:18

developer   ~0001047

It's an open-source project, so take a stab at it, I would say.

The whole dedup area is a minefield with patents all over the place, so unless you have a pool of money for a patent lawyer it is hard to move on this. It's something we will eventually have to tackle, but there has not been a lot of business incentive.
DavidHache

2014-11-07 15:47

reporter   ~0001054

I will go read up on the modules that write the data to disk.
See what I can do.
joe

2015-02-05 21:05

reporter   ~0001262

As this is the feature I'm missing most, it would be great to see something implemented. I know it won't be easy, but there are other open-source projects that implement even inline deduplication. Perhaps it would be good to get an idea of how others do this:
Burp: http://burp.grke.org/
Duplicati: http://www.duplicati.com/
Both projects support deduplication, including inline deduplication (or are implementing it).
RTavernier

2016-12-22 10:02

reporter   ~0002484

I'm digging up this old thread because I have the same need.
I do disk-to-disk-to-disk backups, to a ZFS partition and a Quantum DXi.
Both have native deduplication activated.

I'd like to know whether the volume format is still not deduplication-friendly, or whether there is now a solution.
joergs

2016-12-22 14:25

developer   ~0002488

Apart from some internal tests, there is no news on this subject. It might be something to add to https://www.bareos.com/en/co-funding.html so it gets more priority.
bruno-at-bareos

2023-08-02 17:22

manager   ~0005308

Closing due to inactivity.

Issue History

Date Modified Username Field Change
2014-01-10 09:06 tonyalbers New Issue
2014-01-10 09:06 tonyalbers File Added: Screenshot-19.png
2014-01-10 15:12 mvwieringen Note Added: 0000785
2014-01-10 15:13 mvwieringen Assigned To => mvwieringen
2014-01-10 15:13 mvwieringen Status new => feedback
2014-01-10 16:36 tonyalbers Note Added: 0000786
2014-01-10 16:36 tonyalbers Status feedback => assigned
2014-01-23 02:06 jvaughn Note Added: 0000793
2014-01-23 02:52 jvaughn Note Edited: 0000793
2014-01-27 22:00 jvaughn Note Added: 0000799
2014-01-28 15:30 tonyalbers Note Added: 0000800
2014-01-28 16:16 mvwieringen Note Added: 0000801
2014-01-29 10:42 tonyalbers Note Added: 0000802
2014-02-17 19:59 jvaughn Note Added: 0000817
2014-05-16 17:49 mvwieringen Assigned To mvwieringen =>
2014-05-16 17:49 mvwieringen Status assigned => acknowledged
2014-11-06 22:35 DavidHache Note Added: 0001042
2014-11-07 10:19 tonyalbers Note Added: 0001044
2014-11-07 10:21 tonyalbers File Added: reducratio.png
2014-11-07 11:18 mvwieringen Note Added: 0001047
2014-11-07 15:47 DavidHache Note Added: 0001054
2015-02-05 21:05 joe Note Added: 0001262
2016-12-22 10:02 RTavernier Note Added: 0002484
2016-12-22 14:25 joergs Note Added: 0002488
2023-08-02 17:22 bruno-at-bareos Assigned To => bruno-at-bareos
2023-08-02 17:22 bruno-at-bareos Status acknowledged => closed
2023-08-02 17:22 bruno-at-bareos Resolution open => suspended
2023-08-02 17:22 bruno-at-bareos Note Added: 0005308