View Issue Details

ID: 0000268    Project: bareos-core    Category: storage daemon    View Status: public    Last Update: 2023-08-02 17:22
Reporter: tonyalbers    Assigned To: bruno-at-bareos
Priority: normal    Severity: feature    Reproducibility: always
Status: closed    Resolution: suspended
Platform: All    OS: All    OS Version: All
Product Version: 13.2.2
Summary: 0000268: RFE for disabling inline compression routine in sd to take advantage of deduplication filesystems
Description: To the best of my knowledge, bareos-sd applies some sort of compression or whitespace-removal routine to achieve the highest possible write speed on tape drives.
The same feature is active when writing to file storage, and it prevents dedupe filesystems such as ZFS, Opendedup and SNFS from working effectively.
Now, when using dedupe, you can usually get a reduction ratio of as much as 20:1, which can be even higher if compression is also used.

Steps To Reproduce: Try to use a dedupe filesystem as storage for the SD, and you will see that the data reduction ratio is only a few percent.

Additional Information: See the attached screenshot, which is from a Quantum DXi using SNFS with dedupe and compression enabled. The total data reduction ratio is 30:1, which means this specific box holds 291 TB of backup data on 11 TB of raw disk.

Tags: No tags attached.

Activities

tonyalbers

2014-01-10 09:06

reporter  

Screenshot-19.png (35,704 bytes)   
mvwieringen

2014-01-10 15:12

developer   ~0000785

You seem to know more than I do about the storage daemon, as I don't recall ever seeing code that does any compression in the SD or whitespace removal.

With version 13.4 we did add a plugin that allows you to do on-the-fly compression and decompression of data in the SD, but that is optional.

The only compression that was performed up until now was in the filed (the file daemon). There is also a sparse-file option there that does whitespace removal when it detects a block containing only nulls.
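
As an illustration only, here is a minimal sketch of that null-block check in plain Python (this is not the actual bareos-fd code, and the 64 KiB block size is just an assumed value):

    # Sketch: skip blocks that contain nothing but null bytes, the way a
    # sparse-file option avoids storing "whitespace". The block size is an
    # assumption for illustration, not what bareos-fd really uses.
    BLOCK_SIZE = 64 * 1024

    def is_null_block(block: bytes) -> bool:
        return block.count(0) == len(block)

    def scan(path: str) -> None:
        with open(path, "rb") as f:
            offset = 0
            while block := f.read(BLOCK_SIZE):
                action = "skip (all nulls)" if is_null_block(block) else "store"
                print(f"offset {offset:>12}: {action}")
                offset += len(block)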

The autodeflate SD module is exactly what should keep you from storing compressed data on your filesystem (or on your hardware-compression tape drive), as it allows you to decompress data before writing it to disk or tape.

I think the most important problem is that the current on-disk format is the same as the tape format, which means the metadata is written into the same file as the actual file data, and that mixing makes deduping hard.

There are ways to split out the metadata, and you would probably have to block-align the real data too. The problem, however, is that this is a solution patented by Bacula Systems, so it would be hard to implement anything without touching that patent.
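
To illustrate the alignment problem, here is a toy sketch in plain Python with made-up header and block sizes (real Bareos volumes and filesystem dedup engines are more involved): a block-based dedup engine hashes fixed-size blocks, so interleaving even a small amount of metadata shifts every following block and nothing matches, while splitting the metadata out and block-aligning the payload lets every data block dedup.

    import hashlib
    import os

    BLOCK = 128 * 1024  # hypothetical dedup block size

    def block_hashes(data: bytes) -> set[str]:
        """Hash fixed-size blocks the way a block-based dedup engine might."""
        return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
                for i in range(0, len(data), BLOCK)}

    payload = os.urandom(1024 * 1024)  # 1 MiB of file data shared by both copies

    # Same payload, preceded by differently sized metadata, so the file data
    # no longer starts on the same block boundary in the two "volumes".
    volume_a = b"A" * 100 + payload
    volume_b = b"B" * 300 + payload
    print(len(block_hashes(volume_a) & block_hashes(volume_b)))   # 0 shared blocks

    # If the metadata is padded out to a block boundary (or kept in a separate
    # file), the payload blocks line up and dedup perfectly.
    def pad(b: bytes) -> bytes:
        return b + b"\0" * ((-len(b)) % BLOCK)

    aligned_a = pad(b"A" * 100) + payload
    aligned_b = pad(b"B" * 300) + payload
    print(len(block_hashes(aligned_a) & block_hashes(aligned_b)))  # 8 shared blocks
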
tonyalbers

2014-01-10 16:36

reporter   ~0000786

"You seem to know more then I do about the storage daemon as I don't recall
ever seeing code that does any compression in the SD or white space removal."

Heh.. Yeah right ;)

I was referring to what you mentioned as the merging of metadata and file data.

I'm not a developer, so I have no idea what it takes to, say, build a storage daemon that does not merge the data, but IMO it would be very beneficial.

Yes, I remember you mentioned the patent earlier, but I'm talking about saving the backup data and the metadata to two (or more) separate files and then letting the underlying filesystem do the deduplication. I honestly don't think Bacula can patent that, but again, I don't know the details. I'm just used to supporting EMC NetWorker, Symantec NetBackup, DataDomain and Quantum DXi, where dedupe is more the rule than the exception.
jvaughn

2014-01-23 02:06

reporter   ~0000793

Last edited: 2014-01-23 02:52

Is splitting the metadata and block-aligning the files to the underlying filesystem for dedup the method that Bacula patented?

Could the MD5/SHA1 signature system already in place for file verification be used to perform at least file-level dedup during backup? I.e., if we reach the end of a file and find that its signature (and possibly its partial, path-less filename) matches a file we have already backed up, we just write out some metadata saying "go look at this earlier file for this file's data" (sort of like a hardlink, except it exists only in the metadata and would be restored as a separate file) instead of writing the file data itself.

This would give file-level dedup regardless of the backup storage medium or the OS involved... or is all of this (seemingly obvious to this armchair architect) already patented eight ways to Sunday?

We'd love to get file-level dedup ourselves for our backup usage scenario, as we have quite a few developers who have more-or-less identical copies of git repos checked out in their respective directories, with the only changes being the portions they are actively working on; there could be some massive space savings if we could get dedup'd backups. Each developer has on average about 70GB of checked-out repos, but most of that is duplicated with the others; in total it is currently 1.2TB, uncompressed, as-is on the filesystem.
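
As a rough illustration of that file-level idea (plain Python; the "catalog" here is just an in-memory dict, and none of this reflects how Bareos's catalog or FD/SD protocol actually work):

    import hashlib
    from pathlib import Path

    # Map (size, SHA-1) of already stored content to the path of its first copy.
    seen: dict[tuple[int, str], str] = {}

    def backup(path: Path) -> str:
        data = path.read_bytes()
        key = (len(data), hashlib.sha1(data).hexdigest())
        if key in seen:
            # Store only a reference, like the metadata-only "hardlink" above.
            return f"dedup: {path} -> same content as {seen[key]}"
        seen[key] = str(path)
        return f"stored: {path} ({len(data)} bytes)"

    for p in sorted(Path(".").rglob("*.txt")):
        print(backup(p))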

jvaughn

2014-01-27 22:00

reporter   ~0000799

Having now looked in more detail at how the file daemon is architected and how it communicates with the director and the SD, I understand that the above suggestion is pretty useless as-is. The FD would have to keep its own records of the file information used to determine duplicates (i.e. file size and SHA1), or have access to the catalog, and it would have to read each file fully to calculate the SHA1 before deciding whether to send it to the SD (reading it twice whenever it isn't a dupe, though for non-large files the second read would often be served from cache).
tonyalbers

2014-01-28 15:30

reporter   ~0000800

Don't get sidetracked here, guys. Although I'd love to see Bareos implement some sort of dedupe feature of its own, this RFE is not about that.
What I want to do is to be able to write data from the SD to a filesystem in a format that allows the filesystem itself to do the deduplication. This is how EMC NetWorker does it, and it allows you to use any supported filesystem on your backend. Symantec NetBackup does it exactly the same way, unless you use PureDisk, which in essence is a fileserver that does the dedupe.
Is the autodeflate module available? I'd like to try it out on e.g. ZFS.
mvwieringen

2014-01-28 16:16

developer   ~0000801

Both options are interesting, but at the moment most of these things have some serious patent issues, so we need to be very careful: the whole dedup landscape is kind of seeded with patents, so we need to check that out first.

As for the autodeflate plugin: yes, it is available in master, but I don't see the relationship to ZFS. You want to inflate the data and then let ZFS compress it again? The idea of the plugin is to be used when writing to tape drives with hardware compression (i.e. first inflate the data so that the hardware compression can do its work), or when using a filesystem that doesn't do compression (i.e. first compress the data before writing it out to disk). This way we can do things NetWorker and NetBackup cannot do, e.g. compress data on the fly.
tonyalbers

2014-01-29 10:42

reporter   ~0000802

"Both options are interesting but at this moment most of these things have
some serious patent issues so we need to be very careful as the whole dedup
landscape is kind of seed with patents so we need to check that out first."
Well, AFAIK deduplication as a technology is not globally patented, there are several ways to do it. And FS's like zfs and SDFS(OpenDeDup) are open-source.
And again: I don't want Bareos to do any dedupe, I want the SD's underlying storage filesystem to do that.

"you want to inflate the data and then let zfs
compress it again ?"
Yes, to take advantage of the fact that ZFS can do variable-length block-based deduplication followed by compression. This is way more effective than compression alone.

What I'd like to do is something similar to running dump on a standard filesystem a number of times, writing to separate dumpfiles on a ZFS filesystem with dedupe and compression turned on. Every single block is copied every time I do it, but ZFS discards all the blocks it already knows about and replaces them with internal pointers. So if I dump a single 100MB filesystem to seven different dumpfiles, these dumpfiles take up 700MB on a standard filesystem, but only a little more than 100MB on ZFS with dedupe, and even less when compression is enabled.
jvaughn

2014-02-17 19:59

reporter   ~0000817

I apologize for having derailed your original feature request - the two methods (letting the filesystem dedup vs. having Bareos do the dedup) are entirely separate, and letting the filesystem do it will be much simpler to implement on Bareos' side. It just doesn't work for us, since we're using consumer / "prosumer" NAS devices for storing the backups, we can only access them via NFS, and they don't support running ZFS natively. We might be able to run ZFS via loopback (haven't looked into that), though...

The request for dedup in Bareos itself should be split off into another issue. I may do that once I have a few free minutes.
DavidHache

2014-11-06 22:35

reporter   ~0001042

I just stumbled on this topic and it's exactly what I am trying to do.
I am currently running a 1TB backup on my systems and need to send it over the WAN.
Currently, for testing, I am using local backups.

I run a full with 1TB of data, and then incrementals. Then I do virtual fulls to avoid pushing the data over the WAN. This works quite well. The problem is that, to do so, I need an extra 1TB of disk space. In my small-scale environment this is fine, but if I want to scale it up, it won't be practical or economical.

I am currently backing up to disk on a ZFS file system with deduplication enabled.

I would be interested in seeing the patents they hold for the "deduplication". My large-scale clients use Commvault, NetBackup, Nexenta, and NetApp. All of these products use some variant of deduplication, but in reality they all do the same thing: hash a block of data, put it in a table, and compare.

It would be great to push more on this issue, as I would then be able to recommend this product easily. Who would say no to a 90:1 dedupe ratio on a 12-week cycle to disk with a mostly Windows/RHEL environment?
tonyalbers

2014-11-07 10:19

reporter   ~0001044

David, that's my point exactly. And I've just added a screenshot from a DataDomain box which is a dedupe appliance. The backup software used is EMC NetWorker. Clients are a mix of Linux and Windows machines. Note the data reduction ratio. The Bareos developers should _seriously_ consider making it possible to use dedupe filesystems/appliances for storage.
tonyalbers

2014-11-07 10:21

reporter  

reducratio.png (7,350 bytes)   
mvwieringen

2014-11-07 11:18

developer   ~0001047

It's an open-source project, so take a stab at it, I would say.

The whole dedup area is a minefield with patents all over the place, so unless you have a pool of money for a patent lawyer it is hard to move on this. It's something we will eventually have to tackle, but there has not been a lot of business incentive.
DavidHache

2014-11-07 15:47

reporter   ~0001054

I will go read up on the modules that write the data to disk.
See what I can do.
joe

2015-02-05 21:05

reporter   ~0001262

As this is the feature I'm missing most, it would be great to see something implemented. I know it won't be easy, but there are other open-source projects that implement even inline deduplication. Perhaps it would be good to get an idea of how others do this:
Burp: http://burp.grke.org/
Duplicati: http://www.duplicati.com/
Both projects support deduplication, including inline deduplication (or are implementing it).
RTavernier

2016-12-22 10:02

reporter   ~0002484

I'm digging up this old thread because I have the same need.
I do disk-to-disk-to-disk backups, to a ZFS partition and a Quantum DXi.
Both have native deduplication activated.

I'd like to know whether the volume format is still not deduplication-friendly, or whether there is now a solution.
joergs

2016-12-22 14:25

developer   ~0002488

Apart from some internal tests, there is no news on this subject. It might be something to add to https://www.bareos.com/en/co-funding.html so it gets more priority.
bruno-at-bareos

2023-08-02 17:22

manager   ~0005308

Closing due to inactivity.

Issue History

Date Modified Username Field Change
2014-01-10 09:06 tonyalbers New Issue
2014-01-10 09:06 tonyalbers File Added: Screenshot-19.png
2014-01-10 15:12 mvwieringen Note Added: 0000785
2014-01-10 15:13 mvwieringen Assigned To => mvwieringen
2014-01-10 15:13 mvwieringen Status new => feedback
2014-01-10 16:36 tonyalbers Note Added: 0000786
2014-01-10 16:36 tonyalbers Status feedback => assigned
2014-01-23 02:06 jvaughn Note Added: 0000793
2014-01-23 02:52 jvaughn Note Edited: 0000793
2014-01-27 22:00 jvaughn Note Added: 0000799
2014-01-28 15:30 tonyalbers Note Added: 0000800
2014-01-28 16:16 mvwieringen Note Added: 0000801
2014-01-29 10:42 tonyalbers Note Added: 0000802
2014-02-17 19:59 jvaughn Note Added: 0000817
2014-05-16 17:49 mvwieringen Assigned To mvwieringen =>
2014-05-16 17:49 mvwieringen Status assigned => acknowledged
2014-11-06 22:35 DavidHache Note Added: 0001042
2014-11-07 10:19 tonyalbers Note Added: 0001044
2014-11-07 10:21 tonyalbers File Added: reducratio.png
2014-11-07 11:18 mvwieringen Note Added: 0001047
2014-11-07 15:47 DavidHache Note Added: 0001054
2015-02-05 21:05 joe Note Added: 0001262
2016-12-22 10:02 RTavernier Note Added: 0002484
2016-12-22 14:25 joergs Note Added: 0002488
2023-08-02 17:22 bruno-at-bareos Assigned To => bruno-at-bareos
2023-08-02 17:22 bruno-at-bareos Status acknowledged => closed
2023-08-02 17:22 bruno-at-bareos Resolution open => suspended
2023-08-02 17:22 bruno-at-bareos Note Added: 0005308