I've released Obnam version 1.1. The release is dated 2012-06-30, but I'm only announcing it now because I had trouble building the packages for code.liw.fi.
- Mark the `--small-files-in-btree` settings as deprecated.
- Obnam now correctly checks that `--repository` is set.
- Options in `--help` output are now grouped in random senseless ways rather than being in one randomly ordered group.
- Manual page clarification for `--root` and `verify`. Thanks, Saint Germain.
- Remove outdated section from manual page explaining that there is no format conversion. Thanks, Elrond of Samba-TNG.
- Added missing information about specifying a user in sftp URLs. Thanks, Joey Hess, for pointing it out.
- Manual page clarification on `--keep` from Damien Couroussé.
- Make `obnam forget` report which generations it would remove without `--pretend`. Thanks, Neal Becker, for the suggestion.
I've been a dump/restore user for years, but with the arrival of filesystems like btrfs I found myself looking for alternatives, which brought me to obnam.
While I love the idea of deduplication, the fact is my time is more precious than my disk space, and unfortunately obnam is just far too slow to be useful.
For example, I just dumped a 75GB (used) ext4 filesystem from an LVM2 snapshot, four times, twice with dump (level 0 and 1 respectively) and twice with obnam (first full then a diff the next day). The results were disappointing to say the least.
The level 0 dump, from about two months ago, took about 6.5 hours (slow netbook and USB2 backup drive). Today's level 1 dump took 55 minutes for a diff of about 11GB.
By comparison, yesterday's full (i.e. first) obnam backup took 8 hours on the same machine to the same USB2 drive, and today's diff, which comprised a whopping total of 50MB, took a staggering ... 5.5 hours.
Seriously, 5.5 hours ... for 50MB?!?!?
Sorry, dump may be getting a bit long in the tooth, and it doesn't do delta deduplication (only the complete changed file), but at least the backup time is proportional to the size of the diff, which means most of the time it completes in just a few minutes.
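For comparison, the figures quoted above can be turned into effective throughput. The following is a rough Python sketch using only the numbers from this comment. It assumes 1 GB = 1024 MB and that both full backups covered the whole 75GB of used space, which the comment does not state exactly. The `runs` table and the function name are made up for illustration.

```python
# Figures quoted in the comment above: (size in MB, duration in hours).
# Assumption: 1 GB = 1024 MB, and both full backups covered all 75 GB.
runs = {
    "dump level 0": (75 * 1024, 6.5),
    "dump level 1": (11 * 1024, 55 / 60),
    "obnam full": (75 * 1024, 8.0),
    "obnam incremental": (50, 5.5),
}


def throughput_mb_per_s(size_mb, hours):
    """Effective throughput: data written divided by wall-clock time."""
    return size_mb / (hours * 3600)


for name, (size_mb, hours) in runs.items():
    print(f"{name:18s} {throughput_mb_per_s(size_mb, hours):8.4f} MB/s")
```

On these numbers the level 1 dump moves a little over 3 MB/s, while the 50MB obnam run works out to a few KB/s, which suggests the time went into something other than copying changed data.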
I can put up with long monthly backups and the protracted (but very rare) restores, for the sake of spending less time and computer resources on consistently long daily backups.
In fact, I just realised that if I were to try to back up my 2TB media drive using obnam, I wouldn't be able to do daily backups at all, because it'd take more than 24 hours to do each backup, even if the daily diff was only a few MB!
I'm going back to ext* filesystems and sticking with dump.
Thanks anyway.
Yes, Obnam is slow in many situations. On the other hand, it's fast in others. My work laptop (around 200 gigabytes of data, up to gigabytes of delta) usually takes about two minutes to make a daily backup to another machine in the office. The speed of Obnam backups depends not simply on the size of the data, but also on the number of files, and probably on other factors.
The speed issue is also well known, and something I acknowledge. I decided to make Obnam correct before making it fast.
The important thing is not whether you use Obnam, but that you have a working backup system.
Could you explain in plain English (my Python sucks) the basic methodology you're using to calculate, compare and store deltas, as I'm interested in learning more about this process, with a view to experimenting with different approaches to minimize overhead and improve speed (if possible)?
Without knowing the details, I suspect any deduplication method entails an unavoidably expensive overhead, since even unmodified sources (mtime) may contain blocks duplicated in new targets, and naturally the only way you can know is by checking both (which is why my daily backup took nearly as long as the full/initial backup).
Perhaps limiting deduplication to individual files might help, because then you could ignore anything that hadn't been modified, only store the per-file deltas on modified files, and new files in full without deduplication (until later modified). This is less space-efficient but much faster. As it stands, I believe you're deduplicating the entire data set en masse, in a series of small, unique chunks, which necessitates checking the entire data set each time, comparing each chunk to all the others in the data set. Correct me if I'm wrong; this is just my initial impression.
Are you comparing the actual data or just the hashes? Are you using a catalogue/database or working directly on the data? What sorting algorithm are you using? Would any of this benefit from being rewritten in C + inline assembler? Are you making use of threads and/or SMP? These are all factors that affect speed.
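To make the "actual data or just the hashes" question concrete, here is a minimal Python sketch of one general approach: chunks are indexed by their hash, so checking whether a chunk is already stored is a single lookup rather than a comparison against every other chunk, and files whose mtime and size are unchanged are skipped without being read. This is only an illustration under stated assumptions, not a description of how Obnam works. The `ChunkStore` class, its methods, the fixed chunk size, and the in-memory dictionaries are all hypothetical.

```python
import hashlib
import os


class ChunkStore:
    """Toy in-memory chunk store keyed by SHA-256 digest (illustrative only)."""

    def __init__(self, chunk_size=64 * 1024):
        self.chunk_size = chunk_size  # hypothetical fixed chunk size
        self.chunks = {}              # digest -> chunk bytes
        self.files = {}               # path -> (mtime_ns, size, [digests])

    def add_chunk(self, chunk):
        """Store a chunk if unseen; return (digest, was_new)."""
        digest = hashlib.sha256(chunk).hexdigest()
        # One dictionary lookup; no pairwise comparison with other chunks.
        was_new = digest not in self.chunks
        if was_new:
            self.chunks[digest] = chunk
        return digest, was_new

    def backup_file(self, path):
        """Back up one file; return the number of chunks newly stored."""
        st = os.stat(path)
        previous = self.files.get(path)
        if previous is not None and previous[:2] == (st.st_mtime_ns, st.st_size):
            return 0  # metadata unchanged: skip without reading the file
        digests, new_chunks = [], 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(self.chunk_size)
                if not chunk:
                    break
                digest, was_new = self.add_chunk(chunk)
                digests.append(digest)
                new_chunks += was_new  # True counts as 1
        self.files[path] = (st.st_mtime_ns, st.st_size, digests)
        return new_chunks
```

In a scheme like this, the cost of an incremental run is dominated by walking the file tree and reading the changed files, so the number of files matters as well as the amount of changed data.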
This isn't a subject I've previously given much thought, but you've piqued my interest. It seems like an interesting challenge.
I am afraid your assumptions about how Obnam works are rather mistaken. See http://liw.fi/obnam/ondisk/ for a description of the Obnam on-disk data structures. It should clarify how de-duplication happens, and I hope you find it illuminating.
If you have further questions, please use the Obnam mailing list.