I've decided to resurrect development of my backup program, Obnam. This time I thought I'd babble about it in public as I develop it, rather than try to present the world with a finished product.

I have not been happy with any backup solution I've tried. I have some fairly specific requirements:

  • Backups must be stored either on a local hard disk, or online. I don't care at all about tapes, optical media, or anything else that requires repetitive manual work.
  • Server end must be under my control as well. No Amazon S3 for me.
  • Both push and pull backups.
  • Backups must be encrypted at client end.
  • Backups must be incremental, but each generation must look like a full snapshot.
  • Backups must use checkpoints: network connections break, and if they do, the next backup must continue from the most recent checkpoint.
  • Setup must be easy. Backups are important, but if they're any kind of pain at all, I and most others will just postpone them to a future day, and one day it will be too late.
  • Fast. If I do some e-mail and write some code while drinking a smoothie in a net cafe, by the time I finish the drink and put away the laptop, the backup must be finished.
  • Deals sensibly with both slow and fast networks. An incremental backup should not download any data from the server, and should only upload the delta from the previous backup, plus minimal overhead.
  • Reliable. Backups should not require attention. I should just be allowed to assume they work. This also requires unobtrusive feedback that they're OK, and proper error reporting when something is wrong and does require my attention.
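
To make the "incremental, but each generation looks like a full snapshot" requirement concrete, here is a minimal sketch in Python. It is not Obnam's design, which has not been decided yet. The names `make_generation`, `previous` and `store`, and the use of SHA-256 over whole files, are assumptions made purely for illustration. Each generation is a complete path-to-digest map, but only files whose digest differs from the previous generation are uploaded, and the decision uses only the client's record of the previous generation, so nothing is downloaded.

```python
import hashlib


def make_generation(previous, current_files, store):
    """Create a new backup generation (illustrative sketch only).

    previous:      dict mapping path -> digest for the last generation,
                   kept on the client so nothing is downloaded.
    current_files: dict mapping path -> bytes, the files as they are now.
    store:         dict mapping digest -> bytes, standing in for the server.

    Returns (generation, uploaded): the full path -> digest map for the new
    generation, and the list of paths whose contents had to be uploaded.
    """
    generation = {}
    uploaded = []
    for path, data in current_files.items():
        digest = hashlib.sha256(data).hexdigest()
        if previous.get(path) != digest:
            # New or changed file: this is the delta that gets uploaded.
            store[digest] = data
            uploaded.append(path)
        # Unchanged files are still listed, so the generation is complete.
        generation[path] = digest
    return generation, uploaded
```

Restoring any generation then needs only that generation's map and the store; earlier generations never have to be replayed. Checkpointing and encryption are left out of this sketch.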

It's been a while since I did a proper survey, so things may have changed since, but so far, I've never found a system that I like. If you know of one, please don't tell me. I am now deep into thinking about the technical problems I will need to solve, and not that interested in finding an existing solution anymore.

If "hubris" was spelled with an i, it would be my middle name.

I have some code sketched out, but nothing that does anything useful yet. I've been playing with the internal architecture, and the interface and abstraction I will want for the "storage subsystem" that stores the backed-up data. I have not decided yet how to implement the storage subsystem, but btrfs B-trees interest me a lot.

Some more concerns

You need to consider restoring and testing. For example, what are the dependencies of the restore program: will it run on any recent Linux LiveCD, or do you have to have a bazillion libraries installed and only the latest 64-bit Ubuntu? Is there some sort of simple integration possible with Nautilus?

For testing, at some point you want to validate your backups, probably by pointing to a backup and to a local machine and having the test tell you how they diverge.

You also haven't mentioned what should be done about multiple machines (i.e. if you want to back up more than one). Using block hashes (see recent postings about ZFS deduplication settings), it becomes a lot cheaper to back up multiple machines into the same space, since they will have many blocks in common. And once you consider laptops, desktops, servers, virtual machines, etc., you soon find you have a lot of "computers", and it would be great to be able to back them all up conveniently.

Comment by rogerbinns.com Mon Jan 11 08:16:43 2010
comment 2

I don't actually find it particularly relevant what the dependencies are for restoring data, but anyway, the dependencies will be the same as for backup: python, paramiko, gpg. The reason the dependencies are not that relevant is that if you can boot off a live-cd, dependencies are easy to fulfill, and I would not be stupid enough to require particular hardware (64-bit in your example).

Validation falls under reliability.

De-duplication falls under "fast". If a chunk of data has already been backed up, in another file or by another computer, then it obviously should not have to be uploaded again.
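
As a rough illustration of that idea (a sketch only, not the actual Obnam storage subsystem), a store that is keyed by the hash of each chunk uploads a given chunk at most once, whichever file or host it comes from. The `ChunkStore` class, the fixed 4-byte chunk size and the use of SHA-256 are all assumptions chosen to keep the example tiny, and the sketch ignores how this would interact with client-side encryption.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, purely for illustration


class ChunkStore:
    """Hypothetical content-addressed store shared by all clients."""

    def __init__(self):
        self.chunks = {}         # digest -> chunk bytes
        self.uploaded_bytes = 0  # how much data actually had to be sent

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        if key not in self.chunks:
            # Only chunks the store has never seen are uploaded.
            self.chunks[key] = data
            self.uploaded_bytes += len(data)
        return key


def backup_file(store, data, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and return the list of chunk keys."""
    return [
        store.put(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def restore_file(store, keys):
    """Reassemble a file from its list of chunk keys."""
    return b"".join(store.chunks[k] for k in keys)
```

If two hosts back up files that share most of their chunks into the same store, the second backup only adds the chunks that differ.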

Any number of computers will of course work.

Comment by liw.fi Mon Jan 11 08:52:01 2010
bzr's 2a format
Has many of the internal properties you'd need. It has some extra data, so I wouldn't use it as-is.
Comment by lifeless [launchpad.net] Tue Jan 12 05:28:20 2010
bzr format docs?
@lifeless: is there high-level documentation for the 2a format? I've seen one presentation by you on it, and that was inspiring to me for a previous incarnation of obnam, but didn't seem like it would let me easily re-use chunks of data from any version of any file from any host using the same backup store. Also, I'd like to know if the bzr format would work well with encryption.
Comment by liw.fi Tue Jan 12 09:01:22 2010