Obnam command line interface
I have some specific ideas for the command line interface I'm planning for my backup program. I'll be writing a man page for obnam, but before I do that, here's a sketch.
    obnam backup --store sftp://example.com/~/backups/ $HOME
    obnam ls --generation latest
    obnam verify
    obnam fsck
    obnam restore --generation latest --to /var/tmp/liw.restore
    obnam forget --keep 1h:7d:5w:12m:99y
The backup command should be obvious. I'll make a configuration file so the location of the backup store can be specified there, rather than every time on the command line. The same goes for other arguments, such as the directories to back up.
The ls command lists the contents of a backup generation.
The verify command compares what has been backed up with what is on the hard disk now, reporting differences. If you back up and then immediately verify, you can check that everything got backed up. Verify will also be able to do things like compare randomly selected files (rather than all of them). I am not yet sure exactly how the verification process should work to make the result trustworthy.
fsck checks that the internal data structures in the backup store are OK.
restore restores.
forget removes old backup generations. It will be able to remove specific generations, or apply a policy such as "keep one hour, seven daily, five weekly, twelve monthly, and lots of yearly generations". It will be cheap to keep lots of generations, since obnam will do heavy de-duplication, at the block level.
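To make the policy idea concrete, here is a minimal sketch of how a string like 1h:7d:5w:12m:99y could be interpreted. The function names and the exact semantics are my assumptions, not obnam's actual behaviour: for each "N unit" item, the newest generation in each of the N most recent hours, days, ISO weeks, months or years that contain a generation is kept.

```python
import re
from datetime import datetime

# Hypothetical mapping from a policy unit letter to a function that
# says which hour/day/week/month/year a generation falls into.
BUCKET_KEYS = {
    "h": lambda t: (t.year, t.month, t.day, t.hour),
    "d": lambda t: (t.year, t.month, t.day),
    "w": lambda t: tuple(t.isocalendar())[:2],  # ISO year and week number
    "m": lambda t: (t.year, t.month),
    "y": lambda t: (t.year,),
}


def parse_policy(spec):
    """Turn '1h:7d' into [(1, 'h'), (7, 'd')]."""
    rules = []
    for part in spec.split(":"):
        match = re.fullmatch(r"(\d+)([hdwmy])", part)
        if match is None:
            raise ValueError(f"bad policy item: {part!r}")
        rules.append((int(match.group(1)), match.group(2)))
    return rules


def generations_to_keep(timestamps, spec):
    """Return the set of generation timestamps that the policy keeps.

    Assumed semantics: for each rule, walk the generations from newest
    to oldest and keep the first one seen in each new bucket, until
    'count' buckets have been filled.
    """
    newest_first = sorted(timestamps, reverse=True)
    keep = set()
    for count, unit in parse_policy(spec):
        bucket_of = BUCKET_KEYS[unit]
        seen = set()
        for ts in newest_first:
            bucket = bucket_of(ts)
            if bucket in seen:
                continue  # a newer generation already covers this bucket
            if len(seen) == count:
                break  # this rule has kept as many buckets as it may
            seen.add(bucket)
            keep.add(ts)
    return keep
```

Anything not in the returned set would be a candidate for removal. Whether "seven daily" should count calendar days or only days that actually have a generation is a design choice; this sketch takes the second reading.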
For verify, I'd presume some sort of random choice by age grouping (i.e. covering older files, middle-aged files and new ones) as well as by a variety of different directories would help. Maybe you just tell the tool the maximum amount of time/data transfer to spend on verification and off it goes? One gotcha I have seen with backups is permissions problems. You run the backup as an ordinary user (it's using your ssh keys etc) and don't notice that it failed to pick up some files, maybe because it didn't complain, or maybe because it complains about so many permissions problems that you don't see what you care about in the noise. Then later you discover that actually /etc/shadow or /var/lib/couchdb would have been nice to have backed up.
Have you considered an i/dnotify style listener? Instead of remembering to schedule backups, you leave the tool running, pointed at a set of directories you care about, and it automatically picks up changed files and backs them up in the near future. (You could obviously also do this with a separate unrelated monitor program repeatedly invoking the command line. I have no idea which approach is better.)
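The age grouping plus transfer budget idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name, the three age bands (under a week, under a year, older), and the choice of a byte budget rather than a time budget.

```python
import random

DAY = 86400  # seconds


def pick_files_to_verify(files, now, max_bytes, rng=None):
    """Pick a random sample of files, spread over age groups.

    files is a list of (path, mtime, size) tuples, with mtime and now
    in seconds since the epoch. At most max_bytes of data is selected.
    """
    rng = rng or random.Random()
    groups = {"new": [], "middle": [], "old": []}
    for path, mtime, size in files:
        age = now - mtime
        if age < 7 * DAY:
            name = "new"
        elif age < 365 * DAY:
            name = "middle"
        else:
            name = "old"
        groups[name].append((path, size))
    for members in groups.values():
        rng.shuffle(members)

    chosen = []
    spent = 0
    # Take turns between the groups so each age band gets a share
    # of the budget; files that no longer fit are skipped.
    while any(groups.values()):
        for members in groups.values():
            if not members:
                continue
            path, size = members.pop()
            if spent + size <= max_bytes:
                chosen.append(path)
                spent += size
    return chosen
```

Grouping by directory as well would work the same way, with the directory name as part of the group key.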
On the deduping side one problem is that some files are more important to me than others. Not only do I want the contents backed up, but I want the backup to be able to survive some number of failures of the backup data for that file. For example I'd want to ensure that there is deliberate duplication of the blocks making up some directories (tax returns, password lists, ssh/pgp stuff) so that a single bad sector on the backup server doesn't lose what is most important to me.
Verify: yeah, many ways of choosing what data to verify are going to be needed. My code already has plugin and hook systems, and I'll make it possible to extend the verification to pretty much anything, just in case my own ideas aren't universal.
Permission problems: failure to back up or verify something because of permissions needs to be treated as a failure.
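As a rough illustration of "treat it as a failure", the sketch below walks a directory tree, records every file or directory that cannot be read, and turns a non-empty list into a non-zero exit status. The names find_unreadable and exit_status are hypothetical, and the opener parameter exists only so the behaviour can be exercised without real permission problems.

```python
import os


def find_unreadable(root, opener=open):
    """Return a list of (path, reason) for everything that can't be read."""
    failures = []

    def on_walk_error(err):
        # os.walk calls this when it cannot list a directory.
        failures.append((err.filename, err.strerror))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with opener(path, "rb"):
                    pass  # opening is enough to detect a permission error
            except OSError as err:
                failures.append((path, err.strerror))
    return failures


def exit_status(failures):
    """Any unreadable path makes the whole run count as failed."""
    return 1 if failures else 0
```

Collecting the failures and reporting them together at the end, instead of scattering warnings through the output, is one way to avoid the "lost in the noise" problem mentioned above.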
inotify/dnotify: certainly. I have no idea yet what is going to be the best way of using them, but they're certainly on the roadmap (or would be, if I had time to write the roadmap down).
Deduping/redundancy in the backup store: I could be talked into making it configurable how much de-duping needs doing. For my own needs, I would rather solve it by having backups of backups. I'm also thinking of having a plugin that adds some error correcting codes to the backup store.
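To show what "configurable" might mean here, this is a toy in-memory sketch, not obnam's storage format: blocks are keyed by their SHA-256 digest so identical blocks are stored once, and a hypothetical copies parameter asks for extra copies of the blocks of important files. The class name, block size and copies parameter are all assumptions; in a real store the copies would need to live in different places on disk to be of any use.

```python
import hashlib


class BlockStore:
    """Toy de-duplicating block store with optional deliberate redundancy."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> list of stored copies of that block

    def put(self, data, copies=1):
        """Store data, returning the list of block digests that describe it."""
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            stored = self.blocks.setdefault(digest, [])
            # Only add copies if fewer exist than this caller asks for.
            while len(stored) < copies:
                stored.append(bytes(block))
            digests.append(digest)
        return digests

    def get(self, digests):
        """Rebuild data, skipping any copy whose checksum no longer matches."""
        parts = []
        for digest in digests:
            for candidate in self.blocks[digest]:
                if hashlib.sha256(candidate).hexdigest() == digest:
                    parts.append(candidate)
                    break
            else:
                raise ValueError(f"all copies of block {digest} are damaged")
        return b"".join(parts)
```

With copies=1 a single damaged block loses the data; with copies=2 the store can fall back to the second copy. Error correcting codes would be a more space-efficient way to get a similar effect.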