Obnam storage API
The central data structure in Obnam is the way it stores backed up data on disk. This is the area I have struggled with most in the four years I've been sporadically developing Obnam.
My initial attempt was roughly this: everything was put in the backup store as a sort of object, which I'll call backup object. This included file contents, deltas between versions of a file, file metadata, and filenames. While the representation was quite different, essentially each of these objects was a list of key-value pairs:
file:
id = 12765
basename = "/home/liw/foobar/foobar.c"
st_mtime = 32
contref = 42
contents:
id = 42
data = "/* foobar.c -- a program to make foo do bar */\n..."
generation:
id = 105
file = "/home/liw/foobar/foobar.c", 12765
file = "/home/liw/foobar/README", 32765
...
Each generation consists of a list of filenames and pointers to the object that represents the version of the file in that generation. If a file has not changed from generation to generation, the pointer (and thus the file contents) from the previous generation is reused.
This was pretty simple, but it repeated the entire list of files, with names for each generation. The filenames take a surprising amount of space. Some statistics from my laptop:
Number of files: 401509
Basenames: 6 MiB
Pathnames: 27 MiB
It is ridiculous to store the full list of files (whether basenames or pathnames) for each generation. Even just the basenames will use more than a typical delta between each backup run, for me. This is clearly not acceptable.
After I realized this, I set to fix this by storing only changed filenames. I got this to work, but for various reasons it was very slow, and the complexity of the code made it hard to improve.
Instead of using a pathname as an index to a hashtable, as before, I was now building a duplicate of the filesystem's directory tree in my backup store. Each directory and file was represented by by a backup object, and the generation only held a list of root objects (essentially, the root directory).
When making a new backup, I would carefully do an update from the bottom of the filesystem directory tree upwards, doing copy-on-write updates on any backup objects that had changed since the previous backup. While this is reasonably straightforward to do, it made the code unnecessarily complicated. The code to do backups had to worry about functional updates to trees, which really isn't its business.
The fundamental cause for this misplaced complexity was that the backup store API was using object identifiers as keys, whereas backups (and restores and other operations) really want to handle filenames.
My current approach in the second complete rewrite is to return to pathname based indexing, but keep the copy-on-write behavior. I do not yet know how I will implement this, but I do know I need to keep all the complexity inside the backup store implementation. Right now I am concentrating on finding the best API for the store so that the rest of the program will be easy to write.
It's important that the API be non-tedious to use. There's a lot of room for exploration in backups for what to back up and when, and in which order. There's even further room for exploration in doing stuff with backed up data: verification, FUSE filesystems, etc. If the store API is tedious, it'll be harder to do all those nice things. If it is easy, they'll be that much easier to do.
I have hacked up a first draft of the store API. Before I discuss it, I'll give outlines of how the backup is coded, in pseudo-Python:
def backup(directories):
for each directory:
backup_directory(directory)
def backup_directory(dirname):
for each file directory:
backup_file(filename)
backup_metadata(dirname)
def backup_file(filename):
if file has changed:
backup_file_contents(filename)
backup_metadata(filename)
def backup_file_contents(filename):
for each chunk in file:
if chunk exists in store already:
remember its id
else:
put chunk into store and remember new id
set chunk ids for filename
def backup_metadata(pathname):
read metadata from filesystem
put metadata into store
That's about as straightforward as one can imagine. The store API is starting to emerge (semi-real-Python):
class Store:
def create(self, pathname):
def set_metadata(self, pathname, metadata):
def set_file_chunks(self, pathname, chunkids):
def find_chunk(self, data):
def put_chunk(self, data):
However, this is not quite ready yet. There is, for example, no concept of generations. After some playing around and discussions with Richard Braakman, I've ended up with the following approach.
A new generation is initially created as a clone of the previous generation (or empty, if it is the first generation). The new clone can be modified, in a copy-on-write fashion, and when all changes are done, they can be committed into the store. After that, the generation is immutable, and cannot be changed anymore.
This results in small changes to the main backup routine:
def backup(directories):
start new generation
for each directory:
backup_directory(directory)
commit started generation
And a couple of new methods to the Store class:
def start_generation(self):
def commit_generation(self):
Backups will now work reasonably efficiently, yet the code is simple. The complexity is all nicely hidden in the Store class.
Restoring should also be easy:
def restore():
restore_directory(generation_id, '/')
def restore_directory(genid, dirname):
create target directory on output filesystem
for each item in the directory in the generation in the store:
if it is a directory:
restore_directory(genid, sub-directory name)
else:
restore_file(genid, full pathname to file)
restore target directory metadata
def restore_file(genid, filename):
for each chunk in file:
read chunk
write to output file
restore file metadata
The store API needs a couple of new things:
def listdir(self, genid, dirname):
def get_metadata(self, genid, pathname):
def get_file_chunks(self, genid, filename):
There's a little bit more to it to handle hardlinks, symlinks, and other special cases, but this is basically what the API will now look like.
I have imlemented a proof-of-concept version of the API to allow
me to play with it, and see what the rest of the code would look
like. I am still assuming that using something like the funcational
B-trees in btrfs will be a good way to implement it properly, but
the API is not assuming that, I hope. (The code is slightly
different from the above snippets. If you want to have look at the
actual code, bzr get
http://code.liw.fi/obnam/bzr/rewrite4/ will get you a
copy.)
So far, I am happy with this. There's a whole bunch of questions remaining that I will get to. Right now the thing that worries me most is finding chunks in the backup store: can I implement it efficiently enough that it will be useful. Some version of this will need to be done, so that I can de-duplicate data in the filesystem. For example, if I move a ISO file to a new place and make some small changes to it, it would be disastrous if I had to back it up completely, even though almost all data is already in the backup store.
I am not sure how much effort to put into the de-duplication. It involves trade-offs that may depend on things like available bandwidth and bandwidth caps. It may be necessary to make it configurable: a user with vast amounts of bandwidth and disk space might not care, but someone travelling around the world and relying on hotel Inetnyet connections might care very much.
I'm running an experiment right now to see how much duplicate data there is on my laptop. My approach is to compute a checksum for each 4 kilobyte block at 64 byte intervals and then find duplicate checksums. Since I have quite a bit of data on my laptop, this is a pretty big computation, so it'll be a while before I get results.
It sounds like you are re-implementing the object store on which the Git version control system is built...
PS: your ikiwiki does not allow Unicode in comments.
I've worked on deduping products (network based, not storage). The way they work is to have a hash function which you apply over a range of data. To catch shifts in data (insertions, deletions) you calculate hashes over multiple overlapping ranges. You can also do it for multiple sizes (eg 16kb, 8kb, 1kb). We went right down to 32 bytes! The interesting part is that you discard most of the hashes calculated. For example we discarded any that did not end in some number of trailing zeroes. This makes storing the hashes smaller since you don't need to store the trailing zeroes while still having enough to provide useful deduping.
There is a tradeoff between how many times you calculate a hash over a range of data and the resulting deduplication efficiency. (Network has it somewhat easy as the largest chunk of data at once is the size of an ethernet frame, but then also has real time-ish requirements.) If you design your hash function well you can have multiple output values as you roll through the data. (Note that you do not need to use this hash in the storage system - a strong sha can be used - the point is to find potential duplicates quickly.)
Somewhat surprisingly we also had the benefit of gzip style compression. Data is effectively a list of intertwined hashes and literals and gzip on that output gave about another 50% compression much of the time.