liw's blog

Photo by Soile Mottisenkangas

Welcome to my web log. See the first post for an introduction. See the archive page for all posts. See also identi.ca.

Cable management while travelling

We're travelling and we have several electronic devices with us. This means we have many cables. Cables are difficult enough at home, but especially so while travelling.

My current best approach is to put each cable in a small clear plastic bag (zip lock bag, I think they're called). This prevents the cables from getting entangled, but there are so many of them that it's still hard to keep them in order.

I wonder if it would be possible to develop a better solution? My best idea so far is a long piece of fabric with pieces of velcro sewn into it. The velcro would be located so that it would be possible to neatly put each cable in place and then roll the whole piece of fabric into a neat roll.

When it is time to get a cable, one would unroll the fabric, and then unfasten one or two pieces of velcro to get the cable.

Perhaps pockets instead of velcro?

Anyone have better ideas? Anyone have an actual solution? I live in mortal dread of waking up one morning and learning that my cables have started to breed, and have decided to overthrow their master, and have strangled me to death while I slept.

Computer driving licenses

Various countries have a training programme called the "computer driving license". The training aims to give basic computer skills (word processing, spreadsheets, the web, etc). It's good for people unsure of their skills, but I object to the name.

I think it's worrying that it's called a license of any kind, since that implies that there is an entity whose permission people need to use a computer. Licenses to own and operate copying machines or typewriters have existed, and it's always a sign of political oppression. It's just a word, but words have power, or at least they give leverage to those in power.

Obnam storage API

The central data structure in Obnam is the way it stores backed up data on disk. This is the area I have struggled with most in the four years I've been sporadically developing Obnam.

My initial attempt was roughly this: everything was put in the backup store as a sort of object, which I'll call backup object. This included file contents, deltas between versions of a file, file metadata, and filenames. While the representation was quite different, essentially each of these objects was a list of key-value pairs:

file:
    id = 12765
    basename = "/home/liw/foobar/foobar.c"
    st_mtime = 32
    contref = 42

contents:
    id = 42
    data = "/* foobar.c -- a program to make foo do bar */\n..."

generation:
    id = 105
    file = "/home/liw/foobar/foobar.c", 12765
    file = "/home/liw/foobar/README", 32765
    ...

Each generation consists of a list of filenames and pointers to the object that represents the version of the file in that generation. If a file has not changed from generation to generation, the pointer (and thus the file contents) from the previous generation is reused.
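A minimal sketch of that first design, as an in-memory model only. The `store` dictionary and the `put` helper are made up for illustration and are not Obnam's real representation; the ids and field names come from the listing above.

```python
# Hypothetical in-memory model of the first design; names are illustrative only.
store = {}  # object id -> backup object (a dict of key-value pairs)

def put(obj_id, kind, **fields):
    """Store one backup object under its id and return the id."""
    store[obj_id] = {"kind": kind, "id": obj_id, **fields}
    return obj_id

put(42, "contents", data="/* foobar.c -- a program to make foo do bar */\n")
put(12765, "file", basename="/home/liw/foobar/foobar.c", st_mtime=32, contref=42)

# Generation 105 lists every pathname with a pointer to its file object.
put(105, "generation", files={"/home/liw/foobar/foobar.c": 12765})

# The next generation reuses the pointer for an unchanged file, so the
# contents are not stored again, but the full pathname is still repeated.
put(106, "generation", files=dict(store[105]["files"]))
```

Both generations point at file object 12765, so the file contents are shared, while the pathname string appears once per generation.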

This was pretty simple, but it repeated the entire list of files, with names for each generation. The filenames take a surprising amount of space. Some statistics from my laptop:

Number of files: 401509
Basenames: 6 MiB
Pathnames: 27 MiB

It is ridiculous to store the full list of files (whether basenames or pathnames) for each generation. Even just the basenames will use more than a typical delta between each backup run, for me. This is clearly not acceptable.
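To put rough numbers on that, here is a small back-of-the-envelope calculation. The file count and sizes are the rounded figures quoted above; the count of 30 generations is an assumption picked only for illustration.

```python
MIB = 1024 * 1024
files = 401509
basename_bytes = 6 * MIB    # rounded figure from the statistics above
pathname_bytes = 27 * MIB   # rounded figure from the statistics above

avg_basename = basename_bytes / files   # average bytes per basename
avg_pathname = pathname_bytes / files   # average bytes per pathname

generations = 30  # assumed: one backup a day for a month
total_pathname_mib = generations * pathname_bytes / MIB

print(f"average basename: {avg_basename:.1f} bytes")
print(f"average pathname: {avg_pathname:.1f} bytes")
print(f"pathnames for {generations} generations: {total_pathname_mib:.0f} MiB")
```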

After I realized this, I set out to fix this by storing only changed filenames. I got this to work, but for various reasons it was very slow, and the complexity of the code made it hard to improve.

Instead of using a pathname as an index to a hashtable, as before, I was now building a duplicate of the filesystem's directory tree in my backup store. Each directory and file was represented by a backup object, and the generation only held a list of root objects (essentially, the root directory).

When making a new backup, I would carefully do an update from the bottom of the filesystem directory tree upwards, doing copy-on-write updates on any backup objects that had changed since the previous backup. While this is reasonably straightforward to do, it made the code unnecessarily complicated. The code to do backups had to worry about functional updates to trees, which really isn't its business.
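To show what a functional (copy-on-write) update of a tree involves, here is a minimal sketch using nested dictionaries as directories. The `cow_set` function and the dictionary layout are assumptions for illustration, not the code Obnam used.

```python
def cow_set(tree, path_parts, value):
    """Return a new tree with value stored at path_parts; tree is not modified."""
    if not path_parts:
        return value
    head, rest = path_parts[0], path_parts[1:]
    new_tree = dict(tree)  # shallow copy: unchanged siblings are shared
    child = tree.get(head, {})
    if not isinstance(child, dict):
        child = {}  # a file is being replaced by a directory
    new_tree[head] = cow_set(child, rest, value)
    return new_tree

gen1 = {"home": {"liw": {"a.txt": "v1", "b.txt": "v1"}}, "etc": {"motd": "v1"}}
gen2 = cow_set(gen1, ["home", "liw", "a.txt"], "v2")
```

Only the directories on the path from the root to the changed file are copied; everything else (here, the whole `etc` subtree) is shared between the two generations. Even in this toy form, the caller has to think in terms of tree paths and new roots, which is the kind of bookkeeping the backup code should not need to do.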

The fundamental cause for this misplaced complexity was that the backup store API was using object identifiers as keys, whereas backups (and restores and other operations) really want to handle filenames.

My current approach in the second complete rewrite is to return to pathname based indexing, but keep the copy-on-write behavior. I do not yet know how I will implement this, but I do know I need to keep all the complexity inside the backup store implementation. Right now I am concentrating on finding the best API for the store so that the rest of the program will be easy to write.

It's important that the API be non-tedious to use. There's a lot of room for exploration in backups for what to back up and when, and in which order. There's even further room for exploration in doing stuff with backed up data: verification, FUSE filesystems, etc. If the store API is tedious, it'll be harder to do all those nice things. If it is easy, they'll be that much easier to do.

I have hacked up a first draft of the store API. Before I discuss it, I'll give outlines of how the backup is coded, in pseudo-Python:

def backup(directories):
    for each directory:
        backup_directory(directory)

def backup_directory(dirname):
    for each file in directory:
        backup_file(filename)
    backup_metadata(dirname)

def backup_file(filename):
    if file has changed:
        backup_file_contents(filename)
        backup_metadata(filename)

def backup_file_contents(filename):
    for each chunk in file:
        if chunk exists in store already:
            remember its id
        else:
            put chunk into store and remember new id
    set chunk ids for filename

def backup_metadata(pathname):
    read metadata from filesystem
    put metadata into store

That's about as straightforward as one can imagine. The store API is starting to emerge (semi-real-Python):

class Store:

    def create(self, pathname):
    def set_metadata(self, pathname, metadata):
    def set_file_chunks(self, pathname, chunkids):
    def find_chunk(self, data):
    def put_chunk(self, data):
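
As a way to try the draft API out, here is a toy in-memory stand-in. Everything about the implementation is an assumption: the `MemoryStore` name, the dictionaries, the integer chunk ids, and the use of SHA-256 to look up chunks are chosen only to make the sketch runnable.

```python
import hashlib

class MemoryStore:
    """Toy in-memory stand-in for the Store API; not Obnam's real code."""

    def __init__(self):
        self.files = {}        # pathname -> {"metadata": ..., "chunkids": [...]}
        self.chunks = {}       # chunk id -> bytes
        self.by_checksum = {}  # SHA-256 hex digest -> chunk id

    def create(self, pathname):
        self.files.setdefault(pathname, {"metadata": None, "chunkids": []})

    def set_metadata(self, pathname, metadata):
        self.create(pathname)
        self.files[pathname]["metadata"] = metadata

    def set_file_chunks(self, pathname, chunkids):
        self.create(pathname)
        self.files[pathname]["chunkids"] = list(chunkids)

    def find_chunk(self, data):
        """Return the id of an identical stored chunk, or None."""
        return self.by_checksum.get(hashlib.sha256(data).hexdigest())

    def put_chunk(self, data):
        chunkid = len(self.chunks) + 1
        self.chunks[chunkid] = data
        self.by_checksum[hashlib.sha256(data).hexdigest()] = chunkid
        return chunkid

def backup_file_contents(store, filename, chunks):
    """Store each chunk once and record the chunk ids for filename."""
    chunkids = []
    for chunk in chunks:
        chunkid = store.find_chunk(chunk)
        if chunkid is None:
            chunkid = store.put_chunk(chunk)
        chunkids.append(chunkid)
    store.set_file_chunks(filename, chunkids)

store = MemoryStore()
backup_file_contents(store, "/home/liw/foobar/foobar.c", [b"x", b"y", b"x"])
```

The repeated chunk `b"x"` is stored only once, and the caller never sees anything but pathnames and chunk ids.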

However, this is not quite ready yet. There is, for example, no concept of generations. After some playing around and discussions with Richard Braakman, I've ended up with the following approach.

A new generation is initially created as a clone of the previous generation (or empty, if it is the first generation). The new clone can be modified, in a copy-on-write fashion, and when all changes are done, they can be committed into the store. After that, the generation is immutable, and cannot be changed anymore.

This results in small changes to the main backup routine:

def backup(directories):
    start new generation
    for each directory:
        backup_directory(directory)
    commit started generation

And a couple of new methods to the Store class:

def start_generation(self):
def commit_generation(self):

Backups will now work reasonably efficiently, yet the code is simple. The complexity is all nicely hidden in the Store class.
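
A minimal sketch of the clone, modify, and commit cycle, assuming a generation is just a mapping from pathname to metadata. The `GenerationStore` class and the integer generation ids are made up for illustration. The clone here copies the whole index, which is exactly what a real store should avoid by sharing structure.

```python
from types import MappingProxyType

class GenerationStore:
    """Toy store: each generation maps pathname -> metadata."""

    def __init__(self):
        self.generations = []  # committed generations, oldest first
        self.current = None    # the generation being built, if any

    def start_generation(self):
        previous = self.generations[-1] if self.generations else {}
        # Clone the previous generation. The values are shared, not copied,
        # but the index itself is duplicated in this toy version.
        self.current = dict(previous)

    def set_metadata(self, pathname, metadata):
        if self.current is None:
            raise RuntimeError("no generation has been started")
        self.current[pathname] = metadata

    def commit_generation(self):
        if self.current is None:
            raise RuntimeError("no generation has been started")
        # A read-only view makes the committed generation immutable.
        self.generations.append(MappingProxyType(self.current))
        self.current = None
        return len(self.generations) - 1  # generation id

    def get_metadata(self, genid, pathname):
        return self.generations[genid][pathname]

store = GenerationStore()
store.start_generation()
store.set_metadata("/home/liw/README", {"st_mtime": 32})
first = store.commit_generation()

store.start_generation()
store.set_metadata("/home/liw/foobar.c", {"st_mtime": 64})
second = store.commit_generation()
```

The second generation sees both files, the first generation still sees only one, and neither can be modified after the commit.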

Restoring should also be easy:

def restore():
    restore_directory(generation_id, '/')

def restore_directory(genid, dirname):
    create target directory on output filesystem
    for each item in the directory in the generation in the store:
        if it is a directory:
            restore_directory(genid, sub-directory name)
        else:
            restore_file(genid, full pathname to file)
    restore target directory metadata

def restore_file(genid, filename):
    for each chunk in file:
        read chunk
        write to output file
    restore file metadata

The store API needs a couple of new things:

def listdir(self, genid, dirname):
def get_metadata(self, genid, pathname):
def get_file_chunks(self, genid, filename):

There's a little bit more to it to handle hardlinks, symlinks, and other special cases, but this is basically what the API will now look like.
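
Here is a runnable sketch of the restore outline against a toy store holding a single generation. The `ToyStore` class is an assumption, and its `is_dir` and `get_chunk` methods are hypothetical helpers that are not part of the API listed above. Metadata restoring is left out to keep the sketch short.

```python
import os

class ToyStore:
    """Hypothetical read-only store holding one generation in memory."""

    def __init__(self, files, chunks):
        self.files = files    # pathname -> list of chunk ids
        self.chunks = chunks  # chunk id -> bytes

    def listdir(self, genid, dirname):
        prefix = dirname.rstrip("/") + "/"
        names = set()
        for pathname in self.files:
            if pathname.startswith(prefix):
                names.add(pathname[len(prefix):].split("/")[0])
        return sorted(names)

    def is_dir(self, genid, pathname):
        # Hypothetical helper: anything that is not a known file is a directory.
        return pathname not in self.files

    def get_file_chunks(self, genid, filename):
        return self.files[filename]

    def get_chunk(self, chunkid):
        # Hypothetical helper for reading chunk data back.
        return self.chunks[chunkid]

def restore_directory(store, genid, dirname, target):
    os.makedirs(os.path.join(target, dirname.lstrip("/")), exist_ok=True)
    for name in store.listdir(genid, dirname):
        pathname = dirname.rstrip("/") + "/" + name
        if store.is_dir(genid, pathname):
            restore_directory(store, genid, pathname, target)
        else:
            restore_file(store, genid, pathname, target)

def restore_file(store, genid, filename, target):
    with open(os.path.join(target, filename.lstrip("/")), "wb") as f:
        for chunkid in store.get_file_chunks(genid, filename):
            f.write(store.get_chunk(chunkid))
```

Calling `restore_directory(store, genid, "/", "/var/tmp/liw.restore")` would recreate the stored tree under the target directory.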

I have implemented a proof-of-concept version of the API to allow me to play with it, and see what the rest of the code would look like. I am still assuming that using something like the functional B-trees in btrfs will be a good way to implement it properly, but the API is not assuming that, I hope. (The code is slightly different from the above snippets. If you want to have a look at the actual code, bzr get http://code.liw.fi/obnam/bzr/rewrite4/ will get you a copy.)

So far, I am happy with this. There's a whole bunch of questions remaining that I will get to. Right now the thing that worries me most is finding chunks in the backup store: can I implement it efficiently enough that it will be useful? Some version of this will need to be done, so that I can de-duplicate data in the filesystem. For example, if I move an ISO file to a new place and make some small changes to it, it would be disastrous if I had to back it up completely, even though almost all data is already in the backup store.

I am not sure how much effort to put into the de-duplication. It involves trade-offs that may depend on things like available bandwidth and bandwidth caps. It may be necessary to make it configurable: a user with vast amounts of bandwidth and disk space might not care, but someone travelling around the world and relying on hotel Internet connections might care very much.

I'm running an experiment right now to see how much duplicate data there is on my laptop. My approach is to compute a checksum for each 4 kilobyte block at 64 byte intervals and then find duplicate checksums. Since I have quite a bit of data on my laptop, this is a pretty big computation, so it'll be a while before I get results.
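
A minimal sketch of that kind of experiment on a single byte string. The choice of Adler-32 from `zlib` and the function names are assumptions, and each block is checksummed from scratch; a rolling checksum would be much faster for real data.

```python
import zlib
from collections import defaultdict

BLOCK_SIZE = 4096  # 4 kilobyte blocks
STEP = 64          # a block starts every 64 bytes

def block_checksums(data, block_size=BLOCK_SIZE, step=STEP):
    """Yield (offset, checksum) for each full block starting at a multiple of step."""
    for offset in range(0, len(data) - block_size + 1, step):
        yield offset, zlib.adler32(data[offset:offset + block_size])

def duplicate_blocks(data, block_size=BLOCK_SIZE, step=STEP):
    """Map checksum -> offsets, for checksums seen at more than one offset."""
    seen = defaultdict(list)
    for offset, checksum in block_checksums(data, block_size, step):
        seen[checksum].append(offset)
    return {c: offsets for c, offsets in seen.items() if len(offsets) > 1}
```

Matching checksums are only candidates: a weak checksum can collide, so a real run would confirm each match by comparing the bytes or by using a stronger hash.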

LCA, rest of the week

Tuesday: Gabriella Coleman's keynote about the origins and impact of the free software and hacker communities on the rest of the world was wonderful. Missed other talks, feeling very "shell shocked" and maybe culture shocked, and not really wanting to talk to people or hear people talk. Did see Blackheath's Haskell talk, which was a basic overview of Haskell features.

Wednesday: Mako Hill's keynote was very inspiring. Concepts of autonomy and anti-features are good. Matthew Garrett's "Making yourself popular" talk was good, though perhaps a bit shallow. JobsBOF was a washout for me, nothing interesting there. Roger Fenwick's "World's worst inventions" was funny.

fenwick.jpg

(The above photo is rather bad. Sorry. I did not feel like carrying around a real camera so I made do with the phone's.)

Thursday: Glyn Moody's keynote was quite exceptionally good. He's one of my favorite two IT journalists. (The other one, Jon Corbet, was also at LCA, though I missed his talk and failed to talk to him.) Skipped the rest of the conference day, as Soile and I went and opened bank accounts and shopped for a car.

Friday: Lightning talks were ok, it's a good concept. Photo management BOF not too exciting, but interesting to hear that most people think tagging is too much work to be practical. I might want to make Dimbola be really good at that. Martin Krafft was late for the DebianBOF so I chaired/secretaried it. Lots of discussions, I almost felt it was my crowd still.

Penguin dinner in the evening was a disappointment from my point of view. Too many people, too much noise, I did not hear much and was slightly miserable. I should learn some day that I do not thrive in noisy crowds. I did, however, draw a penguin on my phone while there.

penguin.png

Saturday: LCA Open Day, talked to a bunch of people from companies, handed out my business card. A company called Lucid is doing a backup program called LBackup, free software, I might want to collaborate with them, given my continued interest in Obnam.

reprap.jpg

Also, saw a RepRap. Stunningly cool. A glimpse of the future.

LCA2010 Monday

I'm at the LCA2010 conference in Wellington. Today was the first day, with miniconfs. A few notes:

  • Stephen Blackheath: Haskell, and all the wonderful things it doesn’t let you do. An overview talk of what Haskell is all about. I really need to get back to reading the Real World Haskell book.
  • Kate Stewart: Sharing Package Copyright and Licensing Data Effectively. An overview of the dilemma a distributor of free software stuff faces: copyright and license info has no markup language, and indeed is often out of date, which causes some legal risk. Fossbazaar.org and others are trying to come up with a format that everyone can use and that hopefully most people from upstreams to Linux distros to others will adopt. DEP5 was mentioned.
  • Lana Brindley: Creating Beautiful Documentation. The time slot had been shortened, but good stuff anyway. While I haven't personally done much documentation writing since leaving the Linux Documentation Project in 1997, apart from a manual page every now and then, I agree with Brindley that good documentation is an important factor in a successful project. Tech writers and graphical artists are sorely needed, as is shaping projects so that coders are no longer kings.
  • Scott James Remnant: Cutting down boot times. Missed this talk, but that's OK, Scott seems to have missed it, too, due to travel.
  • Carl Worth: Cairo Graphics - Intro and Future thoughts. Another overview talk. I know very little about Cairo, but at least I now know where it stands in the stack. I should perhaps look into using it for Dimbola. If only I knew any graphics programming.

The conference venue works well, except for occasional wireless problems.

Attempted to see how long my X200s battery actually lasts, and I managed to get through the day without charging. When I left the hotel, the battery was fully charged, and when I came back, there was an estimated 15 minutes left. However, I didn't use the laptop all the time, and I can't figure out from the GNOME Power Manager how much battery time I've actually used up today. The history dialog is entirely incomprehensible to me.

One thing that happens in conferences, including this LCA, is that people realize they've forgotten a cable or a charger or something, and someone else lends it to them. There's a bit of a shuffle for the lender and borrower to meet. I wonder if it would be too big a hassle for the organizers to set up a "post office": the lender would bring the cable, or whatever, put it in a bag, put their own name and the borrower's name on the bag, and then give it to the reception people to keep. The borrower could then fetch it from the reception whenever is suitable. Maybe this would be too much work and responsibility for the organizers, who are overworked as it is.

The weather is pretty nice. Some rain occasionally, but lots of sunshine, too. Pretty warm. People are very friendly.

Collaborative storytelling with audience voting

As I'm reading Cory Doctorow's Makers novel, I can't help wondering whether it might be possible to write a novel collaboratively. Each participant would write a paragraph per day, and readers could vote paragraphs up and down. It might be interesting to see if a coherent story would eventually emerge.

Obnam command line interface

I have some specific ideas for the command line interface I'm planning for my backup program. I'll be writing a man page for obnam, but before I do that, here's a sketch.

  • obnam backup --store sftp://example.com/~/backups/ $HOME
  • obnam ls --generation latest
  • obnam verify
  • obnam fsck
  • obnam restore --generation latest --to /var/tmp/liw.restore
  • obnam forget --keep 1h:7d:5w:12m:99y

The backup command should be obvious. I'll make a configuration file so the location of the backup store can be specified there, rather than every time on the command line. Also other arguments, such as the directories to back up.

The ls command lists the contents of a backup generation.

The verify command compares what has been backed up with what is on the hard disk now, reporting differences. If you back up and then immediately verify, you can check that everything got backed up. Verify will also be able to do things like compare randomly selected files (rather than all of them). I am not yet sure exactly how the verification process should happen to make things trustable.

fsck checks that the internal data structures in the backup store are OK.

restore restores.

forget removes old backup generations. It will be able to remove specific generations, or apply a policy such as "keep one hourly, seven daily, five weekly, twelve monthly, and lots of yearly generations". It will be cheap to keep lots of generations, since obnam will do heavy de-duplication, at the block level.
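
One possible reading of the `--keep` policy, sketched in Python. The assumptions here are mine: that the letters h, d, w, m and y stand for hourly, daily, weekly, monthly and yearly, and that "keep N" means keeping the newest generation in each of the N most recent periods that have a backup. The function names are made up for illustration.

```python
import re
from datetime import datetime

UNITS = {"h": "hourly", "d": "daily", "w": "weekly", "m": "monthly", "y": "yearly"}

# How to name the period a timestamp falls into, for each kind of period.
PERIOD_KEYS = {
    "hourly": lambda t: (t.year, t.month, t.day, t.hour),
    "daily": lambda t: (t.year, t.month, t.day),
    "weekly": lambda t: tuple(t.isocalendar())[:2],  # ISO year and week
    "monthly": lambda t: (t.year, t.month),
    "yearly": lambda t: (t.year,),
}

def parse_keep_policy(spec):
    """Turn a string like '1h:7d:5w:12m:99y' into {'hourly': 1, 'daily': 7, ...}."""
    policy = {}
    for item in spec.split(":"):
        match = re.fullmatch(r"(\d+)([hdwmy])", item)
        if match is None:
            raise ValueError(f"bad policy item: {item!r}")
        count, unit = match.groups()
        policy[UNITS[unit]] = int(count)
    return policy

def generations_to_keep(timestamps, policy):
    """Return the set of generation timestamps the policy wants to keep."""
    keep = set()
    for period, count in policy.items():
        newest = {}  # period key -> newest timestamp in that period
        for ts in sorted(timestamps):
            newest[PERIOD_KEYS[period](ts)] = ts
        if count > 0:
            for key in sorted(newest)[-count:]:
                keep.add(newest[key])
    return keep
```

With this reading, `forget` would remove every generation whose timestamp is not in the returned set.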

Obnam, or once more a backup program

I've decided to resurrect development of my backup program, Obnam. This time I thought I'd babble about it in public as I develop it, rather than try to present the world with a finished product.

I have not been happy with any backup solution I've tried. I have some fairly specific requirements:

  • Backups must be stored either on a local hard disk, or online. I don't care at all about tapes, optical media, or anything else that requires repetitive manual work.
  • Server end must be under my control as well. No Amazon S3 for me.
  • Both push and pull backups.
  • Backups must be encrypted at client end.
  • Backups must be incremental, but each generation must look like a full snapshot.
  • Backups must use checkpoints: network connections break, and if they do, the next backup must continue from most recent checkpoint.
  • Setup must be easy. Backups are important, but if they're any kind of pain at all, I and most others will just postpone them to a future day and one day it will be too late.
  • Fast. If I do some e-mail and write some code while drinking a smoothie in a net cafe, by the time I finish the drink and put away the laptop the backup must be finished.
  • Deals sensibly both with slow and fast networks. An incremental backup should not download any data from server, and should only upload the delta from the previous backup, plus minimal overhead.
  • Reliable. Backups should not require attention. I should just be allowed to assume they work. This also requires unobtrusive feedback that they're OK, and proper error reporting when something is wrong and does require my attention.

It's been a while since I did a proper survey, so things may have changed since, but so far, I've never found a system that I like. If you know of one, please don't tell me. I am now deep into thinking about the technical problems I will need to solve, and not that interested in finding an existing solution anymore.

If "hubris" was spelled with an i, it would be my middle name.

I have some code sketched out, but nothing that does anything useful yet. I've been playing with the internal architecture, and the interface and abstraction I will want for the "storage subsystem" that stores the backed up data. I have not decided yet how to implement the storage subsystem, but btrfs B-trees interest me a lot.

On free software parenting

I believe Jeff Atwood is fundamentally wrong in his recent blog article Responsible Open Source Code Parenting.

In his article, Atwood bases all criticism of John Gruber's behavior with regards to Markdown on this premise:

As Markdown's "parent", John has a few key responsibilities in shepherding his baby to maturity. Namely, to lead. To set direction.

When someone releases some free software, they have no obligation whatsoever to do anything with or for it again. No legal obligations, and no moral ones. Unless there is some kind of explicit contract the author is free to forget anything ever happened.

It's obviously nice if the author assumes responsibility for further development and leadership and whatever, but it has to happen voluntarily or be compensated.

This is an important difference from the proprietary world Atwood is more familiar with. With proprietary software, the user is pretty much always a customer, and a customer has rights, one of them being that the software works and problems get fixed. With free software, the user is a receiver of a surprise gift.

Now, as far as the Markdown situation is concerned, the facts seem to be that Gruber does not develop the specification or reference implementation, but other people would like things to improve.

In the free software world, the best thing to do in this scenario is to gather the people who want to make improvements, have them collaborate and take over development, thank Gruber, and make the world a better place. Or, vigorously waving a spatula, fork, fork, fork.

There's a whole bunch of people using and relying on Markdown now. Atwood's Stack Overflow site is one of the prominent ones. There are implementations for many programming languages. There are other sites using Markdown. All of these people could (I am reluctant to say should) start a "Markdown Foundation", so to speak, and get to work.

bzr OO.o support?

In re Brian Kuhn's OO.o and version control troubles.

It'd be cool to extend bzr with some plugins to allow it to handle or show differences for specific file types in ways that are useful for those. For example, OpenOffice.org Writer files are binary blobs, and it's difficult to diff them at the moment. However, it would be possible to write something that extracts the text from the blob, and shows something like a wdiff.

This could be extended to all sorts of file formats. An image diff might generate a new image that colors changed parts with red. A source code diff might understand the language and work at the level of semantic language elements.
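
As a rough sketch of the text-extraction idea for Writer files: an OpenDocument text file is a zip archive whose `content.xml` holds the document text, so the paragraphs can be pulled out and compared word by word. The function names are made up, the output only imitates wdiff's markers, and hooking this into bzr as a plugin is not shown.

```python
import difflib
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def odt_paragraphs(path):
    """Return the text of each paragraph (text:p element) in an .odt file."""
    with zipfile.ZipFile(path) as odt:
        root = ET.fromstring(odt.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]

def word_diff(old_text, new_text):
    """Return wdiff-like output with [-removed-] and {+added+} markers."""
    old, new = old_text.split(), new_text.split()
    out = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(" ".join(old[i1:i2]))
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(old[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new[j1:j2]) + "+}")
    return " ".join(out)
```

Comparing two versions would then be something like `word_diff("\n".join(odt_paragraphs(old)), "\n".join(odt_paragraphs(new)))`. Headings, tables and formatting changes are ignored in this sketch.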