Feed for Planet Debian.

The purpose of a backup is to allow you to recover from a disaster with reasonable cost and effort. If you delete a file you shouldn't have, or make changes that you shouldn't have, backups are meant to save you from having to re-create the file, or undo a large number of steps.

Speaking very broadly, any copy of your live data is a backup, but this is a uselessly broad definition. For example, if you use an automatic synchronisation system, such as Dropbox or git-annex, to keep your live data in sync between two computers, you could pretend they're backups of each other. However, unless the synchronisation also allows you to keep a history of file versions, it's not a very good backup. If you delete your precious file on one computer, and it then gets deleted on the other computer as well, automatically, perhaps in seconds, then the backup is not of much use.

Another common assumption is that a RAID array works as a backup. RAID is an excellent technology that allows you to combine several hard disks so that they protect you against loss of data in case of disk failure. If one disk fails, the others have enough data to re-create the data on the failed disk, using either full copies (RAID-1) or error correction codes (RAID-5, RAID-6). This is not a backup. It doesn't protect you against accidental file deletions. There is also no backup history.

A version control system is very much like a backup. It stores copies of many of the versions of your project. However, in most version control systems it's fairly easy to make changes that lose history. Ask anyone who has used git reset to change the tip of the master branch to undo a wrong commit or merge, and then accidentally force-pushed that to the server. This is arguably a normal, if uncommon, use of the version control system. A good backup system will protect you from your own mistakes, when you do the kinds of things you're expected to do. Version control systems also rarely capture all your data.

When you were five, and made some stuff on the family computer, and saved it on a floppy, and then drew a cute little picture of yourself on the floppy to make it clear to everyone it was your floppy, and not anyone else's, certainly not your bully of a brother's, and your mother kept the floppy for decades because of the cute picture, then that is also not a backup. You didn't even know your Mom had kept it.

A reasonable backup is one from which you can restore a working copy of your data, when you need to, without too much effort or expense, compared to the disaster you're experiencing. If the disaster is that you deleted a one-page draft outline of the book you want to write someday, the disaster is not very severe. The cost of restoring should be low.

If the disaster is that your plans to become the supreme emperor of the world, and make all people your slaves, are in a spreadsheet on your laptop, and your minions accidentally drove a car over your laptop, and you had accidentally not used a Thinkpad as your laptop, the disaster is quite severe. Unless you recover the spreadsheet, you'll never be able to tell apart the buttons to launch the Moon rocket, to self-destruct your HQ, and to switch channels on your TV, and all your work will be in vain, and you'll never, ever, ever convince the pretty girl with red hair living in the house opposite that she should be interested in you. Also, you'll never be able to move away from your parents' house. So, quite severe. It will be acceptable to go to quite some effort and expense to recover that spreadsheet. It's better if you don't need to, but you will, if you have to.

Your backup should also be reasonably up to date. Backing up every Christmas is a fine family tradition, but if you don't make a backup also on Easter, Midsummer, and Aunt Agatha's birthday sometime in September was it, or maybe October, you'll risk losing a whole year's worth of work. A year is a long time, and you might never be able to re-do all the work.

Personally, I back up my personal laptop every day to a file server at home, and less often to an online backup server. My work laptop gets backed up once an hour to the company file server, which gets backed up to two backup servers about once a day.

You need to balance the risk of losing data and work, and the expense and effort to back up your data. How much is a day's work worth to you, or your employer? How much does a backup system cost?

In the next episode, I'll ponder on how many backups are enough.

Posted Mon Jun 17 18:39:00 2013 Tags:

... backups? did someone talk about backups? I'm sure I heard someone mention backups here somewhere. Backups! BACKUPS! BACKUPS ARE AWESOME!

That's a direct quote from my recent IRC history. I find backups quite interesting, particularly from an implementation point of view, and I may sometimes obsess about them a little bit. This is why I've written my own backup software.

I'm unusual: most people find backups boring at best, and tedious most of the time. When I talk with people about backups, the usual reaction is "um, I know I should". There are a lot of reasons for this. One is that backups are a lot like insurance: you have to spend time, effort, money, up front, to have any use for them. Another is that the whole topic is scary: you have to think about when things go wrong, and that puts people off. A third reason is that while there are lots of backup tools and methods, it's not always easy. After all, backups are about answering the question, "what can I do to keep my data safe, whatever happens in the future?".

I've spent a fair bit of the past several years thinking about backups. This is the first in a series of blog posts about backups, where I share my thinking about the topic. Perhaps it can be of some use to others. At least people will poke holes in my delusions.

In this post, I'll define some terminology. I am not a backup scholar, and I may have invented some of these terms myself, and they may be different from what real sysadmins use. I'll define the words I use, so we can understand each other.

Live data is the data you work with or keep. It's the files on your hard drive: the documents you write, the photos you save, the unfinished novels you wish you'd finish.

A backup is a spare copy of your live data. If you lose some or all of your live data, you can get it back ("restore") from your backup. The backup copy is, by practical necessity, older than your live data, but if you made the backup recently enough, you won't lose much.

Sometimes it's useful to have more than one old backup copy of your live data. You can have a sequence of backups, made at different times, giving you a backup history. Each copy of your live data in your backup history is a generation. This lets you retrieve a file you deleted a long time ago, but didn't realise you needed until now. If you only keep one backup version, you can't get it back, but if you keep, say, a daily backup for a month, you have a month to realise you need it, before it's lost forever.
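
To make the idea of generations a bit more concrete, here is a minimal sketch of a retention rule that keeps the newest generation of each day for the last 30 days. The function name generations_to_keep and the 30-day window are assumptions made up for illustration; this is not how any particular backup tool decides what to keep.

```python
from datetime import datetime, timedelta


def generations_to_keep(generations, now, days=30):
    """Return the generations to keep: the newest one from each
    calendar day within the last `days` days.

    `generations` is a list of datetime objects, one per backup run.
    """
    cutoff = now - timedelta(days=days)
    newest_per_day = {}
    for gen in sorted(generations):
        if gen < cutoff:
            continue  # older than the retention window: forget it
        # Later generations on the same day overwrite earlier ones.
        newest_per_day[gen.date()] = gen
    return sorted(newest_per_day.values())
```

Anything not returned by the function would be removed from the backup history, which is exactly the point at which a file deleted long ago is lost forever.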

The place your backups are stored is the backup repository. You can use many kinds of backup media for backup storage: hard drives, tapes, optical disks (DVD-R, DVD-RW, etc), USB flash drives, online storage, etc. Each type of medium has different characteristics: size, speed, convenience, reliability, price, which you'll need to balance for a backup solution that's reasonable for you.

You may need multiple backup repositories or media, with one of them located off-site, away from where your computers normally live. Otherwise, if your house burns down, you'll lose all your backups too.

You need to verify that your backups work. It would be awkward to go to the effort and expense of making backups and then not be able to restore your data when you need to. You may even want to test your disaster recovery by pretending that all your computer stuff is gone, except for the backup media. Can you still recover? You'll want to do this periodically, to make sure your backup system keeps working.
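
One simple way to verify a backup is to restore it to a scratch directory and compare checksums against the live data. The following is a minimal sketch of that idea, assuming both trees contain only regular files; the function names are hypothetical, and real backup tools usually have their own verification commands.

```python
import hashlib
import os


def file_checksum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_checksums(root):
    """Map each file's path relative to `root` to its checksum."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sums[os.path.relpath(full, root)] = file_checksum(full)
    return sums


def verify_restore(live_root, restored_root):
    """Return the sorted relative paths that are missing from the
    restored tree or whose contents differ from the live data."""
    live = tree_checksums(live_root)
    restored = tree_checksums(restored_root)
    return sorted(p for p in live if restored.get(p) != live[p])
```

An empty result means every live file was restored with identical contents. Files that changed after the backup was made will show up as differences too, so the list needs a human to look at it.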

Most live data is precious in that you'll be upset if you lose it. Some live data is not precious: your web browser cache probably isn't, for example. This distinction can let you limit the amount of data you need to back up, which can significantly reduce your backup costs.
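
In practice, the distinction often ends up as a list of exclusion patterns. Here is a minimal sketch, assuming shell-style wildcard patterns; the patterns in NOT_PRECIOUS and the function name are made-up examples, not a recommendation for what to exclude.

```python
import fnmatch

# Hypothetical patterns for data that is not worth backing up.
NOT_PRECIOUS = ["*/.cache/*", "*/Trash/*", "*.tmp"]


def is_precious(path, exclude_patterns=NOT_PRECIOUS):
    """Return True unless `path` matches one of the exclude patterns."""
    return not any(
        fnmatch.fnmatchcase(path, pattern) for pattern in exclude_patterns
    )
```

Note that in fnmatch patterns a * also matches the / character, so "*/.cache/*" matches a cache directory at any depth.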

There is a very large variety of backup tools. They can be very simple and manual: you can copy files to a USB drive using your file manager, once in a blue moon. They can also be very complex: enterprise backup products that cost huge amounts of money and come with a multi-day training package for your sysadmin team, and which require that team to function properly.

You'll need to define a backup strategy to tie everything together: what live data to back up, to what medium, using what tools, what kind of backup history to keep, and how to verify that they work.

That's the groundwork. In the next episode, I'll blather about what is a backup, and what isn't.

Posted Sat Jun 15 10:05:02 2013 Tags:

Almost every bug tracker for free software projects has a section for wishlist bugs. Often this results in ever growing lists of wishes, most of which will never be fulfilled.

A long list of bugs, even if it is wishlist bugs, is rarely useful. It's hard to keep track of what is there, and so the information is not kept up to date when things change. Even if a wishlist bug gets implemented, the bug is often overlooked, and remains on the list.

Joel Spolsky calls this a software inventory. Carrying a large inventory has costs, and usually results in slowed-down development. It increases the friction of doing things in a project.

I have some quite old wishlists for my own projects. I have even more wishlists hidden in my own GTD system. I am not happy about either, and I'm going to have to do something about that.

Just closing all wishlist bugs immediately would be bad manners. Keeping them open indefinitely, just in case someone will some day decide to look through the list in order to hack on something, is wishful thinking: that rarely happens.

So I'm thinking of a compromise: keep wishlist bugs open for a while, perhaps a few months, and if nothing is happening to them, close them with an apology.

Posted Tue May 28 18:01:14 2013 Tags:

There is no common way to express license information in each source file at the moment. Some people embed license information in each file, others keep it in README or another file at the top of the source tree.

Worse, there is no common syntax to express the license information in a machine-parseable way. If we had this, we could have tools that, for example, tell you if you're trying to merge code that has an incompatible license.

Obviously, this kind of thing can never work perfectly. People keep inventing new licenses, and it is not possible for a computer program to fully understand any license. It is not clear humans can do that, either. However, it would be possible to do it to a number of well-known licenses, which would help most of the time. A classic 20/80 situation.

I would like to suggest a syntax, similar to Emacs's "Hey Emacs" modelines, for embedding a summary of the license, or licenses, for a source file, in such a way that it can be programmatically extracted and parsed and analysed. With this syntax, one could then write a tool to ensure that all files in a project have the same license, or that all licenses are compatible with the project's overall license.

Such a tool will, of course, rely on heuristics and assumptions. For example, it needs to assume the machine-parseable license summary is correct, and rely on a ruleset on what licenses are compatible with what licenses. Things can go wrong. That's life. Remember, this is aiming at doing the 20% of the work that will work 80% of the time, not perfection.

I don't have a tool written, but I have a suggestion for the syntax.

/*
 * Copyright 2013 Lars Wirzenius
 *
 * tl;dr =*= Licenses: GPL-3+ or Expat, and Artistic =*=
 *
 * Blah. Blah. Blah. Imagine long, boring license texts 
 * here.
 */

The important part is this:

=*= Licenses: GPL-3+ or Expat, and Artistic =*=

The =*= prefix and suffix and the word Licenses are there to make grepping reasonably reliable without too many false positives, and to allow comment characters and other text on the same line.

The actual license summary follows the syntax and semantics of the Debian copyright-format 1.0 specification, which I chose because it exists and has had a fair bit of review so far, and is reasonably expressive.

The license summaries can be extracted with the following GNU sed invocation:

sed -n '/.*=\*= [Ll]icen[cs]es\?: \(.*\)=\*=.*/s//\1/p'

I allowed various forms of the word license, since it's a word that a lot of people will get wrong, and it's easy to catch all four common forms.
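
For those who would rather not use sed, here is a rough Python equivalent of the extraction, plus the simplest possible check on top of it. This is only a sketch: the function names are hypothetical, and the check compares summaries as plain text without understanding the copyright-format syntax or license compatibility at all.

```python
import re

# Roughly mirrors the sed expression: "=*= Licenses: <summary> =*=",
# accepting license/licence, singular or plural, either capitalisation.
LICENSE_LINE = re.compile(r"=\*= [Ll]icen[cs]es?: (.*)=\*=")


def license_summaries(text):
    """Return the license summary from every marked line in `text`."""
    summaries = []
    for line in text.splitlines():
        match = LICENSE_LINE.search(line)
        if match:
            # Drop the space left between the summary and the suffix.
            summaries.append(match.group(1).strip())
    return summaries


def files_with_other_license(files, expected):
    """Return the names of files whose summaries are not exactly
    [expected]. `files` maps file names to file contents."""
    return sorted(
        name
        for name, text in files.items()
        if license_summaries(text) != [expected]
    )
```

A file without any summary line is reported as well, since a missing summary is probably something the project wants to know about.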

So, does anyone else think this might be useful? Would you use it in your own projects?

Posted Wed Apr 10 17:46:14 2013 Tags:

I've made two new designs for Trunk Tees, my Cafepress store.

Thank you to Richard Braakman for suggesting the .* one.

Here are all the older designs as well:

Posted Sun Mar 24 14:55:23 2013 Tags:

I've just pushed out the release files for Obnam version 1.4, my backup application, and Larch, my B-tree library, which Obnam uses. They are available via my home page (http://liw.fi/). Since Debian is frozen, I am not uploading packages to Debian, but .deb files are available from my personal apt repository for intrepid explorers. (I will be uploading to Debian again after the freeze. I am afraid I'm too lazy to upload to experimental, or do backports. Help is welcome!)

From the Obnam NEWS file:

  • The ls command now takes filenames as (optional) arguments, instead of a list of generations. Based on patch by Damien Couroussé.
  • Even more detailed progress reporting during a backup.
  • Add --fsck-skip-generations option to tell fsck to not check any generation metadata.
  • The default log level is now INFO, instead of DEBUG. This is to be considered a quantum leap in the continuing rise of the maturity level of the software. (Actually, the change is there just to save some disk space and I/O for people who don't want to be involved in Obnam development and don't want to have massive log files.)
  • The default sizes for the lru-size and upload-queue-size settings have been reduced, to reduce the memory impact of Obnam.
  • obnam restore now reports transfer statistics at the end, similarly to what obnam backup does. Suggested by "S. B.".

Bug fixes:

  • If listing extended attributes for a filesystem that does not support them, Obnam no longer crashes, just silently does not back up extended attributes. Which aren't there anyway.
  • A bug in handling stat lookup errors was fixed. Reported by Peter Palfrader. Symptom: AttributeError: 'exceptions.OSError' object has no attribute 'st_ino' in an error message or log file.
  • A bug in a restore crashing when failing to set extended attributes on the restored file was fixed. Reported by "S. B.".
  • Made it clearer what is happening when unlocking the repository due to errors, and fixed it so that a failure to unlock is also an error. Reported by andrewsh.
  • The dependency on Larch is now for 1.20121216 or newer, since that is needed for fsck to work.
  • The manual page did not document the client name arguments to the add-key and remove-key subcommands. Reported by Lars Kruse.
  • Restoring symlinks as root would fail. Reported and fixed by David Fries.
  • Only set ssh user/port if explicitly requested, otherwise let ssh select them. Reported by Michael Goetze, fixed by David Fries.
  • Fix problem with old version of paramiko and chdir. Fixed by Nick Altmann.
  • Fix problems with signed vs unsigned values for struct stat fields. Reported by Henning Verbeek.

Posted Sat Mar 16 19:28:19 2013 Tags:

I've just released, to code.liw.fi, version 1.20130313 of cliapp, my Python framework for Unix-like command line programs. It contains the typical stuff such programs need to do, such as parsing the command line for options, and iterating over input files.

Version 1.20130313

  • Add cliapp.Application.compute_setting_values method. This allows the application to have settings with values that are computed after configuration files and the command line are parsed.
  • Cliapp now logs the Python version at startup, to aid debugging.
  • cliapp.runcmd now logs much less during execution of a command. The verbose logging was useful while developing pipeline support, but has now not been useful for months.
  • More default settings and options have an option group now, making --help output prettier.
  • The --help output and the output of the help subcommand now only list summaries for subcommands. The full documentation for a subcommand can be seen by giving the name of the subcommand to help.
  • Logging setup is now more overrideable. The setup_logging method calls setup_logging_handler_for_syslog or setup_logging_handler_to_file, and the last one calls setup_logging_format and setup_logging_timestamp to create the format strings for messages and timestamps. This allows applications to add, for example, more detailed timestamps easily.
  • The process and system CPU times, and those of the child processes, and the process wall clock duration, are now logged when the memory profiling information is logged.
  • Subcommands added with add_subcommand may now have aliases. Subcommands defined using Application class methods named cmd_* cannot have aliases.
  • Settings and subcommands may now be hidden from --help and help output. New option --help-all and new subcommand help-all show everything.
  • cliapp(5) now explains how --generate-manpage is used. Thanks to Enrico Zini for the suggestion.
  • New function cliapp.ssh_runcmd for executing a command remotely over ssh. The function automatically shell-quotes the argv array given to it so that arguments with spaces and other shell meta-characters work over ssh.
  • New function cliapp.shell_quote quotes strings for passing as shell arguments.
  • cliapp.runcmd now has a new keyword argument: log_error. If set to false, errors are not logged. Defaults to true.
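
Quoting matters for ssh because ssh joins its command arguments with spaces and hands the result to a shell on the remote side, so an argument containing a space would otherwise be split in two. The sketch below illustrates the idea using shlex.quote from the Python standard library as a stand-in; it is not cliapp's implementation, and the function name is hypothetical.

```python
import shlex


def quote_argv_for_ssh(argv):
    """Join argv into one string in which each argument is quoted
    so that the remote shell passes it through unchanged."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

The resulting string would be given to ssh as the remote command, after the host name.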

Bug fixes:

  • The process title is now set only if /proc/self/comm exists. Previously, on the kernel in Debian squeeze (2.6.32), setting the process title would fail, and the error would be logged to the terminal. Reported by William Boughton.
  • A setting may no longer have a default value of None.

Posted Wed Mar 13 22:02:23 2013 Tags:

I've just released version 0.22 of ttystatus, my Python library for showing progress reporting and status updates on terminals, for (Unix) command line programs. Output is automatically adapted to the width of the terminal: truncated if it does not fit, and re-sized if the terminal size changes.
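
The truncation part can be sketched in a few lines with the Python standard library. This is an illustration of the idea only, not ttystatus's own code; the function name is hypothetical.

```python
import shutil


def fit_to_terminal(text, width=None):
    """Truncate `text` so it fits on one terminal line."""
    if width is None:
        # Falls back to 80 columns when the size cannot be determined.
        width = shutil.get_terminal_size(fallback=(80, 24)).columns
    return text[:width]
```

On Unix, a change in terminal size is signalled to the program with SIGWINCH, so a handler for that signal can look up the new width and redraw the line.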

Available on code.liw.fi only, due to the Debian freeze.

Version 0.22, released 2013-03-12

  • When the terminal size changes, ttystatus will now update the display at once.

Posted Wed Mar 13 08:40:05 2013 Tags:

Daniel bought a Trunk Tees "Happiness is a depilated yak" hoodie. See the picture! (Won't embed it, since it's on Google Plush, sorry.)

Posted Tue Mar 12 20:05:26 2013 Tags:

Daniel told me how to set up scroll wheel emulation on a Thinkpad. Presumably it works on other systems as well, but the magic numbers for xinput may be different. I wrote a script to set it up, and configured my GNOME session to run it upon login.

The result: pressing the middle mouse button and moving the trackpoint up/down or left/right results in the applications receiving events as if a scroll wheel had been used.

Very, very handy. Thanks, Daniel.

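#!/bin/sh
# Enable scroll wheel emulation on a Thinkpad TrackPoint via xinput.

set -eu

# Look up the xinput device id of the TrackPoint.
id=$(xinput list | sed -n '/TPPS\/2 IBM TrackPoint/s/.*id=\([0-9]\+\).*/\1/p')

# Look up the numeric ids of the evdev wheel emulation properties.
emu=$(xinput list-props "$id" |
    sed -n '/Evdev Wheel Emulation (/s/.*(\([0-9]\+\)).*/\1/p')
but=$(xinput list-props "$id" |
    sed -n '/Evdev Wheel Emulation Button (/s/.*(\([0-9]\+\)).*/\1/p')
axs=$(xinput list-props "$id" |
    sed -n '/Evdev Wheel Emulation Axes (/s/.*(\([0-9]\+\)).*/\1/p')

# Turn emulation on and bind it to button 2 (the middle button).
xinput set-int-prop "$id" "$emu" 8 1
xinput set-int-prop "$id" "$but" 8 2
# The axis mapping is left commented out.
#xinput set-int-prop "$id" "$axs" 8 6 7 4 5

Posted Mon Mar 11 20:50:51 2013 Tags: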