Feed for Planet Debian.
The purpose of a backup is to allow you to recover from a disaster with reasonable cost and effort. If you delete a file you shouldn't have, or make changes that you shouldn't have, backups are meant to save you from having to re-create the file, or undo a large number of steps.
Speaking very broadly, any copy of your live data is a backup, but this is a uselessly broad definition. For example, if you use an automatic synchronisation system, such as Dropbox or git-annex, to keep your live data in sync between two computers, you could pretend they're backups of each other. However, unless the synchronisation also allows you to keep a history of file versions, it's not a very good backup. If you delete your precious file on one computer, and it then gets deleted on the other computer as well, automatically, perhaps in seconds, then the backup is not of much use.
Another common assumption is that a RAID array works as a backup. RAID is an excellent technology that allows you to combine several hard disks so that they protect you against loss of data in case of disk failure. If one disk fails, the others have enough data to re-create the data on the failed disk, using either full copies (RAID-1) or error correction codes (RAID-5, RAID-6). This is not a backup. It doesn't protect you against accidental file deletions. There is also no backup history.
A version control system is very much like a backup. It stores copies
of many of the versions of your project. However, in most version
control systems it's fairly easy to make changes that lose history. Ask
anyone who has used git reset to change the tip of the master branch
to undo a wrong commit or merge, and then accidentally force-pushed
that to the server. This is arguably a normal, if uncommon, use of the
version control system. A good backup system will protect you from your
own mistakes, when you do the kinds of things you're expected to do.
Version control systems also rarely capture all your data.
When you were five, and made some stuff on the family computer, and saved it on a floppy, and then drew a cute little picture of yourself on the floppy to make it clear to everyone it was your floppy, and not anyone else's, certainly not your bully of a brother's, and your mother kept the floppy for decades because of the cute picture, then that is also not a backup. You didn't even know your Mom had kept it.
A reasonable backup is one from which you can restore a working copy of your data, when you need to, without too much effort or expense, compared to the disaster you're experiencing. If the disaster is that you deleted a one-page draft outline of the book you want to write someday, the disaster is not very severe. The cost of restoring should be low.
If the disaster is that your plans to become the supreme emperor of the world, and make all people your slaves, are in a spreadsheet on your laptop, and your minions accidentally drove a car over your laptop, and you had accidentally not used a Thinkpad as your laptop, the disaster is quite severe. Unless you recover the spreadsheet, you'll never be able to tell apart the buttons to launch the Moon rocket, to self-destruct your HQ, and to switch channels on your TV, and all your work will be in vain, and you'll never, ever, ever convince the pretty girl with red hair living in the house opposite that she should be interested in you. Also, you'll never be able to move away from your parents' house. So, quite severe. It will be acceptable to go to quite some effort and expense to recover that spreadsheet. It's better if you don't need to, but you will, if you have to.
Your backup should also be reasonably up to date. Backing up every Christmas is a fine family tradition, but if you don't also make a backup on Easter, Midsummer, and Aunt Agatha's birthday (sometime in September, was it, or maybe October?), you'll risk losing a whole year's worth of work. A year is a long time, and you might never be able to re-do all the work.
Personally, I back up my personal laptop every day to a file server at home, and less often to an online backup server. My work laptop gets backed up once an hour to the company file server, which gets backed up to two backup servers about once a day.
You need to balance the risk of losing data and work, and the expense and effort to back up your data. How much is a day's work worth to you, or your employer? How much does a backup system cost?
In the next episode, I'll ponder on how many backups are enough.
... backups? did someone talk about backups? I'm sure I heard someone mention backups here somewhere. Backups! BACKUPS! BACKUPS ARE AWESOME!
That's a direct quote from my recent IRC history. I find backups quite interesting, particularly from an implementation point of view, and I may sometimes obsess about them a little bit. This is why I've written my own backup software.
I'm unusual: most people find backups boring at best, and tedious most of the time. When I talk with people about backups, the usual reaction is "um, I know I should". There are a lot of reasons for this. One is that backups are a lot like insurance: you have to spend time, effort, money, up front, to have any use for them. Another is that the whole topic is scary: you have to think about when things go wrong, and that puts people off. A third reason is that while there are lots of backup tools and methods, it's not always easy. After all, backups are about answering the question, "what can I do to keep my data safe, whatever happens in the future?".
I've spent a fair bit of the past several years thinking about backups. This is the first in a series of blog posts about backups, where I share my thinking about the topic. Perhaps it can be of some use to others. At least people will poke holes in my delusions.
In this post, I'll define some terminology. I am not a backup scholar, and I may have invented some of these terms myself, and they may be different from what real sysadmins use. I'll define the words I use, so we can understand each other.
Live data is the data you work with or keep. It's the files on your hard drive: the documents you write, the photos you save, the unfinished novels you wish you'd finish.
A backup is a spare copy of your live data. If you lose some or all of your live data, you can get it back ("restore") from your backup. The backup copy is, by practical necessity, older than your live data, but if you made the backup recently enough, you won't lose much.
Sometimes it's useful to have more than one old backup copy of your live data. You can have a sequence of backups, made at different times, giving you a backup history. Each copy of your live data in your backup history is a generation. This lets you retrieve a file you deleted a long time ago, but didn't realise you needed until now. If you only keep one backup version, you can't get it back, but if you keep, say, a daily backup for a month, you have a month to realise you need it, before it's lost forever.
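The idea of a backup history can be sketched in a few lines of Python. This is only an illustration, assuming each generation is identified by the time it was made and that generations older than a fixed number of days are forgotten; the function names and the 30-day default are made up for the example and do not come from any particular backup tool.

```python
from datetime import datetime, timedelta


def split_generations(generations, now, keep_days=30):
    """Split backup generations into those to keep and those to forget.

    generations is an iterable of datetime objects, one per backup run.
    A generation is kept if it is at most keep_days old (an assumed policy).
    """
    cutoff = now - timedelta(days=keep_days)
    keep = sorted(g for g in generations if g >= cutoff)
    forget = sorted(g for g in generations if g < cutoff)
    return keep, forget


def newest_generation_before(generations, moment):
    """Return the newest generation made at or before moment, or None.

    This is the generation to restore from if a file was deleted
    just after moment.
    """
    candidates = [g for g in generations if g <= moment]
    return max(candidates) if candidates else None
```

With daily backups and a 30-day window, a file deleted 40 days ago has no generation left to restore from, which is the "lost forever" case described above.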
The place your backups are stored is the backup repository. You can use many kinds of backup media for backup storage: hard drives, tapes, optical disks (DVD-R, DVD-RW, etc), USB flash drives, online storage, etc. Each type of medium has different characteristics: size, speed, convenience, reliability, price, which you'll need to balance for a backup solution that's reasonable for you.
You may need multiple backup repositories or media, with one of them located off-site, away from where your computers normally live. Otherwise, if your house burns down, you'll lose all your backups too.
You need to verify that your backups work. It would be awkward to go to the effort and expense of making backups and then not be able to restore your data when you need to. You may even want to test your disaster recovery by pretending that all your computer stuff is gone, except for the backup media. Can you still recover? You'll want to do this periodically, to make sure your backup system keeps working.
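One simple way to check a test restore is to compare checksums of the live files against the restored copies. The following is a minimal sketch under some loud assumptions: it only looks at the contents of regular files (not permissions, timestamps, or symlinks), it assumes the live data has not changed since the backup was made, and the function names are invented for this example.

```python
import hashlib
import os


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def compare_trees(live_root, restored_root):
    """Return (missing, changed): files absent from, or different in, the restore."""
    live = tree_digests(live_root)
    restored = tree_digests(restored_root)
    missing = sorted(set(live) - set(restored))
    changed = sorted(p for p in live if p in restored and live[p] != restored[p])
    return missing, changed
```

If both lists come back empty, the restored tree has the same file contents as the live one. Anything listed is worth investigating before you need the backup for real.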
Most live data is precious in that you'll be upset if you lose it. Some live data is not precious: your web browser cache probably isn't, for example. This distinction can let you limit the amount of data you need to back up, which can significantly reduce your backup costs.
There is a very large variety of backup tools. They can be very simple and manual: you can copy files to a USB drive using your file manager, once in a blue moon. They can also be very complex: enterprise backup products that cost huge amounts of money and come with a multi-day training package for your sysadmin team, and which require that team to function properly.
You'll need to define a backup strategy to tie everything together: what live data to back up, to what medium, using what tools, what kind of backup history to keep, and how to verify that they work.
That's the groundwork. In the next episode, I'll blather about what is a backup, and what isn't.
Almost every bug tracker for free software projects has a section for wishlist bugs. Often this results in ever growing lists of wishes, most of which will never be fulfilled.
A long list of bugs, even if they are only wishlist bugs, is rarely useful. It's hard to keep track of what is there, and so the information is not kept up to date when things change. Even if a wishlist bug gets implemented, the bug is often overlooked, and remains on the list.
Joel Spolsky calls this a software inventory. Carrying a large inventory has costs, and usually results in slowed-down development. It increases the friction of doing things in a project.
I have some quite old wishlists for my own projects. I have even more wishlists hidden in my own GTD system. I am not happy about either, and I'm going to have to do something about that.
Just closing all wishlist bugs immediately would be bad manners. Keeping them open indefinitely, just in case someone will some day decide to look through the list in order to hack on something, is wishful thinking: that rarely happens.
So I'm thinking of a compromise: keep wishlist bugs open for a while, perhaps a few months, and if nothing is happening to them, close them with an apology.
There is no common way to express license information in each source file at the moment. Some people embed license information in each file, others keep it in README or another file at the top of the source tree.
Worse, there is no common syntax to express the license information in a machine-parseable way. If we had this, we could have tools that, for example, tell you if you're trying to merge code that has an incompatible license.
Obviously, this kind of thing can never work perfectly. People keep inventing new licenses, and it is not possible for a computer program to fully understand any license. It is not clear humans can do that, either. However, it would be possible to do it for a number of well-known licenses, which would help most of the time. A classic 20/80 situation.
I would like to suggest a syntax, similar to Emacs's "Hey Emacs" modelines, for embedding a summary of the license, or licenses, for a source file, in such a way that it can be programmatically extracted and parsed and analysed. With this syntax, one could then write a tool to ensure that all files in a project have the same license, or that all licenses are compatible with the project's overall license.
Such a tool will, of course, rely on heuristics and assumptions. For example, it needs to assume the machine-parseable license summary is correct, and rely on a ruleset on what licenses are compatible with what licenses. Things can go wrong. That's life. Remember, this is aiming at doing the 20% of the work that will work 80% of the time, not perfection.
I don't have a tool written, but I have a suggestion for the syntax.
/*
 * Copyright 2013 Lars Wirzenius
 *
 * tl;dr =*= Licenses: GPL-3+ or Expat, and Artistic =*=
 *
 * Blah. Blah. Blah. Imagine long, boring license texts
 * here.
 */
The important part is this:
=*= Licenses: GPL-3+ or Expat, and Artistic =*=
The =*= prefix and suffix and the word Licenses are there to
make grepping reasonably reliable without too many false positives,
and to allow comment characters and other text on the same line.
The actual license summary follows the syntax and semantics of the Debian copyright-format 1.0 specification, which I chose because it exists and has had a fair bit of review so far, and is reasonably expressive.
The license summaries can be extracted with the following GNU sed invocation:
sed -n '/.*=\*= [Ll]icen[cs]es\?: \(.*\)=\*=.*/s//\1/p'
I allowed various forms of the word license, since it's a word that a lot of people will get wrong, and it's easy to catch all four common forms.
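For comparison, here is a rough Python sketch of the same extraction, plus the simplest possible consistency check: grouping files by their license summary. It is not the tool proposed above (which does not exist yet), just an illustration. The function names are made up, the regular expression is a Python approximation of the sed pattern, and unlike the sed command this version strips surrounding whitespace from the summary.

```python
import re

# Approximation of the sed pattern: marker, a form of "license", then the summary.
LICENSE_RE = re.compile(r"=\*= [Ll]icen[cs]es?: (.*)=\*=")


def extract_license_summaries(text):
    """Return the license summaries found in text, one per matching line."""
    summaries = []
    for line in text.splitlines():
        match = LICENSE_RE.search(line)
        if match:
            summaries.append(match.group(1).strip())
    return summaries


def group_by_license(sources):
    """Group filenames by the first license summary found in each file.

    sources maps filename to file contents. Files without a summary
    are grouped under None. (Hypothetical helper for this example.)
    """
    groups = {}
    for filename, text in sources.items():
        summaries = extract_license_summaries(text)
        key = summaries[0] if summaries else None
        groups.setdefault(key, []).append(filename)
    return groups
```

If group_by_license returns more than one key, the project has files with differing (or missing) summaries. Deciding whether the differing licenses are actually compatible would need the ruleset discussed earlier, which this sketch does not attempt.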
So, does anyone else think this might be useful? Would you use it in your own projects?
I've made two new designs for Trunk Tees, my Cafepress store.
Thank you to Richard Braakman for suggesting the .* one.
Here's all the older designs as well:
I've just pushed out the release files for Obnam version 1.4, my backup application, and Larch, my B-tree library, which Obnam uses. They are available via my home page (http://liw.fi/). Since Debian is frozen, I am not uploading packages to Debian, but .deb files are available from my personal apt repository for intrepid explorers. (I will be uploading to Debian again after the freeze. I am afraid I'm too lazy to upload to experimental, or do backports. Help is welcome!)
From the Obnam NEWS file:
- The ls command now takes filenames as (optional) arguments, instead of a list of generations. Based on patch by Damien Couroussé.
- Even more detailed progress reporting during a backup.
- Add --fsck-skip-generations option to tell fsck to not check any generation metadata.
- The default log level is now INFO, instead of DEBUG. This is to be considered a quantum leap in the continuing rise of the maturity level of the software. (Actually, the change is there just to save some disk space and I/O for people who don't want to be involved in Obnam development and don't want to have massive log files.)
- The default sizes for the lru-size and upload-queue-size settings have been reduced, to reduce the memory impact of Obnam.
- obnam restore now reports transfer statistics at the end, similarly to what obnam backup does. Suggested by "S. B.".
Bug fixes:
- If listing extended attributes for a filesystem that does not support them, Obnam no longer crashes, just silently does not back up extended attributes. Which aren't there anyway.
- A bug in handling stat lookup errors was fixed. Reported by Peter Palfrader. Symptom: AttributeError: 'exceptions.OSError' object has no attribute 'st_ino' in an error message or log file.
- A bug in a restore crashing when failing to set extended attributes on the restored file was fixed. Reported by "S. B.".
- Made it clearer what is happening when unlocking the repository due to errors, and fixed it so that a failure to unlock is also an error. Reported by andrewsh.
- The dependency on Larch is now for 1.20121216 or newer, since that is needed for fsck to work.
- The manual page did not document the client name arguments to the add-key and remove-key subcommands. Reported by Lars Kruse.
- Restoring symlinks as root would fail. Reported and fixed by David Fries.
- Only set ssh user/port if explicitly requested, otherwise let ssh select them. Reported by Michael Goetze, fixed by David Fries.
- Fix problem with old version of paramiko and chdir. Fixed by Nick Altmann.
- Fix problems with signed vs unsigned values for struct stat fields. Reported by Henning Verbeek.
I've just released, to code.liw.fi, version 1.20130313 of cliapp, my Python framework for Unix-like command line programs. It contains the typical stuff such programs need to do, such as parsing the command line for options, and iterating over input files.
Version 1.20130313
- Add cliapp.Application.compute_setting_values method. This allows the application to have settings with values that are computed after configuration files and the command line are parsed.
- Cliapp now logs the Python version at startup, to aid debugging.
- cliapp.runcmd now logs much less during execution of a command. The verbose logging was useful while developing pipeline support, but has now not been useful for months.
- More default settings and options have an option group now, making --help output prettier.
- The --help output and the output of the help subcommand now only list summaries for subcommands. The full documentation for a subcommand can be seen by giving the name of the subcommand to help.
- Logging setup is now more overrideable. The setup_logging method calls setup_logging_handler_for_syslog, setup_logging_handler_for_syslog, or setup_logging_handler_to_file, and the last one calls setup_logging_format and setup_logging_timestamp to create the format strings for messages and timestamps. This allows applications to add, for example, more detailed timestamps easily.
- The process and system CPU times, and those of the child processes, and the process wall clock duration, are now logged when the memory profiling information is logged.
- Subcommands added with add_subcommand may now have aliases. Subcommands defined using Application class methods named cmd_* cannot have aliases.
- Settings and subcommands may now be hidden from --help and help output. New option --help-all and new subcommand help-all show everything.
- cliapp(5) now explains how --generate-manpage is used. Thanks to Enrico Zini for the suggestion.
- New function cliapp.ssh_runcmd for executing a command remotely over ssh. The function automatically shell-quotes the argv array given to it so that arguments with spaces and other shell meta-characters work over ssh.
- New function cliapp.shell_quote quotes strings for passing as shell arguments.
- cliapp.runcmd now has a new keyword argument: log_error. If set to false, errors are not logged. Defaults to true.
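To show why shell-quoting matters when running a command over ssh, here is a small sketch using only Python's standard shlex module. It does not use or reproduce cliapp's own functions, whose exact signatures are not shown here; remote_command_line and ssh_argv are hypothetical names for this example, and example.org in the usage note is a placeholder host. The point is that ssh hands the remote side a single string for a shell to parse, so each argument has to be quoted to survive that parsing.

```python
import shlex


def remote_command_line(argv):
    """Join argv into one string that a POSIX shell will split back into argv."""
    return " ".join(shlex.quote(arg) for arg in argv)


def ssh_argv(host, argv):
    """Build the local argv for running argv on host with the ssh client.

    The remote command is passed as a single, already-quoted string.
    """
    return ["ssh", host, remote_command_line(argv)]
```

For example, ssh_argv("example.org", ["ls", "my file"]) produces an argv whose last element is the single string ls 'my file', so the remote shell sees one filename containing a space rather than two separate arguments.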
Bug fixes:
- The process title is now set only if /proc/self/comm exists. Previously, on the kernel in Debian squeeze (2.6.32), setting the process title would fail, and the error would be logged to the terminal. Reported by William Boughton.
- A setting may no longer have a default value of None.
I've just released version 0.22 of ttystatus, my Python library for showing progress reporting and status updates on terminals, for (Unix) command line programs. Output is automatically adapted to the width of the terminal: truncated if it does not fit, and re-sized if the terminal size changes.
Available on code.liw.fi only, due to the Debian freeze.
Version 0.22, released 2013-03-12
- When the terminal size changes, ttystatus will now update the display at once.
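The general technique of fitting status output to the terminal can be sketched with Python's standard library. This is not ttystatus code and says nothing about how ttystatus is implemented; the function names are invented, and the sketch assumes every character occupies one column, which is not true for wide or combining characters.

```python
import shutil


def fit_to_width(text, width):
    """Truncate text so it occupies at most width columns."""
    if width <= 0:
        return ""
    return text[:width]


def render_status(text):
    """Fit text to the current terminal width, assuming 80 columns if unknown."""
    width = shutil.get_terminal_size(fallback=(80, 24)).columns
    return fit_to_width(text, width)
```

On Unix, a program is normally told about a terminal resize via the SIGWINCH signal, so a handler for that signal is one place where the status line could be re-rendered immediately.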
Daniel bought a Trunk Tees "Happiness is a depilated yak" hoodie. See the picture! (Won't embed it, since it's on Google Plush, sorry.)
Daniel told me how to set up scroll wheel emulation on a Thinkpad. Presumably it works on other systems as well, but the magic numbers for xinput may be different. I wrote a script to set it up, and configured my GNOME session to run it upon login.
The result: pressing the middle mouse button and moving the trackpoint up/down or left/right results in the applications receiving events as if a scroll wheel had been used.
Very, very handy. Thanks, Daniel.
#!/bin/sh
# Enable scroll wheel emulation for the Thinkpad trackpoint.
set -eu

# Find the xinput device id of the trackpoint.
id=$(xinput list | sed -n '/TPPS\/2 IBM TrackPoint/s/.*id=\([0-9]\+\).*/\1/p')

# Look up the numeric ids of the evdev wheel emulation properties.
emu=$(xinput list-props "$id" |
    sed -n '/Evdev Wheel Emulation (/s/.*(\([0-9]\+\)).*/\1/p')
but=$(xinput list-props "$id" |
    sed -n '/Evdev Wheel Emulation Button (/s/.*(\([0-9]\+\)).*/\1/p')
axs=$(xinput list-props "$id" |
    sed -n '/Evdev Wheel Emulation Axes (/s/.*(\([0-9]\+\)).*/\1/p')

# Turn emulation on, and use button 2 (the middle button) to trigger it.
xinput set-int-prop "$id" "$emu" 8 1
xinput set-int-prop "$id" "$but" 8 2
#xinput set-int-prop "$id" "$axs" 8 6 7 4 5