Background

I am a twice-failed backup software developer. I'm not currently intending to write new backup software, but I keep thinking about the technical problems in implementing backup software. This is the first in a series of blog posts about that.

In 2004 I set out to implement a new program for doing backups. This eventually became known as "Obnam version 1". (Its initial name was "backup script", or bs on the command line. I have been told I don't understand marketing.)

Obnam 1 was mildly successful, in that for some number of people it provided a backup solution reasonably well. I don't know how many people: I don't add tracking or surveillance to my software. Probably at least hundreds, based on Debian popcon data. Arguably, the biggest achievement of Obnam 1 is that it inspired better competitors.

Obnam 1 was implemented in Python. It did coarse de-duplication and client-side encryption, and could use an SSH server for storage using SFTP. I retired Obnam 1 in 2017, after it was no longer fun to work on as a hobby. The software was slow, and making changes was tedious.

In 2020, I realized I couldn't stop thinking about how to implement backup software, and I had recently learned the Rust language, so I started to build Obnam 2, in Rust. It was fun for a while. A couple of years later, I had lost momentum and energy for that, too. Like with Obnam 1, I had made software that kind of worked, but that was tedious to change. I've not officially retired Obnam 2, but that current code base does not seem like something I want to build on as a hobby.

In retrospect, I think the biggest thing that went wrong with both Obnam 1 and 2 is that I rushed to get to a state where I was able to use the software for my own needs, and so I made compromises that later turned out to be hard to undo or change. You might call this "technical debt", though I hope you don't, as I don't like that concept. I prefer to think of this as building a shaky foundation for my backup house, and picking the wrong kind of wood for the roof support beams. Changing any of that would require building a new house, by changing the old house one brick, plank, or nail at a time, while people were living in it. Doable, but not fun in a hobby project.

It's now 2024, and I still can't stop thinking about how to implement backup software. I'm beginning to suspect I may have a little bit of an obsession. At this point, it's probably best for me to concentrate on thinking about the problems, and their possible solutions, rather than actually building software. That's the funnest part of this.

I'm going to post my thoughts about this as a series of blog posts here on my personal blog. I don't know how long the series will be, nor how frequent the posts. For each post, I'll start a fediverse thread, in case anyone has comments on what I've written. I will also tag each post in this series with backup-impl, and you can subscribe to the RSS or Atom feed for that tag if you want to.

What are backups, anyway?

Backups are actually not important to anyone. What matters is that you can recover your data after your primary copy of it is corrupted or lost. Restoring is important. Rather than try to get people to adopt new terminology, I'll stick to talking about backups, but I wanted to make this point early on.

I'll use the following terminology in this blog series.

  • Primary copy of your data is the one you work with. It's on your laptop, desktop, server, phone, or other computing device. If you need to look up or modify a document, photo, or whatever, the primary copy is what you use.
  • Backup copy is an independent snapshot of the primary copy at a given time, and is what you recover your data from in case of an emergency.
  • Restore is the process of recovering your data from a backup copy.

It's important that the backup copy is independent from the primary copy. This means that, say, a database replica that gets updated automatically whenever the primary database is updated is not a backup. Likewise, a RAID array is not a backup. Both of these are good for other kinds of disaster recovery, but they don't let you recover data you've deleted or corrupted, because the deletion or corruption is copied along with everything else.

A copy of the data on the same hard drive can be considered a backup copy, for some disaster scenarios. It protects you against the primary copy being corrupted or deleted, but not against the hard drive failing. It's up to you to decide what threats you want your backups to protect against, and this will inform you of whether you need a backup copy on a different hard drive, in a different computer, on a different continent, or possibly in a different universe.

An important point about backups is that you don't know that you have a valid backup unless and until you have successfully verified that you can restore the data.
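
To make that concrete, here is a minimal sketch in Python of what "verifying a restore" could mean: restore into a scratch directory, then compare content hashes of every file against the primary copy. The function names tree_digests and restore_problems are made up for this illustration, and the sketch assumes the primary copy has not changed since the backup was made. It only compares the contents of regular files; metadata, symbolic links, and empty directories are ignored.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in blocks so large files do not have to fit in memory.
                for block in iter(lambda: f.read(65536), b""):
                    h.update(block)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def restore_problems(primary_root, restored_root):
    """Return sorted relative paths that are missing, extra, or different."""
    primary = tree_digests(primary_root)
    restored = tree_digests(restored_root)
    return sorted(
        path
        for path in primary.keys() | restored.keys()
        if primary.get(path) != restored.get(path)
    )
```

An empty list from restore_problems means every file came back byte for byte; anything else names the files to investigate.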

Table stakes for backups

When I think about backups, I have a bunch of assumptions that are usually unstated. Unstated assumptions lead to confusing discussion. Here are some of the assumptions I make, made explicit:

  • The user has precious primary data on their device, stored in files in a file system.
    • data that is not precious does not need to be backed up
    • the user decides what is precious for them; the backup system assumes everything is, unless told otherwise
    • I'm not currently concerned about data in memory, on other devices, in actively updated databases, or other such scenarios; they are not unimportant, but out of scope for me, at least for now
  • The user can make a backup to local storage, or a remote server.
    • local storage is anything that the backup software can access via the file system, and is probably something like a USB drive
    • remote server is accessed over the network in some manner, probably using an HTTP API, with the API provided by some backup software component that also does access control
    • I am not concerned with non-file system storage, such as tapes.
  • Backups are encrypted and authenticated on the client.
    • the backup software can verify, using cryptography, that the backup data it retrieves from backup storage is what was put into the storage and hasn't been modified in between
    • a backup server, if one is used, does not have access to, or care about, the contents of the backed up data
  • Users should not have to trust the backup server more than they have to.
    • they inherently have to trust that the server doesn't intentionally delete or corrupt backups
    • they should not have to trust that the server doesn't snoop on the users, because the backups are encrypted on the client
  • Users who trust each other can share the backup storage in a way that allows them to share backed up data.
    • if Alice and Bob both have a copy of the same large file, and Alice makes a backup of it first, Bob should not have to back it up again
    • this is called "de-duplication", of which I will have more to say later
    • of course, Alice and Bob might just be two devices owned by the same person, instead of being different people, but from the backup software point of view, this seems like an unimportant distinction
    • this mutual de-duplication only applies to users who opt to trust each other
    • it may be too difficult a problem to design a backup system that allows mutually hostile users to share the backup storage, and I'm not going to try to think about that; I'm not even sure it makes sense for mutually hostile people to share backed up data, even if it's an interesting technical problem of how to do that
  • When a user is facing a disaster, their backup system should require them to have as few things as possible to recover. Ideally, the user should know where their backups are, their credentials to access their backup storage and their encryption keys.
    • I do not want to assume the user necessarily has a copy of an encryption key, or an encryption device. Ideally, if the user remembers their backup server, username, and passphrase, they should be able to recover their data.
    • However, this is also something that different people have different needs about. The backup system should be able to cater to different needs here.
  • I'm not interested in backup systems that assume a specific file system or storage technology, such as btrfs send. I don't want restores to be tied to using the same file system where the backup was made.
  • I want backups that are independent snapshots, not merely a delta against the previous backup. Deltas become hard to manage and limit the run time performance of the backup system. Snapshots make it easy to remove specific backups, or to browse them.
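
The de-duplication and independent-snapshot assumptions above fit together, and a toy sketch in Python may make that easier to see. Everything here is hypothetical and invented for illustration (the ChunkStore class, the fixed chunk size, the in-memory dictionaries); it is not how Obnam works, and it ignores encryption entirely. Chunks are stored once, keyed by a hash of their content, and each snapshot is a complete manifest listing every chunk of every file, so no snapshot depends on any other.

```python
import hashlib


class ChunkStore:
    """Toy in-memory backup storage: chunks are stored once, keyed by content hash."""

    def __init__(self):
        self.chunks = {}     # chunk id (SHA-256 hex) -> chunk bytes
        self.snapshots = {}  # snapshot name -> {file path: [chunk ids]}

    def put_chunk(self, data):
        chunk_id = hashlib.sha256(data).hexdigest()
        # Storing the same content twice is a no-op: this is the de-duplication.
        self.chunks.setdefault(chunk_id, data)
        return chunk_id

    def backup(self, name, files, chunk_size=4):
        """Record a complete snapshot: every file lists all of its chunks."""
        manifest = {}
        for path, content in files.items():
            manifest[path] = [
                self.put_chunk(content[i:i + chunk_size])
                for i in range(0, len(content), chunk_size)
            ]
        self.snapshots[name] = manifest

    def restore(self, name):
        """Rebuild every file in a snapshot without consulting any other snapshot."""
        return {
            path: b"".join(self.chunks[c] for c in ids)
            for path, ids in self.snapshots[name].items()
        }

    def forget(self, name):
        """Remove one snapshot, then drop chunks no remaining snapshot refers to."""
        del self.snapshots[name]
        used = {c for m in self.snapshots.values() for ids in m.values() for c in ids}
        self.chunks = {c: d for c, d in self.chunks.items() if c in used}
```

If Alice backs up a twelve-byte file and Bob then backs up an identical copy, Bob's backup adds no new chunks to the store. Forgetting Alice's snapshot afterwards leaves Bob's fully restorable, which is the property a chain of deltas does not give you for free.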

These are all things I'm reluctant to change, or to make compromises on. Your assumptions may be different, and that's OK, but this is my blog, and my thought process, and my assumptions apply.
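
Two of the assumptions above, that backups are authenticated on the client and that a remembered passphrase should be enough to restore, can be sketched with nothing but the Python standard library. This is an illustration under loud assumptions, not a design: the function names derive_key, seal, and unseal are made up, the iteration count is an arbitrary choice, and the sketch only authenticates data, it does not encrypt it. A real system would more likely use an authenticated encryption scheme from a vetted cryptography library.

```python
import hashlib
import hmac


def derive_key(passphrase, salt, iterations=600_000):
    """Derive a 32-byte key from a passphrase using PBKDF2-HMAC-SHA256.

    The salt is not secret, so it can be stored next to the backup.
    """
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode("utf-8"), salt, iterations)


def seal(key, chunk):
    """Prefix a chunk with an HMAC-SHA256 tag before it goes to backup storage."""
    tag = hmac.new(key, chunk, hashlib.sha256).digest()
    return tag + chunk


def unseal(key, blob):
    """Return the chunk if its tag verifies; raise ValueError if it was modified."""
    tag, chunk = blob[:32], blob[32:]
    expected = hmac.new(key, chunk, hashlib.sha256).digest()
    # compare_digest avoids leaking where the tags differ through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("backup data failed authentication")
    return chunk
```

Because the key is recomputed from the passphrase and the stored salt, the user does not need to have kept a key file to check, after a disaster, that what comes back from storage is what they put there.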

Feedback

This blog post and its possible follow-ups are just me thinking aloud. I make no promises about implementing any of this, ever. I'd very much like to, but I don't know if I will have the time and energy. If I do, I might build something only for myself. However, if you'd like to pay me to build a backup system for you, I'm happy to invoice for my time via my company. (The last sentence was blatant advertising that your ad blocker didn't detect.)

I'm not looking for suggestions on what backup software to use. Please don't suggest solutions.

I would be happy to hear other people's thoughts about backup software implementation. Or what needs and wants they have for backup solutions.

If you have any feedback on this post, please post it in the fediverse thread for this post.