This post is part of a series on backup software implementation. See
the backup-impl
tag for a list of all posts in
the series.
For this part I have not been able to allocate enough time or energy to do deep thinking, so I'm going to list some ideas that I think are important. I may return to them later.
Content-sensitive chunking
My understanding is that both Borg backup and Restic split files into chunks in a way that finds duplicate chunks regardless of where in a file the chunk is. This can make de-duplication much more efficient.
I think the basic approach is to compute a weak rolling checksum over a sliding window at every byte, and when the low N bits of the checksum are all zero, that byte ends the chunk.
I've not implemented this myself, but I hear it works well.
I don't know of any research into how well it works. I'd be interested in reading about which checksum algorithm, with which value of N, works best for which type of data. If nobody has researched this yet, I think it'd make an interesting topic for a BSc or MSc thesis. (If you know of such research, I would appreciate a pointer!)
Real-time backups
In an ideal situation, backups can happen while a computer is in use. If you are in a meeting, working at your desk, or sitting in a cafe, backups happen while you work. When you're ready to leave, you suspend or turn off your computer and any work you've done is already backed up.
There are many technical problems to solve to achieve this, but it's an interesting goal.
Read-time restores
When you need to restore all of a backup, such as when setting up a new computer, the process can take a very long time. In a calm, serene situation, it's easy to wait for that to happen. It's an opportunity to have some tea, and contemplate the beauty of a flower. For those who need to restore their data to prevent the apocalypse, it would be convenient to be able to start using the computer as soon as possible.
This can be achieved by at least two different approaches:
- First restore the bits that you need now, then let the rest be restored at leisure. This would be fairly simple to implement, but requires knowing what you need first.
- Have a way to use the backed up data without restoring it, such as by mounting the backup as an external disk. This is again fairly simple to implement, for read-only use.
For normal read-write use, a more sophisticated and complicated (and thus more fragile and error-prone) approach could be developed: an overlay file system on the new computer, mounted over the file system where data is being restored. When you use a file that hasn't been restored yet, it gets read from the backup. If it has already been restored, it's served from the local disk. If you write to a file, that's done in a copy-on-write manner. Any file you use gets bumped to the head of the restore queue.
The happy scenario is this:
- You get a new laptop.
- You install an operating system. I've managed to automate this and make it fast: as little as five minutes for a minimal installation.
- You start the process of restoring all the data in your home directory.
- You log in and start using the computer normally. The restore happens in the background without interfering with your use of the computer.
- From getting the laptop to being able to use it takes only a few minutes.
This is probably quite difficult to implement. I don't expect to even think about it any time soon, but it's an interesting problem.
Backup server with mutually distrusting users
Imagine Alice and Bob, who have a deep, mutual hatred of each other. They would both gladly do some things to annoy the other, or to cause the other to lose data, or access to their backups.
Can they trust the same backup server? Under what conditions? Can they, even, share backed up data, without opening an attack vector for the other?
I don't know. This is, again, an interesting problem. I'm not going to think about it until I have thought about implementing backups for people who have mutual trust.
Trusting a backup server
Speaking of trust, if you trust all other users of a backup server, but you don't run the server yourself, how much and in what ways do you have to trust the server and its operator?
I think the following are going to be necessary at minimum:
- You trust that the server doesn't remove data on its own authority. Ideally, the backup software can verify that the backup storage contains all the backups that the user expects it to contain, but this is tricky to achieve in a scenario where the user has lost everything, except access to the backup storage.
- You trust that the server or backup storage is available when you need it to be, to make a new backup or to restore data.
There may be more, but that's what I can think of so far.
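The verification mentioned in the first point could, mechanically, be a simple comparison between the backup generations the client expects and what the server lists. The hard part is where the expected list lives when the user has lost everything; the names below are illustrative, not any real tool's API.

```python
def missing_backups(expected_ids, server_listing):
    """Return the backups the client expects but the server no longer lists."""
    return sorted(set(expected_ids) - set(server_listing))

# Hypothetical example: the client expects three generations,
# but the server only lists two of them.
missing = missing_backups({"gen-1", "gen-2", "gen-3"},
                          {"gen-1", "gen-3"})
```

For this to catch a malicious server, the expected list would itself need to be stored somewhere the server can't rewrite, which is exactly the tricky part.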
I think the following don't require trusting the server:
- The server doesn't modify backups stored on it. This is easy to guard against using authenticated encryption.
- The server doesn't inspect or leak backed up data. Again, encryption guards against this (specifically, client-side encryption).
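The tamper-detection half of this can be illustrated with a client-held key. Real backup tools use authenticated encryption (e.g. AES-GCM or ChaCha20-Poly1305) so the server can neither read nor modify chunks; this standard-library-only sketch shows only the integrity check, not confidentiality.

```python
import hmac
import hashlib

def seal(key: bytes, chunk: bytes) -> bytes:
    """What the client uploads: the chunk plus a MAC the server can't forge."""
    tag = hmac.new(key, chunk, hashlib.sha256).digest()
    return tag + chunk

def open_sealed(key: bytes, blob: bytes) -> bytes:
    """Verify on download; raise if the server altered anything."""
    tag, chunk = blob[:32], blob[32:]
    expected = hmac.new(key, chunk, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("backup chunk was modified in storage")
    return chunk
```

Since the key never leaves the client, the server can drop or withhold chunks (an availability problem, covered by the trust points above) but cannot silently substitute different content.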
Security is difficult, but important. Ideally, I'd develop threat models and such for this, but we'll see; I'm not a security expert. This is where my thinking currently stands.
Feedback
I'm not looking for suggestions on what backup software to use. Please don't suggest solutions.
I would be happy to hear other people's thoughts about backup software implementation. Or what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.