This post is part of a series on backup software implementation. See
the backup-impl
tag for a list of all posts in
the series.
Very high level architectural assumptions
The following assumptions about the software architecture of a backup system are less firmly set in stone than the table stakes in my first post. However, having spent two decades thinking about this, I find that they make sense. If you think I'm wrong, feel free to tell me how and why (see the end of this post for how).
- Backed up data is split into chunks of a suitable size. This makes
  de-duplication simple: by splitting files into chunks in just the
  right way, identical data that occurs in multiple places can be
  stored only once in the backup. The simplest example of this is
  when a file is renamed, but not otherwise modified. A sensible
  backup system will notice the rename and store only the new name,
  not all the data in the file all over again. (There is a sketch of
  this after the list.)
- De-duplication can be done at a fine granularity or a coarse one. There are a number of approaches here. At this high level of architectural thinking, we don't need to care how the splitting into chunks happens. We do need to take care that the size of chunks can vary and that the backup storage does not depend on the specifics of chunk splitting.
- There are ways to do "content sensitive" chunk splitting so that the same bit of data is recognized as a chunk even if it's preceded by other data (see the content-defined chunking sketch after the list). This is exciting, but I don't know of any research into how much duplicate data this actually finds in real data sets. A flexible backup system might need to support many ways to split data into chunks, so that the optimal method is used for each subset of the precious data being backed up.
- I note that the finest possible granularity here is the bit, but it would be ridiculous to go that far. However the backup system is implemented, each chunk is going to incur some overhead, and if the chunks are too small, even the slightest overhead is going to be too much. A backup system needs to strike a suitable balance here.
- To achieve de-duplication, the backup system needs a way to detect
  that two chunks are identical. A popular way to do this is to use a
  cryptographically secure checksum, or hash, such as
  SHA3. An important feature of such hashes is that if two chunks
  have the same hash, they are almost certainly identical in content
  (if the hashes are different, the chunks are certainly different).
  It can be much more efficient to compute and compare hashes than to
  retrieve and compare chunk data. This is probably good enough for
  most people most of the time. (See the hashing sketch after the
  list.)
- However, for the people who do research into hash function collisions, it's not good enough. It makes for a sad researcher who spends a century of CPU time creating a hash collision, makes a backup of the generated data, and, when restoring it, finds out that the backup system decided that the two files with the same checksum were in fact identical.
- A backup system could make this configurable, possibly on a per-directory basis. A hash collision researcher can mark the directory where they store hash collisions as "compare to de-duplicate" (there is a sketch of this after the list).
- I admit this is a very rare use case, but it preys on my mind. At this level of software architectural thinking, the crucial point is whether to make the backup system use content hashes as the only chunk identifiers, or if chunk identifiers should be independent of the content.
- I really like the SSH
  protocol and its
  SFTP
  sub-system for data transfer. I don't particularly like it for
  accessing a backup server. The needs of a backup system are
  sufficiently different from the needs of generic remote file
  transfer and file system access that I don't recommend using SFTP
  for this. For example, it's tricky to set up an SFTP server that
  allows a backup client to make a new backup, or to restore an
  existing backup, but does not allow it to delete a backup. It makes
  more sense to me to build a custom HTTP API for the backup server
  (there is a sketch of such an API after the list).
- It seems important to me that one can authorize one's various devices to make new backups automatically, but not allow them to delete old backups. This mitigates the situation where a device is compromised. A compromised client can't destroy data that has already been backed up, even if it can make new backups with nonsense or corrupt data.
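To make some of the points above more concrete, here are a few rough sketches in Python. They are illustrations of the ideas only, not designs for an actual implementation; every name, size, and endpoint in them is made up. The first sketch shows chunk-based de-duplication with fixed-size chunks and an in-memory store, including the rename example.

```python
# A minimal sketch of chunk-based de-duplication, assuming fixed-size
# chunks and an in-memory store. All names and sizes are illustrative.
import os

CHUNK_SIZE = 64 * 1024  # an arbitrary example size


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Store each distinct chunk once; a file is an ordered list of chunk keys."""

    def __init__(self) -> None:
        # A real system would use a separate chunk identifier (see the
        # hashing sketch below); here the chunk content is its own key.
        self.chunks: dict[bytes, bytes] = {}
        self.files: dict[str, list[bytes]] = {}

    def backup_file(self, name: str, data: bytes) -> None:
        keys = []
        for chunk in split_into_chunks(data):
            self.chunks.setdefault(chunk, chunk)  # stored once, however often it occurs
            keys.append(chunk)
        self.files[name] = keys


# Backing up a renamed but unmodified file stores only the new name:
store = ChunkStore()
data = os.urandom(200_000)
store.backup_file("report.txt", data)
chunks_before = len(store.chunks)
store.backup_file("report-final.txt", data)
assert len(store.chunks) == chunks_before  # no chunk data was stored again
```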
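The second sketch is of content-defined chunking, assuming a simple gear-style rolling hash. The table, masks, and size limits here are arbitrary; a real chunker would tune them carefully.

```python
# A minimal sketch of content-defined chunking with a gear-style rolling
# hash; all constants are arbitrary examples.
import os
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one random value per byte value
MASK = (1 << 13) - 1                 # cut when the low 13 bits are zero: ~8 KiB average
MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024


def cdc_chunks(data: bytes):
    """Yield chunks whose boundaries depend on the content, not on absolute offsets."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]


# Prepending a few bytes shifts every offset, yet most chunks are still recognized:
original = os.urandom(1_000_000)
shifted = os.urandom(10) + original
shared = set(cdc_chunks(original)) & set(cdc_chunks(shifted))
print(len(shared), "chunks shared despite the shifted offsets")
```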
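The third sketch uses a cryptographic hash, SHA3-256 from Python's standard hashlib, as the chunk identifier. Two chunks with the same digest are treated as identical, so comparing short digests replaces retrieving and comparing chunk contents.

```python
# A minimal sketch of de-duplication by content hash; the store is an
# in-memory stand-in for real backup storage.
import hashlib


class HashedChunkStore:
    """Chunks are identified and de-duplicated by their SHA3-256 digest."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}  # hex digest -> chunk data

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha3_256(chunk).hexdigest()
        # Digest comparison is cheap: the already-stored chunk never has to
        # be fetched to decide that the new chunk is a duplicate.
        if digest not in self.chunks:
            self.chunks[digest] = chunk
        return digest

    def get(self, digest: str) -> bytes:
        return self.chunks[digest]


store = HashedChunkStore()
first = store.put(b"some backed up data")
second = store.put(b"some backed up data")  # duplicate: nothing new is stored
assert first == second and len(store.chunks) == 1
```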
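The fourth sketch is of the "compare to de-duplicate" option: when it is enabled for a directory, a matching digest is not trusted on its own, and the stored chunk is compared byte by byte before the new chunk is discarded. Colliding chunks with different content are then kept as separate copies under the same digest.

```python
# A minimal sketch of optional byte-by-byte verification on digest match;
# the function and parameter names are invented for this illustration.
import hashlib


def store_chunk(store: dict[str, list[bytes]], chunk: bytes, *, verify_bytes: bool) -> tuple[str, int]:
    """Store a chunk and return (digest, copy index) identifying it.

    With verify_bytes=False, a digest match alone counts as identity.
    With verify_bytes=True, the actual bytes are compared, so two chunks
    that collide on the digest are stored as separate copies.
    """
    digest = hashlib.sha3_256(chunk).hexdigest()
    copies = store.setdefault(digest, [])
    for index, existing in enumerate(copies):
        if not verify_bytes or existing == chunk:
            return digest, index      # re-use a stored copy
    copies.append(chunk)              # first copy, or a genuinely different colliding chunk
    return digest, len(copies) - 1


# A directory of hash-collision research data would set verify_bytes=True:
chunks: dict[str, list[bytes]] = {}
store_chunk(chunks, b"collision candidate A", verify_bytes=True)
store_chunk(chunks, b"collision candidate A", verify_bytes=True)  # byte-identical, re-used
```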
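The last sketch is of the permission model for a custom HTTP API. The routes and scope names are entirely made up; the point is that a device's credentials can allow making new backups and restoring, while deletion requires a scope that automated clients are never given.

```python
# A minimal sketch of a backup server's HTTP endpoints and the scopes a
# client token needs for them; all routes and scope names are invented.
API_SCOPES = {
    ("POST", "/chunks"): "backup",          # upload a chunk
    ("POST", "/backups"): "backup",         # record a new backup generation
    ("GET", "/backups"): "restore",         # list existing backups
    ("GET", "/chunks/{id}"): "restore",     # fetch a chunk during restore
    ("DELETE", "/backups/{id}"): "delete",  # remove an old backup
}


def is_allowed(token_scopes: set[str], method: str, route: str) -> bool:
    """Decide whether a client with the given scopes may call an endpoint."""
    required = API_SCOPES.get((method, route))
    return required is not None and required in token_scopes


# A device authorized to back up and restore, but never to delete:
device_scopes = {"backup", "restore"}
assert is_allowed(device_scopes, "POST", "/chunks")
assert not is_allowed(device_scopes, "DELETE", "/backups/{id}")
```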
Feedback
I'm not looking for suggestions on what backup software to use. Please don't suggest solutions.
I would be happy to hear other people's thoughts about backup software implementation, or what needs and wants they have for backup solutions.
If you have any feedback on this post, please share it in the fediverse thread for this post.