This post is part of a series on backup software implementation. See
the backup-impl
tag for a list of all posts in
the series.
Very high level architectural assumptions
The following assumptions about the software architecture of a backup system are less firmly set in stone than the table stakes in my first post. However, having spent two decades thinking about this, I find that they make sense. If you think I'm wrong, feel free to tell me how and why (see the end of this post for how).
- Backed up data is split into chunks of a suitable size. This makes
  de-duplication simple: by splitting files into chunks in just the
  right way, identical data that occurs in multiple places can be
  stored only once in the backup. The simplest example of this is
  when a file is renamed, but not otherwise modified. A sensible
  backup system will notice the rename and store only the new name,
  not all the data in the file all over again. (There is a sketch of
  this after the list.)
- De-duplication can be done at a fine granularity or a coarse one. There are a number of approaches here. At this high level of architectural thinking, we don't need to care how the splitting into chunks happens. We do need to take care that the size of chunks can vary and that the backup storage does not depend on the specifics of chunk splitting.
- There are ways to do "content sensitive" chunk splitting so that the same bit of data is recognized as a chunk even if it's preceded by other data (see the content-defined chunking sketch after the list). This is exciting, but I don't know of any research into how much duplicate data this actually finds in real data sets. A flexible backup system might need to support many ways to split data into chunks, so that the optimal method is used for each subset of the precious data being backed up.
- I note that the finest possible granularity here is the bit, but it would be ridiculous to go that far. However the backup system is implemented, each chunk is going to incur some overhead, and if the chunks are too small, even the slightest overhead is going to be too much. A backup system needs to strike a suitable balance here.
- To achieve de-duplication, the backup system needs a way to detect
  that two chunks are identical. A popular way to do this is to use a
  cryptographically secure checksum, or hash, such as
  SHA3. An important feature of such hashes is that if two chunks
  have the same hash, they are almost certainly identical in content
  (if the hashes are different, the chunks are certainly different).
  It can be much more efficient to compute and compare hashes than to
  retrieve and compare chunk data. This is probably good enough for
  most people most of the time. (See the hashing sketch after the
  list.)
- However, for the people who do research into hash function collisions, it's not good enough. It makes for a sad researcher who spends a century of CPU time creating a hash collision, makes a backup of the generated data, and, when restoring it, finds out that the backup system decided that the two files with the same checksum were in fact identical.
- A backup system could make this configurable, possibly on a per-directory basis. A hash collision researcher can mark the directory where they store hash collisions as "compare to de-duplicate" (there is a sketch of this after the list).
- I admit this is a very rare use case, but it preys on my mind. At this level of software architectural thinking, the crucial point is whether to make the backup system use content hashes as the only chunk identifiers, or if chunk identifiers should be independent of the content.
- I really like the SSH
  protocol and its
  SFTP
  sub-system for data transfer. I don't particularly like it for
  accessing a backup server. The needs of a backup system are
  sufficiently different from the needs of generic remote file
  transfer and file system access that I don't recommend using SFTP
  for this. For example, it's tricky to set up an SFTP server that
  allows a backup client to make a new backup, or to restore an
  existing backup, but does not allow it to delete a backup. It makes
  more sense to me to build a custom HTTP API for the backup server
  (there is a sketch of such an API after the list).
- It seems important to me that one can authorize one's various devices to make new backups automatically, but not allow them to delete old backups. This mitigates the situation where a device is compromised. A compromised client can't destroy data that has already been backed up, even if it can make new backups with nonsense or corrupt data.
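To make some of the points above more concrete, here are a few rough sketches in Python. They are illustrations of the ideas only, not designs for an actual implementation; every name, size, and endpoint in them is made up. The first sketch shows chunk-based de-duplication with fixed-size chunks and an in-memory store, including the rename example.

```python
# A minimal sketch of chunk-based de-duplication, assuming fixed-size
# chunks and an in-memory store. All names and sizes are illustrative.
import os

CHUNK_SIZE = 64 * 1024  # an arbitrary example size


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Store each distinct chunk once; a file is an ordered list of chunk keys."""

    def __init__(self) -> None:
        # A real system would use a separate chunk identifier (see the
        # hashing sketch below); here the chunk content is its own key.
        self.chunks: dict[bytes, bytes] = {}
        self.files: dict[str, list[bytes]] = {}

    def backup_file(self, name: str, data: bytes) -> None:
        keys = []
        for chunk in split_into_chunks(data):
            self.chunks.setdefault(chunk, chunk)  # stored once, however often it occurs
            keys.append(chunk)
        self.files[name] = keys


# Backing up a renamed but unmodified file stores only the new name:
store = ChunkStore()
data = os.urandom(200_000)
store.backup_file("report.txt", data)
chunks_before = len(store.chunks)
store.backup_file("report-final.txt", data)
assert len(store.chunks) == chunks_before  # no chunk data was stored again
```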
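The second sketch is of content-defined chunking, assuming a simple gear-style rolling hash. The table, masks, and size limits here are arbitrary; a real chunker would tune them carefully.

```python
# A minimal sketch of content-defined chunking with a gear-style rolling
# hash; all constants are arbitrary examples.
import os
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one random value per byte value
MASK = (1 << 13) - 1                 # cut when the low 13 bits are zero: ~8 KiB average
MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024


def cdc_chunks(data: bytes):
    """Yield chunks whose boundaries depend on the content, not on absolute offsets."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]


# Prepending a few bytes shifts every offset, yet most chunks are still recognized:
original = os.urandom(1_000_000)
shifted = os.urandom(10) + original
shared = set(cdc_chunks(original)) & set(cdc_chunks(shifted))
print(len(shared), "chunks shared despite the shifted offsets")
```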
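The third sketch uses a cryptographic hash, SHA3-256 from Python's standard hashlib, as the chunk identifier. Two chunks with the same digest are treated as identical, so comparing short digests replaces retrieving and comparing chunk contents.

```python
# A minimal sketch of de-duplication by content hash; the store is an
# in-memory stand-in for real backup storage.
import hashlib


class HashedChunkStore:
    """Chunks are identified and de-duplicated by their SHA3-256 digest."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}  # hex digest -> chunk data

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha3_256(chunk).hexdigest()
        # Digest comparison is cheap: the already-stored chunk never has to
        # be fetched to decide that the new chunk is a duplicate.
        if digest not in self.chunks:
            self.chunks[digest] = chunk
        return digest

    def get(self, digest: str) -> bytes:
        return self.chunks[digest]


store = HashedChunkStore()
first = store.put(b"some backed up data")
second = store.put(b"some backed up data")  # duplicate: nothing new is stored
assert first == second and len(store.chunks) == 1
```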
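The fourth sketch is of the "compare to de-duplicate" option: when it is enabled for a directory, a matching digest is not trusted on its own, and the stored chunk is compared byte by byte before the new chunk is discarded. Colliding chunks with different content are then kept as separate copies under the same digest.

```python
# A minimal sketch of optional byte-by-byte verification on digest match;
# the function and parameter names are invented for this illustration.
import hashlib


def store_chunk(store: dict[str, list[bytes]], chunk: bytes, *, verify_bytes: bool) -> tuple[str, int]:
    """Store a chunk and return (digest, copy index) identifying it.

    With verify_bytes=False, a digest match alone counts as identity.
    With verify_bytes=True, the actual bytes are compared, so two chunks
    that collide on the digest are stored as separate copies.
    """
    digest = hashlib.sha3_256(chunk).hexdigest()
    copies = store.setdefault(digest, [])
    for index, existing in enumerate(copies):
        if not verify_bytes or existing == chunk:
            return digest, index      # re-use a stored copy
    copies.append(chunk)              # first copy, or a genuinely different colliding chunk
    return digest, len(copies) - 1


# A directory of hash-collision research data would set verify_bytes=True:
chunks: dict[str, list[bytes]] = {}
store_chunk(chunks, b"collision candidate A", verify_bytes=True)
store_chunk(chunks, b"collision candidate A", verify_bytes=True)  # byte-identical, re-used
```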
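The last sketch is of the permission model for a custom HTTP API. The routes and scope names are entirely made up; the point is that a device's credentials can allow making new backups and restoring, while deletion requires a scope that automated clients are never given.

```python
# A minimal sketch of a backup server's HTTP endpoints and the scopes a
# client token needs for them; all routes and scope names are invented.
API_SCOPES = {
    ("POST", "/chunks"): "backup",          # upload a chunk
    ("POST", "/backups"): "backup",         # record a new backup generation
    ("GET", "/backups"): "restore",         # list existing backups
    ("GET", "/chunks/{id}"): "restore",     # fetch a chunk during restore
    ("DELETE", "/backups/{id}"): "delete",  # remove an old backup
}


def is_allowed(token_scopes: set[str], method: str, route: str) -> bool:
    """Decide whether a client with the given scopes may call an endpoint."""
    required = API_SCOPES.get((method, route))
    return required is not None and required in token_scopes


# A device authorized to back up and restore, but never to delete:
device_scopes = {"backup", "restore"}
assert is_allowed(device_scopes, "POST", "/chunks")
assert not is_allowed(device_scopes, "DELETE", "/backups/{id}")
```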
Feedback
I'm not looking for suggestions on what backup software to use. Please don't suggest solutions.
I would be happy to hear other people's thoughts about backup software implementation, or what needs and wants they have for backup solutions.
If you have any feedback on this post, please share it in the fediverse thread for this post.