This post is part of a series on backup software implementation. See the backup-impl tag for a list of all posts in the series.
Updates on previous points
I got some useful feedback on my previous two posts.
Hash function
Ed Davies asked why the hash function needs to be cryptographically secure. I realized that I had mixed up two things: accidental collisions, which don't require security, and attacks, which do. For avoiding accidental collisions, any hash function with a large, well-distributed output will do, even one such as MD5 that is no longer considered cryptographically secure.
However, because a backup program can't safely assume the data it operates on is benign, it needs to be secure against malicious data provided by an attacker. Web browsers, local mail user agents, file downloads, etc., are ways in which an attacker may inject malicious data onto a user's system. In this context, the malicious data would be data constructed to cause a hash collision with data that the user already has.
If the backup software relies only on the hash function, the malicious data might prevent valuable data from being backed up: the software would see a matching hash, assume the data is already stored, and skip it. The two ways I know of to prevent that are to use a cryptographically secure hash function, or to compare the data when hashes match. Data comparison can be very expensive, as it requires downloading backed up data from the server. Thus, unless the user is willing to pay the cost of comparison, using a cryptographically secure hash function makes sense.
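The two approaches can be sketched roughly as follows. This is a minimal illustration in Python, not a design for a real backup system: the names ChunkStore, put, and compare_on_match are hypothetical, and an in-memory dictionary stands in for the backup server, so the comparison here is cheap where it would really mean a download.

```python
import hashlib


class ChunkStore:
    """Hypothetical in-memory stand-in for a backup server's chunk storage."""

    def __init__(self, hash_name="sha256", compare_on_match=False):
        # hash_name is passed to hashlib.new(); SHA-256 is assumed here as
        # an example of a cryptographically secure hash function.
        self.hash_name = hash_name
        self.compare_on_match = compare_on_match
        self.chunks = {}  # chunk id (hex digest) -> chunk bytes

    def chunk_id(self, data):
        """Identify a chunk by the hex digest of its contents."""
        return hashlib.new(self.hash_name, data).hexdigest()

    def put(self, data):
        """Store data unless a chunk with the same id already exists.

        Returns (chunk_id, stored_new). With compare_on_match enabled, a
        matching id with different contents raises ValueError instead of
        silently skipping the new data.
        """
        cid = self.chunk_id(data)
        existing = self.chunks.get(cid)
        if existing is None:
            self.chunks[cid] = data
            return cid, True
        if self.compare_on_match and existing != data:
            # In a real system this comparison is the expensive step,
            # because `existing` would have to be fetched from the server.
            raise ValueError(f"hash collision on chunk {cid}")
        return cid, False
```

Without compare_on_match, a colliding chunk is treated as already backed up and is dropped without any error, which is exactly the failure an attacker would aim for. With a cryptographically secure hash, constructing such a collision is assumed to be infeasible, so the comparison can be skipped.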
Storage location
Jonathan McDowell raised the point that where backups are stored can be crucial. In this series of blog posts I've mostly been thinking about how backups are implemented, and ignoring how the storage is provided. The point about cost is an important one, though. While I'm not willing to think about how to design a backup system that relies on any specific storage provider, it's important that the design of a backup implementation allows users to choose a way to store and access their backups that suits them.
A backup system that costs too much to use, or is not available when the user needs it, is of no use.
Adam Bark points out that the "backup server API" and the actual backup storage need not reside on the same machine. One might, for example, deploy the API on the local machine, but back up to a storage provider. It may even be possible to run the API on one server, but still store the backups with a storage provider. There are important technical problems here that need to be solved to have a backup system that's reliable, robust, and efficient, but they too are interesting problems, and interesting problems are why I'm thinking about backup implementation.
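One way to picture that separation is an API layer that only talks to storage through a small interface, so that the storage can live elsewhere. The sketch below is an assumption-laden illustration in Python: BlobStorage, MemoryStorage, DirectoryStorage, and BackupApi are all hypothetical names, and a local directory stands in for whatever a storage provider would offer.

```python
from pathlib import Path
from typing import Protocol


class BlobStorage(Protocol):
    """Hypothetical interface: all the API layer needs from storage."""

    def write(self, key: str, data: bytes) -> None: ...

    def read(self, key: str) -> bytes: ...


class MemoryStorage:
    """Keeps blobs in a dictionary; useful only for tests."""

    def __init__(self) -> None:
        self._blobs = {}

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def read(self, key: str) -> bytes:
        return self._blobs[key]


class DirectoryStorage:
    """Stores each blob as a file; a stand-in for a remote provider."""

    def __init__(self, root: Path) -> None:
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def write(self, key: str, data: bytes) -> None:
        (self._root / key).write_bytes(data)

    def read(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


class BackupApi:
    """The API layer; it does not know or care where blobs end up."""

    def __init__(self, storage: BlobStorage) -> None:
        self._storage = storage

    def store_chunk(self, chunk_id: str, data: bytes) -> None:
        self._storage.write(chunk_id, data)

    def fetch_chunk(self, chunk_id: str) -> bytes:
        return self._storage.read(chunk_id)
```

Swapping MemoryStorage for DirectoryStorage changes nothing in BackupApi. The hard parts mentioned above, such as reliability and efficiency when the storage is slow or remote, are deliberately left out of this sketch.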
Feedback
I'm not looking for suggestions on what backup software to use. Please don't suggest solutions.
I would be happy to hear other people's thoughts about backup software implementation, or about what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.