This post is part of a series on backup software implementation. See the backup-impl tag for a list of all posts in the series.
Encrypting backups
I want my backups to be encrypted at rest so that if someone gains access to the backup storage they can't see my data. I also want my backups to be signed so that I can trust the data I restore is the data I backed up. This is also called confidentiality and authentication of backed up data.
"At rest" means as stored on disk. I also want transfers to and from a backup server to be encrypted, but that's easy to achieve with TLS or SSH.
AEAD: authenticated encryption with associated data
Doing encryption and signing separately has turned out to be easy to get wrong. Since about the year 2000 there have been ways to achieve both with one operation, using authenticated encryption (AE) or its variant with associated data (AEAD). This is easier to get right. In short, with authenticated encryption, if you can decrypt some data, you can be sure that the decrypted data is what was encrypted.
For AEAD, the two operations are:
- encrypt: (plaintext, key, ad) → (cipher text, authentication tag)
  - the cipher text, authentication tag and ad are stored in backup storage
  - at least some AEAD implementations make the cipher text and authentication tag part of the same output string, but that's an implementation detail; they're conceptually separate
- decrypt: (cipher text, authentication tag, key, ad) → plaintext or error
In other words, you keep the associated data with the cipher text, as you'll need it to decrypt. If the decryption works, you know the associated data is also good (in addition to the encrypted data). You do need to be careful not to trust the associated data until it's been authenticated.
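To make the shape of the two operations concrete, here is a toy stand-in built only from Python's standard library. It is an encrypt-then-MAC construction (a SHAKE-256 keystream XORed with the plaintext, plus an HMAC-SHA256 tag over the associated data and cipher text). Everything about it, including the names `encrypt`, `decrypt` and the 16-byte nonce, is my assumption for illustration. It is not a vetted AEAD; real software should use an established construction such as ChaCha20-Poly1305 or AES-GCM from a reviewed library.

```python
import hashlib
import hmac
import secrets


def _prf(key: bytes, *parts: bytes) -> bytes:
    # HMAC-SHA256 over length-prefixed parts, so part boundaries are unambiguous.
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        mac.update(len(part).to_bytes(8, "big") + part)
    return mac.digest()


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # SHAKE-256 as an extendable-output function, seeded from a derived key.
    return hashlib.shake_256(_prf(key, b"enc", nonce)).digest(length)


def encrypt(plaintext: bytes, key: bytes, ad: bytes) -> tuple[bytes, bytes]:
    """Return (cipher text, authentication tag).

    The random nonce is stored as the first 16 bytes of the cipher text.
    """
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    ciphertext = nonce + bytes(p ^ s for p, s in zip(plaintext, stream))
    return ciphertext, _prf(key, b"mac", ad, ciphertext)


def decrypt(ciphertext: bytes, tag: bytes, key: bytes, ad: bytes) -> bytes:
    """Return the plaintext, or raise ValueError if authentication fails."""
    # Check the tag first: nothing is decrypted unless cipher text and ad are intact.
    if not hmac.compare_digest(tag, _prf(key, b"mac", ad, ciphertext)):
        raise ValueError("authentication failed")
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


key = secrets.token_bytes(32)
ciphertext, tag = encrypt(b"hello backup", key, b"some associated data")
assert decrypt(ciphertext, tag, key, b"some associated data") == b"hello backup"
```

Calling `decrypt` with different associated data, a modified cipher text, or the wrong key raises `ValueError` instead of returning garbage, which is the property the rest of this post relies on.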
For backups, each chunk of user data would be encrypted with AEAD, and the associated data is the checksum of the plain text data. When a backup client de-duplicates data, it splits data into chunks, computes the checksum of each, and searches the backup repository for chunks with that associated data.
When restoring a backup, the client decrypts the chunks, using the checksum. This also authenticates the data: if the decrypt operation fails, the data can't be used.
All this requires storing the checksum for each chunk somewhere. There also need to be ways to keep track of what backups there are, what files each contains, and what chunks belong to each file. We'll not worry about that yet. For now assume it's all done using magic.
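The de-duplication step can be sketched in a few lines. This is a minimal illustration under assumptions of mine: fixed-size chunks, SHA-256 as the checksum, and a plain dictionary standing in for the backup repository. Encryption is left out here so that only the lookup logic shows.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose, so the example is easy to follow


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(data: bytes, repository: dict[str, bytes]) -> list[str]:
    """Store only chunks the repository lacks; return the file's checksum list."""
    checksums = []
    for chunk in split_into_chunks(data):
        checksum = hashlib.sha256(chunk).hexdigest()
        if checksum not in repository:    # look the chunk up by its associated data
            repository[checksum] = chunk  # a real client would store AEAD output here
        checksums.append(checksum)
    return checksums


def restore(checksums: list[str], repository: dict[str, bytes]) -> bytes:
    return b"".join(repository[c] for c in checksums)


repository: dict[str, bytes] = {}
file_chunks = backup(b"aaaabbbbaaaa", repository)
assert len(file_chunks) == 3   # the file consists of three chunks...
assert len(repository) == 2    # ...but only two distinct ones are stored
assert restore(file_chunks, repository) == b"aaaabbbbaaaa"
```

The list returned by `backup` is the part that is "done using magic" for now: something has to remember which checksums make up which file.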
Actually, the associated data for a chunk probably should not be the checksum of the plain text data. That leaks information: an attacker could determine that a file contains a specific document by looking for chunks with the same checksum as the document. Instead, the associated data could be an encrypted version of the checksum, or the result of some other similar transformation. For now, let's not worry about that.
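One possible "similar transformation" is a keyed checksum. The sketch below uses HMAC-SHA256 with a secret key; the function name `chunk_id` and the separate `id_key` are my assumptions, not a settled design. Clients that share the key still compute the same identifier for the same chunk, so de-duplication keeps working, but someone without the key can't compute the identifier of a known document.

```python
import hashlib
import hmac
import secrets


def chunk_id(chunk: bytes, id_key: bytes) -> str:
    """Keyed checksum: stable for a given key, not computable without it."""
    return hmac.new(id_key, chunk, hashlib.sha256).hexdigest()


id_key = secrets.token_bytes(32)
document = b"a well-known document"

# De-duplication still works: same key and data give the same identifier.
assert chunk_id(document, id_key) == chunk_id(document, id_key)

# An attacker who only knows the document computes a different value.
assert chunk_id(document, id_key) != hashlib.sha256(document).hexdigest()
```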
Managing keys
Note that AEAD is a symmetric operation: the same key encrypts and decrypts, so it must be kept secret. To complicate things, the client should support many keys for different groups of chunks. This is especially important so that different clients can share chunks in backup storage.
Imagine Alice and Bob both work for the same spy agency. They both get a lot of the same management reports and documents. They both also have confidential letters that they can't share with each other. It would be ideal if their backup system let them mark which files are confidential and which can be shared, and then the chunks from those files can be shared or not shared with the other.
To implement this, the backup client needs to keep track of several keys. It also needs a way to keep track of which key each chunk is using. All these keys need to be computer generated and entirely random, for security. There is no hope of a user ever remembering any of them.
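As a rough sketch of the bookkeeping, assuming 256-bit keys and a hypothetical `KeyRing` class of my own invention: one table maps key names to random keys, and another maps chunk identifiers to the name of the key used for that chunk.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class KeyRing:
    keys: dict[str, bytes] = field(default_factory=dict)       # key name -> chunk key
    key_of_chunk: dict[str, str] = field(default_factory=dict)  # chunk id -> key name

    def new_key(self, name: str) -> None:
        # 32 random bytes from the operating system's secure random source.
        self.keys[name] = secrets.token_bytes(32)

    def assign(self, chunk_id: str, key_name: str) -> None:
        if key_name not in self.keys:
            raise KeyError(key_name)
        self.key_of_chunk[chunk_id] = key_name

    def key_for(self, chunk_id: str) -> bytes:
        return self.keys[self.key_of_chunk[chunk_id]]


alice = KeyRing()
alice.new_key("shared")        # Bob would hold a copy of this key
alice.new_key("confidential")  # only Alice has this one
alice.assign("chunk-of-management-report", "shared")
alice.assign("chunk-of-private-letter", "confidential")
assert alice.key_for("chunk-of-private-letter") == alice.keys["confidential"]
```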
The keys should be stored in one place, which I tentatively call the "client chunk". This would be encrypted with yet another key, the "client key". The client key is stored in one or more "client credential" chunks, each of which is encrypted with a separate key. This is similar to what the Linux full disk encryption system LUKS uses: the actual disk encryption key is encrypted with various passphrases, each encrypted key stored in a separate key slot. Because LUKS has a fixed amount of space for this, it limits the number of slots (to eight in LUKS1). A backup program does not need to have that limitation: we can let the user have as many client credential chunks as they want.
I'm assuming here that the backup storage allows lookup via the associated data. The client and credential chunks can then be found by using associated data "client-chunk" or "credential-chunk". If there are many matching chunks, the client needs to be able to determine which one it needs. (More magic. Waving my hands frantically.)
If the client chunk is updated to add a new key (or to drop one), the new client chunk is encrypted with the same client key and uploaded to the backup store. All existing client credentials will continue to work. The old client chunk can then be deleted.
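Here is a sketch of that key hierarchy, to check that updating the client chunk really leaves the credentials usable. The `encrypt` and `decrypt` functions are a toy standard-library stand-in for AEAD (not for real use), and the `wrap`/`unwrap` helpers, the JSON encoding of the chunk keys, and the dictionary layout of a stored chunk are all assumptions of mine.

```python
import hashlib
import hmac
import json
import secrets


def _prf(key: bytes, *parts: bytes) -> bytes:
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        mac.update(len(part).to_bytes(8, "big") + part)
    return mac.digest()


def encrypt(plaintext: bytes, key: bytes, ad: bytes) -> tuple[bytes, bytes]:
    nonce = secrets.token_bytes(16)
    stream = hashlib.shake_256(_prf(key, b"enc", nonce)).digest(len(plaintext))
    ciphertext = nonce + bytes(p ^ s for p, s in zip(plaintext, stream))
    return ciphertext, _prf(key, b"mac", ad, ciphertext)


def decrypt(ciphertext: bytes, tag: bytes, key: bytes, ad: bytes) -> bytes:
    if not hmac.compare_digest(tag, _prf(key, b"mac", ad, ciphertext)):
        raise ValueError("authentication failed")
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = hashlib.shake_256(_prf(key, b"enc", nonce)).digest(len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


def wrap(payload: bytes, key: bytes, ad: bytes) -> dict:
    ciphertext, tag = encrypt(payload, key, ad)
    return {"ad": ad, "ciphertext": ciphertext, "tag": tag}


def unwrap(chunk: dict, key: bytes) -> bytes:
    return decrypt(chunk["ciphertext"], chunk["tag"], key, chunk["ad"])


# The client chunk holds the chunk keys and is encrypted with the client key.
client_key = secrets.token_bytes(32)
chunk_keys = {"shared": secrets.token_bytes(32).hex()}
client_chunk = wrap(json.dumps(chunk_keys).encode(), client_key, b"client-chunk")

# Two credential chunks, each encrypting the same client key under its own key.
passphrase_key = secrets.token_bytes(32)  # stand-in for a passphrase-derived key
token_key = secrets.token_bytes(32)       # stand-in for a hardware token key
credential_keys = (passphrase_key, token_key)
credentials = [wrap(client_key, k, b"credential-chunk") for k in credential_keys]

# Add a chunk key: re-encrypt the client chunk with the same client key.
chunk_keys["confidential"] = secrets.token_bytes(32).hex()
client_chunk = wrap(json.dumps(chunk_keys).encode(), client_key, b"client-chunk")

# Either credential still opens the updated client chunk.
for credential, key in zip(credentials, credential_keys):
    recovered = json.loads(unwrap(client_chunk, unwrap(credential, key)))
    assert set(recovered) == {"shared", "confidential"}
```

Only the client chunk was rewritten; the credential chunks were never touched, which is the point of the extra layer.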
right
Data: cylinder "Data" "chunk"
move
move
Client: cylinder "Client" "chunk"
down
move
move
Pass: cylinder "Client" "credential" "passphrase" fit
left
move left from Pass.w
Yubi: cylinder "Client" "credential" "Yubikey" fit
move right from Pass.e
Tpm: cylinder "Client" "credential" "TPM" fit
arrow from Client.w to Data.e "chunk key" below thin
arrow from Pass.n to Client.s "client key" aligned below thin
arrow from Yubi.n to Client.sw "client key" aligned above thin
arrow from Client.se to Tpm.n "client key" aligned above thin <-
There can be any number of client credentials, each of which encrypts the client key using a different method:
- a user-provided passphrase
  - or a key derived from that with a key derivation function
- a hardware key
  - TPM
  - Yubikey challenge/response
- an SSH or OpenPGP key
  - could be stored in a Yubikey or other hardware token
- hopefully there's more
To perform a backup or a restore, the client would need to be able to use any one of the credentials.
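For the passphrase case, the key derivation step might look like the sketch below. It uses PBKDF2-HMAC-SHA256 from Python's standard library because that is always available; the iteration count, the salt size and the function name are assumptions of mine, and a memory-hard function such as scrypt or Argon2 may well be a better choice. The derived key would then be used as the AEAD key for one client credential chunk.

```python
import hashlib
import secrets

ITERATIONS = 600_000  # illustrative work factor; tune for your hardware


def derive_credential_key(
    passphrase: str, salt: bytes, iterations: int = ITERATIONS
) -> bytes:
    """Stretch a passphrase into a 32-byte key for encrypting the client key."""
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32
    )


# The salt is random and not secret; it is stored next to the credential chunk
# so the same key can be derived again at restore time.
salt = secrets.token_bytes(16)
credential_key = derive_credential_key("correct horse battery staple", salt)
assert len(credential_key) == 32
```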
An interesting possible evolution of the above scheme might be to have some of the credentials be split using a secret sharing setup: for normal use, the TPM credential might be used (but it would only enable making new backups and restoring backups, not deleting backups). For more unusual situations, you might need both a passphrase and a Yubikey credential. An unusual operation might be to delete backups, or to adjust the set of data chunk keys a client has.
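The simplest form of secret sharing is a two-way XOR split, where both shares are required and either share alone is just random bytes. The sketch below shows only that case, with hypothetical function names; needing any k out of n shares would call for a threshold scheme such as Shamir's, which this is not.

```python
import secrets


def split_key(key: bytes) -> tuple[bytes, bytes]:
    """Split a key into two shares; both are needed to rebuild it."""
    share_a = secrets.token_bytes(len(key))
    share_b = bytes(k ^ a for k, a in zip(key, share_a))
    return share_a, share_b


def join_key(share_a: bytes, share_b: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(share_a, share_b))


privileged_key = secrets.token_bytes(32)
for_passphrase, for_yubikey = split_key(privileged_key)
assert join_key(for_passphrase, for_yubikey) == privileged_key
```

Each share would be stored in its own credential chunk, so that the privileged key can only be rebuilt when both credentials are presented.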
Summary
Backup:
- get one or more credentials from user to decrypt the client key
- get and decrypt the client chunk, using the client key
  - fail if this gives an error
- encrypt each new chunk with the right chunk key
- store the cipher text, authentication tag, and associated data in backup storage
Restore:
- get one or more credentials from user to decrypt the client key
- get and decrypt the client chunk, using the client key
  - fail if this gives an error
- for each chunk that needs to be restored, decrypt it using the right chunk key and associated data, making sure this works
  - fail if this gives an error
There are a lot of steps skipped here, but this is the shape of my current thinking about backup encryption. I am, however, not an expert on this, so I expect to get feedback telling me how to do this better.
Feedback
I'm not looking for suggestions on what backup software to use. Please don't suggest solutions.
I would be happy to hear other people's thoughts about backup software implementation. Or what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.