This post is part of a series on backup software implementation. See the backup-impl tag for a list of all posts in the series.
This is another grab bag of random topics.
Snapshots vs deltas
It is common to talk about “full backups” that are complete, self-standing copies of the data versus “incremental backups” that only contain the changes since the previous backup. Being someone who has implemented backup software, I prefer to talk about “snapshot” versus “delta” backups.
In a backup system based on snapshots, each backup looks like a complete, self-standing backup, even if it’s implemented in a way where common data in several backups is only stored once. One way to implement this is to store each unique chunk of data only once, and each backup refers to the chunks in the files in that backup.
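As a rough sketch of the snapshot model (with invented names, not the design of any particular backup program), the idea is that each backup lists all the chunks of all its files, while a shared chunk store holds each unique chunk only once:

```rust
use std::collections::HashMap;

/// Identifier for a stored chunk; in a real system this would likely
/// be a checksum of the chunk contents.
type ChunkId = u64;

/// A snapshot backup: every file lists all of its chunks, so the
/// backup is self-standing and can be restored on its own.
struct Snapshot {
    files: HashMap<String, Vec<ChunkId>>,
}

/// Shared storage: each unique chunk is stored once, no matter how
/// many snapshots refer to it.
struct ChunkStore {
    chunks: HashMap<ChunkId, Vec<u8>>,
}

impl ChunkStore {
    /// Restore one file from one snapshot: fetch the chunks that the
    /// snapshot refers to, and nothing else. No other backup is needed.
    fn restore_file(&self, snapshot: &Snapshot, name: &str) -> Option<Vec<u8>> {
        let chunk_ids = snapshot.files.get(name)?;
        let mut data = Vec::new();
        for id in chunk_ids {
            data.extend_from_slice(self.chunks.get(id)?);
        }
        Some(data)
    }
}
```

Restoring from any one snapshot only needs the chunks it refers to; no chain of earlier backups has to be replayed.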
In one based on deltas, each incremental backup is a “delta” against a previous backup. Delta is used here in the mathematical sense of difference: the new backup might store a new file completely, but only the changed parts of a changed file.
The big difference, from my point of view, is that to restore a backup using snapshots is straightforward, but to restore using deltas you start from a full backup and then apply all the deltas needed to get the latest state. Applying deltas can be slower, and is often trickier to implement. “Tricky” is a technical term in software engineering that means “more likely to be wrong”.
In my opinion, deltas made a lot of sense for tape based backups: you have to at least seek past all the previous backups on the same tape anyway, so you may as well restore deltas on the way. However, for backups stored on random access storage, such as hard drives, snapshots make a lot more sense.
Snapshots are even more important if you want to remove specific backups to recover space. This is very tricky with deltas, but can be quite straightforward with snapshots. (I say this as someone who has implemented this.)
For myself, I would only consider snapshots. This is influenced by my strong dislike of tape drives.
If you like tapes, by all means use them. If you want me to implement backup software that uses tapes for storage, the price is going to be higher.
File system deltas
File systems such as ZFS and btrfs support file system deltas. The file system itself constructs the delta, which can be exported as a regular file. The delta can be applied to another file system of the same type.
This can work really well, and it can be quite efficient.
However, I am personally not interested in requiring the same file system type to be used when restoring. I entirely reject this approach for any backup system I may or may not implement in the future.
Again, this is my personal choice. If you’re happy with file system deltas, use them. My preference doesn’t matter in that case.
Using rsync and directory trees of hard links
I have used, successfully and for years, directory trees of hard linked files. This means that each backup is a directory (e.g., 2024-12-24, 2024-12-25, etc.). Every file (anything except directories) that is unchanged from the previous backup is stored as a hard link to the same file in the previous backup.
The core of this is approximately:
$ rm -rf $new
$ cp -al $old $new
$ rsync -a --del $HOME/. $new/.

This can work OK. The hard linking saves a ton of space, compared to storing each backup in full. Browsing old backups means looking at files on disk.
It’s also very easy to set up. The shell snippet above is almost everything you need. There are plenty of variants of this online, if you don’t want to make your own.
However, even though it’s my go-to approach for backups that don’t rely on complex backup software, it’s not something I particularly like. The main problem is that I have millions of precious files, and if each backup has all of them (even if hard linked), it becomes cumbersome to move backups to new storage, or even to remove old backups.
It turns out that dealing with very large numbers of files is not fun. Even when tools can cope, they are often slow. For example, I’ve not managed to use rsync to transfer a few hundred daily backup directories from one server to another: it always runs out of memory.
Even deleting a few hundred million hard links is slow.
I’d prefer a backup implementation that didn’t store each precious file as a separate file, but on the other hand, that is not going to be as simple as cp -al and rsync -a --del.
Feedback
I’m not looking for suggestions on what backup software to use. Please don’t suggest solutions.
I would be happy to hear other people’s thoughts about backup software implementation. Or what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.
Solved: this works: -net nic,model=virtio -net user
My goal
I want to run a Debian cloud image with qemu-system-x86_64 so that the guest operating system has network access using user mode networking.
Context
I want to use this as part of my CI system (Ambient). I would prefer to not use TUN/TAP networking, or to set up a bridge on the host. I’m aware that user mode networking with QEMU is constrained and limited, and I’m OK with that.
I will explore the other options if user mode networking proves to be impossible.
What I’ve done
I’ve attached the script I’ve been experimenting with. To run it, give it two arguments: the URL to the cloud image published by Debian, and the local filename where to store it.
You’ll need the following installed:
- wget
- qemu-img
- qemu-system-x86_64
- genisoimage
You can run the script as an unprivileged user. It will run faster if you can use the Linux kernel kvm module, but that isn’t required. On my laptop the script takes about three minutes to run, assuming the image has been downloaded already.
What the script does is set up cloud-init to run ip a, and then run a VM with the cloud image and the cloud-init configuration, with two virtual serial ports directed to files console.log and run.log. The ip command output goes to the second one.
What I want is for there to be a virtual Ethernet device. I’ve not been able to make that happen. The ip output in run.log, for me, is:
xyzzy from bootcmd to ttyS1
xyzzy from runcmd to ttyS1
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
This lists only the lo, or loopback, network interface. I’d like it to list eth0 or another device via which the guest operating system can connect to the public Internet. For now, I’d be happy just to have the device: once I have that, I can tackle the problem of getting it to have an IP address and outgoing Internet connectivity.
(The xyzzy lines are just for debugging. You can ignore them.)
My specific request
How should I change the script so that the ip output lists the network device I want?
I’ve tried many different variations of network related QEMU options, but nothing seems to give me what I want. I’m sure I’m doing something wrong, and I’m hoping it’s obvious to someone else.
Can you help me?
You can respond by email (liw@liw.fi), or on this fediverse thread, or in any other way you can reach me.
The script
#!/bin/bash
#
# Usage: scripts/qemu.sh https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-genericcloud-amd64.qcow2 /tmp/debian12.qcow2
set -euo pipefail
# I use the generic Debian 12 image from
# https://cloud.debian.org/images/cloud/bookworm/latest/
url="$1"
image="$2"
if [ ! -e "$image" ]; then
wget -c "$url" -O "$image"
chmod a-w "$image"
fi
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
# Make a copy-on-write image, on top of the base image. This allows us
# to modify things in the file system inside the running VM, without
# affecting the base image.
qemu-img create -b "$image" -F qcow2 -f qcow2 "$tmp/vm.qcow2"
# Create a cloud-init local data source: an ISO image of a specific
# form. The user-data runcmd writes to the run log (ttyS1) what
# network interfaces are known, with the ip command.
mkdir "$tmp/cloud-init"
cat <<EOF >"$tmp/cloud-init/user-data"
#cloud-config
bootcmd:
- echo xyzzy from bootcmd to ttyS0 > /dev/ttyS0
- echo xyzzy from bootcmd to ttyS1 > /dev/ttyS1
runcmd:
- echo xyzzy from runcmd to ttyS0 > /dev/ttyS0
- echo xyzzy from runcmd to ttyS1 > /dev/ttyS1
- ip a > /dev/ttyS1
- poweroff
EOF
cat <<EOF >"$tmp/cloud-init/meta-data"
hostname: ambient
EOF
cat <<EOF >"$tmp/cloud-init/network-config"
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
EOF
genisoimage -quiet -volid CIDATA -joliet -rock -output "$tmp/cloud-init.iso" "$tmp/cloud-init"
# Run the VM.
qemu-system-x86_64 \
-m 2048 \
-display none \
-serial file:console.log \
-serial file:run.log \
-drive format=qcow2,if=virtio,file="$tmp/vm.qcow2" \
-cdrom "$tmp/cloud-init.iso" \
-nic user
This post is part of a series on backup software implementation. See the backup-impl tag for a list of all posts in the series.
For this part I have not been able to allocate enough time or energy to do deep thinking, so I’m going to list some ideas that I think are important. I may return to them later.
Content sensitive chunking
My understanding is that both Borg backup and Restic split files into chunks in a way that finds duplicate chunks regardless of where in a file the chunk is. This can make de-duplication much more efficient.
I think the basic approach is to compute a weak, rolling checksum at every byte position, and when the low N bits of the checksum value are zero, that’s the end of a chunk.
I’ve not implemented this myself, but I hear it works well.
I don’t know of any research into how well it works. I’d be interested in reading about what checksum algorithm, with what value of N, works best for which type of data. If nobody has researched this yet, I think it’d make an interesting topic for a BSc or MSc thesis. (If you know of such research, I would appreciate a pointer!)
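To make the idea concrete, here is a toy sketch of finding chunk boundaries with a rolling checksum. Real implementations, as I understand them, use a proper windowed rolling hash (such as buzhash or a Rabin fingerprint) and enforce minimum and maximum chunk sizes; the plain windowed sum below is only meant to show the shape of the algorithm.

```rust
/// Return the positions in `data` where a chunk ends: wherever the
/// low `n_bits` bits of a rolling checksum over the last `window`
/// bytes are zero. (The final chunk simply runs to the end of the data.)
fn chunk_boundaries(data: &[u8], window: usize, n_bits: u32) -> Vec<usize> {
    let mask: u32 = (1 << n_bits) - 1;
    let mut boundaries = Vec::new();
    let mut sum: u32 = 0;
    for (i, &byte) in data.iter().enumerate() {
        sum = sum.wrapping_add(byte as u32);
        if i >= window {
            // Drop the byte that just left the window.
            sum = sum.wrapping_sub(data[i - window] as u32);
        }
        // With a well-distributed rolling hash this fires roughly
        // once every 2^n_bits bytes; the toy sum used here is not
        // well distributed, but the structure is the same.
        if i + 1 >= window && sum & mask == 0 {
            boundaries.push(i + 1);
        }
    }
    boundaries
}
```

The useful property is that a boundary depends only on the bytes inside the window, so identical data tends to be split into the same chunks even when it is preceded by different data.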
Real time backups
In an ideal situation, backups can happen while a computer is in use. If you are in a meeting, or just working at your desk, or in a cafe, backups happen while you work. When you’re ready to leave, you suspend or turn off your computer and any work you’ve done is already backed up.
There are many technical problems to solve to achieve this, but it’s an interesting goal.
Read time restores
When you need to restore all of a backup, such as when setting up a new computer, the process can take a very long time. In a calm, serene situation, it’s easy to wait for that to happen. It’s an opportunity to have some tea, and contemplate the beauty of a flower. For those who need to restore their data to prevent the apocalypse, it would be convenient to be able to start using the computer as soon as possible.
This can be achieved by at least two different approaches:
- First restore the bits that you need now, then let the rest be restored at leisure. This would be fairly simple to implement, but requires knowing what you need first.
- Have a way to use the backed up data without restoring it, such as by mounting the backup as an external disk. This is again fairly simple to implement, for read-only use.
For read-write normal use, a more sophisticated and complicated (and thus fragile and error prone) approach could be developed: an overlay file system on top of the new computer, mounted on top of the file system where data is being restored. When you use a file that’s not yet restored, it gets read from the backup. If it has already been restored, it’s served from the local disk. If you write to a file, it’s done in a copy-on-write manner. Any file you use gets bumped to the head of the restore queue.
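As a small sketch of just the “bump to the head of the restore queue” part (all names invented, nothing here is a design commitment): the background restorer works through a queue of files, and any file the user touches jumps to the front.

```rust
use std::collections::VecDeque;

/// Files still waiting to be restored, in the order they will be
/// restored.
struct RestoreQueue {
    pending: VecDeque<String>,
}

impl RestoreQueue {
    /// Called when the user accesses a file that has not been
    /// restored yet: move it to the front so it is restored next.
    fn bump(&mut self, path: &str) {
        if let Some(pos) = self.pending.iter().position(|p| p == path) {
            let item = self.pending.remove(pos).unwrap();
            self.pending.push_front(item);
        }
    }

    /// The background restorer asks for the next file to restore.
    fn next(&mut self) -> Option<String> {
        self.pending.pop_front()
    }
}
```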
The happy scenario is this:
- You get a new laptop.
- You install an operating system. For myself I’ve managed to automate this and make it fast, as little as five minutes, for a minimal installation.
- Start the process of restoring all the data in your home directory.
- Log in and start using the computer normally. The restore happens in the background without interfering with your use of the computer.
- From getting the laptop to being able to use it takes only a few minutes.
This is probably quite difficult to implement. I don’t expect to even think about it any time soon, but it’s an interesting problem.
Backup server with mutually distrusting users
Imagine Alice and Bob, who have a deep, mutual hatred of each other. They would both gladly do some things to annoy the other, or to cause the other to lose data, or access to their backups.
Can they trust the same backup server? Under what conditions? Can they, even, share backed up data, without opening an attack vector for the other?
I don’t know. This is, again, an interesting problem. I’m not going to think about it until I have thought about implementing backups for people who have mutual trust.
Trusting a backup server
Speaking of trust, if you trust all other users of a backup server, but you don’t run the server yourself, how much and in what ways do you have to trust the server and its operator?
I think the following are going to be necessary at minimum:
- You trust that the server doesn’t remove data on its own authority. Ideally, the backup software can verify that the backup storage contains all the backups that the user expects it to contain, but this is tricky to achieve in a scenario where the user has lost everything, except access to the backup storage.
- You trust that the server or backup storage is available when you need it to be, to make a new backup or to restore data.
There may be more, but that’s what I can think of so far.
I think the following don’t require trusting the server:
- The server doesn’t modify backups stored on it. This is easy to guard against using encryption.
- The server doesn’t inspect or leak backed up data. Again, encryption (specifically, client-side encryption) guards against this.
Security is difficult, but important. Ideally, I’d develop threat models and such for this, but we’ll see. I’m not a security expert, but this is where my thinking currently stands.
Feedback
I’m not looking for suggestions on what backup software to use. Please don’t suggest solutions.
I would be happy to hear other people’s thoughts about backup software implementation. Or what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.
This post is part of a series on backup software implementation. See the backup-impl tag for a list of all posts in the series.
Encrypting backups
I want my backups to be encrypted at rest so that if someone gains access to the backup storage they can’t see my data. I also want my backups to be signed so that I can trust the data I restore is the data I backed up. This is also called confidentiality and authentication of backed up data.
“At rest” means as stored on disk. I also want transfers to and from a backup server to be encrypted, but that’s easy to achieve with TLS or SSH.
AEAD: authenticated encryption with associated data
Doing encryption and signing separately has turned out to be easy to get wrong. Since about the year 2000 there have been ways to achieve both with one operation, using authenticated encryption, or its variant with associated data, AEAD. This is easier to get right. In short, with authenticated encryption, if you can decrypt some data, you can be sure that the decrypted data is what was encrypted.
For AEAD, the two operations are:
- encrypt(plaintext, key, ad) → (cipher text, authentication tag)
  - the cipher text, authentication tag, and ad are stored in backup storage
  - at least some AEAD implementations make the cipher text and authentication tag part of the same output string, but that’s an implementation detail; they’re conceptually separate
- decrypt(cipher text, authentication tag, key, ad) → plaintext or error
In other words, you keep the associated data with the cipher text, as you’ll need it to decrypt. If the decryption works, you know the associated data is also good (in addition to the encrypted data). You do need to be careful not to trust the associated data until it’s been authenticated.
For backups, each chunk of user data would be encrypted with AEAD, and the associated data is the checksum of the plain text data. When a backup client de-duplicates data, it splits data into chunks, computes the checksum of each, and searches the backup repository for chunks with that associated data.
When restoring a backup, the client decrypts the chunks, using the checksum. This also authenticates the data: if the decrypt operation fails, the data can’t be used.
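To make the two operations concrete, here is a minimal sketch using the RustCrypto chacha20poly1305 crate as one possible AEAD implementation. The crate choice, and the way the key and nonce are passed in, are mine and only for illustration; real code needs careful key and nonce management.

```rust
use chacha20poly1305::{
    aead::{Aead, Error, KeyInit, Payload},
    ChaCha20Poly1305, Key, Nonce,
};

/// Encrypt one chunk, binding it to its plain text checksum via the
/// associated data. This crate appends the authentication tag to the
/// cipher text it returns.
fn encrypt_chunk(
    key: &[u8; 32],
    nonce: &[u8; 12],
    plaintext: &[u8],
    checksum_ad: &[u8],
) -> Result<Vec<u8>, Error> {
    let cipher = ChaCha20Poly1305::new(Key::from_slice(key));
    cipher.encrypt(
        Nonce::from_slice(nonce),
        Payload { msg: plaintext, aad: checksum_ad },
    )
}

/// Decrypt one chunk. This fails unless both the cipher text and the
/// associated data are exactly what was used when encrypting, which
/// is what authenticates the restored data.
fn decrypt_chunk(
    key: &[u8; 32],
    nonce: &[u8; 12],
    ciphertext: &[u8],
    checksum_ad: &[u8],
) -> Result<Vec<u8>, Error> {
    let cipher = ChaCha20Poly1305::new(Key::from_slice(key));
    cipher.decrypt(
        Nonce::from_slice(nonce),
        Payload { msg: ciphertext, aad: checksum_ad },
    )
}
```

The property that matters for backups shows up in decrypt_chunk: if either the cipher text or the checksum used as associated data has been tampered with, decryption returns an error instead of data.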
All this requires storing the checksum for each chunk somewhere. There also need to be ways to keep track of what backups there are, what files each contains, and what chunks belong to each file. We’ll not worry about that yet. For now, assume it’s all done using magic.
Actually, the associated data for a chunk probably should not be the checksum of the plain text data. That leaks information: an attacker could determine that a file contains a specific document by looking for chunks with the same checksum as the document. Instead, the associated data could be an encrypted version of the checksum, or the result of some other similar transformation. For now, let’s not worry about that.
Managing keys
Note that AEAD is a symmetric operation: the key must be kept secret. To complicate things, the client should support many keys for different groups of chunks. This is important especially so that different clients can share chunks in backup storage.
Imagine Alice and Bob both work for the same spy agency. They both get a lot of the same management reports and documents. They both also have confidential letters that they can’t share with each other. It would be ideal if their backup system let them mark which files are confidential and which can be shared, and then the chunks from those files can be shared or not shared with the other.
To implement this, the backup client needs to keep track of several keys. It also needs a way to keep track of which key each chunk is using. All these keys need to be computer generated and entirely random, for security. There is no hope of a user ever remembering any of them.
The keys should be stored in one place, which I tentatively call the “client chunk”. This would be encrypted with yet another key, the “client key”. The client key is stored in one or more “client credential” chunks, each of which is encrypted with a separate key. This is similar to what the Linux full disk encryption system LUKS uses: the actual disk encryption key is encrypted with various passphrases, each encrypted key stored in a separate key slot. Because LUKS has a fixed amount of space for this, it limits the slots to eight. A backup program does not need to have that limitation: we can let the user have as many client credential chunks as they want.
I’m assuming here that the backup storage allows lookup via the associated data. The client and credential chunks can then be found by using associated data “client-chunk” or “credential-chunk”. If there are many matching chunks, the client needs to be able to determine which one it needs. (More magic. Waving my hands frantically.)
If the client chunk is updated to add a new key (or to drop one), the new client chunk is encrypted with the same key and uploaded to the backup store. All existing client credentials will continue to work. The old client chunk can then be deleted.
There can be any number of client credentials, each of which encrypts the client key using a different method:
- a user-provided passphrase
- or a key derived from that with a key derivation function
- a hardware key
- TPM
- Yubikey challenge/response
- an SSH or OpenPGP key
- could be stored in a Yubikey or other hardware token
- hopefully there’s more
To perform a backup or a restore, the client would need to be able to use any one of the credentials.
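To keep the pieces straight, here is a sketch of the data involved. All names and layouts are invented for illustration; none of this is a committed design.

```rust
use std::collections::HashMap;

/// A symmetric key for encrypting a group of data chunks. Always
/// generated randomly; never something a user could remember.
type ChunkKey = [u8; 32];

/// The key that encrypts the client chunk.
type ClientKey = [u8; 32];

/// The "client chunk": the one place where all chunk keys live.
/// Stored in backup storage, encrypted with the client key.
struct ClientChunk {
    /// Which key to use for which group of chunks, for example
    /// "shared reports" versus "confidential letters".
    chunk_keys: HashMap<String, ChunkKey>,
}

/// A "client credential" chunk: the client key, wrapped (encrypted)
/// with one particular method. There can be any number of these,
/// much like LUKS key slots, but without the fixed limit of eight.
enum ClientCredential {
    /// Client key wrapped with a key derived from a passphrase.
    Passphrase { salt: Vec<u8>, wrapped_client_key: Vec<u8> },
    /// Client key wrapped by a hardware token (TPM, Yubikey, ...).
    HardwareToken { token_id: String, wrapped_client_key: Vec<u8> },
}
```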
An interesting possible evolution of the above scheme might be to have some of the credentials be split using a secret sharing setup: for normal use, the TPM credential might be used (but it would only enable making new backups and restoring backups, not deleting backups). For more unusual situations, you might need both a passphrase and a Yubikey credential. An unusual operation might be to delete backups, or to adjust the set of data chunk keys a client has.
Summary
Backup:
- get one or more credentials from user to decrypt the client key
- get and decrypt the client chunk, using the client key
- fail if this gives an error
- encrypt each new chunk with the right chunk key
- store the cipher text, authentication tag, and associated data in backup storage
Restore:
- get one or more credentials from user to decrypt the client key
- get and decrypt the client chunk, using the client key
- fail if this gives an error
- for each chunk that needs to be restored, decrypt it using the right chunk key and associated data, making sure this works
- fail if this gives an error
There are a lot of steps skipped in this, but this is the shape of my current thinking about backup encryption. I am, however, not an expert on this, so I expect to get feedback telling me how to do this better.
Feedback
I’m not looking for suggestions on what backup software to use. Please don’t suggest solutions.
I would be happy to hear other people’s thoughts about backup software implementation. Or what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.
You may have seen the following error message on a Unix system:
No such file or directory
This comes from the Unix standard library for the C language. It's the textual representation of the errno value ENOENT. errno is a global variable provided by the C standard library, holding the error code that specifies the cause of failure for a system call.
In Unix, when a program asks the operating system kernel to do something, it does this by using a system call. This might be something like "open a file for reading". For this blog post, it's "run this program".
If the system call fails, it returns an error indicator. For most system calls the return value is an integer and -1 tells the caller the system call failed. The indicator value does not tell the caller why the system call failed. Perhaps the program doesn't exist? Perhaps it exists, but the user does not have permission to run it?
The reason for the failure is stored in the errno global variable by the C standard library. Other languages provide ways of getting the value. The errno value is also an integer, and there are C macros defined in the errno.h include file for the various possible values, and the linked Wikipedia page has examples. The integer value can be translated into a static textual message using the C standard library function strerror(3).
A single integer can't describe the cause of the problem in much detail. Unix programs are meant to know what system call they used and what arguments they gave it, and to use this to construct a useful error message. For example, if the program was opening a file, it should combine the name of the file it tried to open with the text from strerror to produce an error message like:
Failed to run /does/not/exist: No such file or directory
The C standard library does not make it easy to construct such error messages. Thus, most Unix programs just print out the text returned by strerror without additional information.
If you think this is a convoluted, sub-par way of dealing with errors, and that it's insane and stupid that this situation has lasted largely unchanged since the 1970s, I'm not going to stop you.
Rust is not better by default. If you use the Rust std::process::Command data type to run a program, without taking care, you'll end up with the same error message:
No such file or directory (os error 2)
There is the additional information of "os error code 2", which is the errno value, but that's of no use to a user.
The above message comes from this Rust snippet:
use std::process::Command;

fn main() {
    match Command::new("/does/not/exist").output() {
        Ok(_output) => {
            panic!("this was not meant to succeed");
        }
        Err(e) => eprintln!("{e}"),
    }
}
As a user I'd like to know at least what the program I ran tried to do, and with what file. To achieve this in Rust, we need to inspect the returned error in more detail. It's doable with the Rust standard library, but it does require a little effort.
Below is an example of how to do it. It uses thiserror to define an error type specific for this, but no other crates apart from the standard library. It's not perfect, and the error messages can certainly be improved, but it's a start.
// This is an example of handling errors when executing another
// program. We use the `std::process::Command` type in the standard
// library to do this, but add a little wrapper to handle the various
// errors that can go wrong. As I only use Unix, this is a little Unix
// specific.
//
// I wrote this because I was annoyed by programs saying "No such file
// or directory" when trying to run a program that doesn't exist, and
// I wanted to make sure Rust programs can do better.
use std::{
// We need this trait to handle underlying errors, when writing
// out error messages.
error::Error,
// The program name is technically an OsString. We avoid the
// simplifying assumption that it is UTF8, because there is no
// guarantee that the name of a program is UTF8, in Unix.
ffi::OsString,
// We need these to actually invoke programs.
process::{Command, Output, Stdio},
};
// On Unix, we need this to find out which signal terminated the
// program.
#[cfg(unix)]
use std::os::unix::process::ExitStatusExt;
// This main program exercises the spawn function we have below.
fn main() {
// Create a `Command` to run the first command line argument, but
// don't start running it yet.
let mut cmd = Command::new(std::env::args().nth(1).unwrap());
// Optionally, redirect the command's standard I/O. Here we close
// stdin, and capture stdout and stderr. Any other setup can be
// done at this point as well. This is a separate variable from
// `cmd`, because the methods we use return a mutable reference to
// `cmd`, instead of transferring ownership.
let cmd2 = cmd
.stdin(Stdio::null())
.stdout(Stdio::piped())
.stderr(Stdio::piped());
// Actually run the command and wait for it to terminate. The
// `spawn` function starts running the program, if that is
// possible.
match spawn(cmd2) {
// All went well.
Ok(output) => {
let stdout = String::from_utf8_lossy(&output.stdout);
let stderr = String::from_utf8_lossy(&output.stderr);
println!("captured stdout: {stdout:?}");
println!("captured stderr: {stderr:?}");
println!("All good");
}
// Program ran, but it failed. Report that and also its
// captured stdout and stderr, assuming we set that up above.
Err(CommmandError::CommandFailed {
program,
exit_code,
output,
}) => {
eprintln!("ERROR: {program:?} failed: {exit_code}");
let stdout = String::from_utf8_lossy(&output.stdout);
let stderr = String::from_utf8_lossy(&output.stderr);
println!("captured stdout: {stdout:?}");
println!("captured stderr: {stderr:?}");
}
// Report any other error, including its underlying problem,
// if any.
Err(e) => {
eprintln!("ERROR: failed to run program: {e}");
let mut e = e.source();
while e.is_some() {
let underlying = e.unwrap();
eprintln!("caused by: {underlying}");
e = underlying.source();
}
std::process::exit(42);
}
}
}
// Given a `Command` that has been set up, spawn the child process,
// and wait for it to finish, then handle the result.
fn spawn(cmd: &mut Command) -> Result<Output, CommmandError> {
let child = match cmd.spawn() {
// Child process started running OK.
Ok(child) => child,
// Program does not exist.
Err(err) if err.kind() == std::io::ErrorKind::NotFound => {
return Err(CommmandError::NoSuchCommand(cmd.get_program().into()))
}
// We lack permission to run program.
Err(err) if err.kind() == std::io::ErrorKind::PermissionDenied => {
return Err(CommmandError::NoPermmission(cmd.get_program().into()))
}
// Other problem prevented the program from starting.
Err(err) => return Err(CommmandError::Other(cmd.get_program().into(), err)),
};
// Wait for child to terminate, and capture its output, in case
// that was set up.
match child.wait_with_output() {
// Child terminated, but it may have failed.
Ok(output) => {
// Did the child terminate due to a signal?
#[cfg(unix)]
if let Some(signal) = output.status.signal() {
return Err(CommmandError::KilledBySignal(
cmd.get_program().into(),
signal,
));
}
// Did the child terminate with a non-zero exit code?
if let Some(code) = output.status.code() {
if code != 0 {
return Err(CommmandError::CommandFailed {
program: cmd.get_program().into(),
exit_code: code,
output,
});
}
}
// At this point we know the child terminated, because we
// used `wait_with_output`. We also know that it didn't
// fail, because it didn't get killed by a signal, and it
// didn't have a non-zero exit code.
assert!(output.status.success());
Ok(output)
}
// Something unexpected went wrong.
Err(err) => Err(CommmandError::Other(cmd.get_program().into(), err)),
}
}
// All the errors that can go wrong when running a program. We embed
// the name of the program in the error variants, so it doesn't get
// lost. This makes for better error messages in my opinion.
#[derive(Debug, thiserror::Error)]
enum CommmandError {
// The program doesn't exist. Or, possibly, the program specifies
// an interpreter or shared library that does not exist. The
// operating system doesn't tell us which.
#[error("command does not exist: {0:?}")]
NoSuchCommand(OsString),
// The program exists, but we lack the permission to run it. On
// Unix, this means the program file lacks the x bit for us.
#[error("no permission to run command: {0:?}")]
NoPermmission(OsString),
// The program ran, but terminated with a non-zero exit code. Note
// that this error variant includes any captured stdout and stderr
// output, if the `Command` was created to capture them.
#[error("command failed: {program:?}: exit code {exit_code}")]
CommandFailed {
program: OsString,
exit_code: i32,
output: Output,
},
// The program ran, but was terminated by a signal.
#[cfg(unix)]
#[error("command was terminated by signal {1}: {0:?}")]
KilledBySignal(OsString, i32),
// There was some other error. There can be any number of errors,
// and over time they can vary. We can't know everything, so we
// have a catchall error variant to handle the unknowns.
#[error("unknown error while running command: {0:?}")]
Other(OsString, #[source] std::io::Error),
}
I've not tried to encapsulate this into a library. Maybe some of the myriad existing Rust libraries for running programs already do this.
This post is part of a series on backup software implementation. See the backup-impl tag for a list of all posts in the series.
Export and import of backups
If one cares about the longevity of backed up data, it seems sensible to prepare for the inevitable situation when fundamental decisions made for existing backups need to be changed:
- What backup software is used?
- How is data split into chunks? How big are the chunks?
- How is data compressed?
- How is data encrypted?
If, say, a new compression algorithm is developed that results in significantly smaller compressed data, one may want to re-compress all one’s existing backups. Or one may want to switch to a new encryption method that’s more secure than what has been used so far.
Or one may find much better backup software in the future.
For these and other reasons, one may want to convert one’s existing backups to a new form. This is a problem that version control systems have had for a while, and the same approach would work for backups: an “export format” that’s independent of the backup software (see [git fast-export](https://git-scm.com/docs/git-fast-export) for an example).
Thus, if one backup system can export existing backups in a common format, and another can import, then converting backups should be quite easy. (For version control systems, there’s a lot of history and details that vary between systems that make this somewhat difficult, but in principle it’s easy.)
I have not designed a backup export format yet. It’s too early for that, I think, even if I first had this idea years ago. The first step would be to gather needs and wants, and that is a job in itself. My current list:
- the format should enable streaming, to avoid needing large amounts of backup space
- likewise, the format should enable incremental conversion
- the format should allow filtering
- e.g., to drop all cat pictures
Example:
obnam1 export --all | filter-out-cat-photos | obnam2 import
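I have not designed the format, so the following is purely an invented illustration of the kind of records a streaming, filterable export could carry; an exporter writes records one at a time, a filter can drop or rewrite them as they pass through, and an importer never needs a whole backup in memory.

```rust
/// One record in a hypothetical backup export stream.
enum ExportRecord {
    /// Start of one backup (one snapshot) in the stream.
    BackupStart { id: String, timestamp: String },
    /// Metadata for one file in the current backup. A filter that
    /// drops cat photos would drop this record and the data records
    /// that follow it.
    File { path: String, mode: u32, size: u64 },
    /// A piece of the current file's contents.
    FileData { data: Vec<u8> },
    /// End of the current backup.
    BackupEnd,
}
```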
I’m sure people can come up with any number of innovative ways to use such a filtering system. For me, I like the export/import approach because it allows me to change my backup parameters after the fact, and breaks the lock-in to the backup system I’ve chosen to use. I do not, however, know of any implementation of the concept.
Feedback
I’m not looking for suggestions on what backup software to use. Please don’t suggest solutions.
I would be happy to hear other people’s thoughts about backup software implementation. Or what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.
This post is part of a series on backup software implementation. See the backup-impl tag for a list of all posts in the series.
Updates on previous points
I had some useful feedback to my previous two posts.
Hash function
Ed Davies asked why the hash function needs to be cryptographically secure. I realized that I mixed up two things: accidental collisions, which don’t need security, and attacks, which do. For avoiding accidental collisions, any strong hash function will do, such as MD5.
However, because a backup program can’t safely assume the data it operates on is benign, it needs to be secure against malicious data provided by an attacker. Web browsers, local mail user agents, file downloads, etc, are ways in which an attacker may inject malicious data on a user’s system. In this context, the malicious data would be data constructed to cause a hash collision with the data that the user has otherwise.
If the backup software only relies on the hash function, the malicious data might prevent valuable data from being backed up. The two ways I know of to prevent that are to use a cryptographically secure hash function, or to compare data when hashes match. Data comparison can be very expensive, as it requires downloading backed up data from the server. Thus, unless the user is willing to pay the cost of comparison, using a cryptographically secure hash function makes sense.
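As a sketch of that trade-off, here is roughly what the de-duplication check might look like, using the sha2 crate for the hash; the lookup function that stands in for backup storage is invented for this example.

```rust
use sha2::{Digest, Sha256};

/// Decide whether `data` duplicates an already stored chunk. The
/// `lookup` closure stands in for whatever the backup storage offers
/// for fetching a stored chunk by its hash.
fn is_duplicate(
    data: &[u8],
    compare_bytes: bool,
    lookup: impl Fn(&[u8]) -> Option<Vec<u8>>,
) -> bool {
    let hash = Sha256::digest(data);
    match lookup(hash.as_slice()) {
        None => false,
        // Trusting the hash alone is enough against accidental
        // collisions; also comparing the bytes defends against
        // deliberately constructed collisions, at the cost of
        // fetching the stored chunk from the server.
        Some(stored) => !compare_bytes || stored == data,
    }
}
```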
Storage location
Jonathan McDowell raised the point that where backups are stored can be crucial. In this series of blog posts I’ve mostly been thinking about how backups are implemented, and ignoring how the storage is provided. The point about cost is an important one, though. While I’m not willing to think about how to design a backup system that relies on any specific storage provider, it’s important that the design of a backup implementation allows the user to choose a way to store and access their backups that suits them.
A backup system that costs too much to use, or is not available when the user needs it, is of no use.
Adam Bark points out that the “backup server API” and the actual backup storage need not reside on the same machine. One might, for example, deploy the API on the local machine, but back up to a storage provider. It may even be possible to run the API on one server, but still actually store the backups on a storage provider. There are important technical problems here that need to be solved to have a backup system that’s reliable, robust, and efficient, but they too are interesting problems, and interesting problems are why I’m thinking about backup implementation.
Feedback
I’m not looking for suggestions on what backup software to use. Please don’t suggest solutions.
I would be happy to hear other people’s thoughts about backup software implementation. Or what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.
I’ve made a new CI engine that lets me run CI on untrusted code without having to worry. I call it Ambient, and it’s quite awful to use, but it works for me. The web site is also quite horrifyingly ugly. I’m a hacker, I don’t understand marketing.
Not sure if Ambient is of much interest to anyone else, but I would welcome help in making it nicer. There’s a lot of low hanging fruit, I’m sure.
Ambient runs the CI project in a virtual machine, under qemu-system, without network access. The CI run has limits on CPU cores, RAM, disk space, and run time it can use. The limits are set by the person running Ambient, not by the project. On my Framework laptop it takes about four seconds to run a dummy CI project that just runs echo hello world.
Current status is that it’s my personal CI system. I build all my web sites using it, and build and test all my personal projects with it. I also build and publish Debian packages for some of my software, using Ambient. However, as I’m very lazy, I will happily read binary log files with less if it saves me from having to implement a better run log.
(In case it matters: the code is in Rust and is licensed under GNU GPL v3 or later. There is no actual release yet, but you can install it from the source tree.)
This post is part of a series on backup software implementation. See the backup-impl tag for a list of all posts in the series.
Very high level architectural assumptions
The following assumptions about the software architecture of a backup system are less firmly set in stone than the table stakes in my first post. However, having spent two decades thinking about this, they make sense to me. If you think I'm wrong, feel free to tell me how and why (see the end for how).
- Backed up data is split into chunks of a suitable size. This makes de-duplication simple: by splitting files into chunks in just the right way, identical data that occurs in multiple places can be stored only once in the backup. The simplest example of this is when a file is renamed, but not otherwise modified. A sensible backup system will notice the rename and only store the new name, not all the data in the file all over again.
- De-duplication can be done at a fine granularity or a coarse one. There are a number of approaches here. At this high level of architectural thinking, we don't need to care how the splitting into chunks happens. We do need to take care that the size of chunks can vary, and that the backup storage must not care about the specifics of chunk splitting.
- There are ways to do "content sensitive" chunk splitting so that the same bit of data is recognized as a chunk even if it's preceded by other data. This is exciting, but I don't know of any research about how much this actually finds duplicate data in real data sets. A flexible backup system might need to support many ways to split data into chunks, so that the optimal method is used for each subset of the precious data being backed up.
- I note that the finest possible granularity here is the bit, but it would be ridiculous to go that far. However the backup system is implemented, each chunk is going to incur some overhead, and if the chunks are too small, even the slightest overhead is going to be too much. A backup system needs to strike a suitable balance here.
- To achieve de-duplication, the backup system needs a way to detect that two chunks are identical. A popular way to do this is to use a cryptographically secure checksum, or hash, such as SHA3. An important feature of these is that if two chunks have the same hash, they are almost certainly identical in content (if the hashes are different, the chunks are absolutely certain to be different). It can be much more efficient to compute and compare hashes than to retrieve and compare chunk data. This is probably good for most people most of the time.
- However, for the people who do research into hash function collisions, it's not good enough. It makes for a sad researcher who spends a century of CPU time to create a hash collision, then makes a backup of the generated data, and when restoring their data finds out that the backup system decided that the two files with the same checksum were in fact identical.
- A backup system could make this configurable, possibly on a per-directory basis. A hash collision researcher can mark the directory where they store hash collisions as "compare to de-duplicate".
- I admit this is a very rare use case, but it preys on my mind. At this level of software architectural thinking, the crucial point is whether to make the backup system use content hashes as the only chunk identifiers, or if chunk identifiers should be independent of the content.
- I really like the SSH protocol and its SFTP sub-system for data transfer. I don't particularly like it for accessing a backup server. The needs of a backup system are sufficiently different from the needs of generic remote file transfer and file system access that I don't recommend using SFTP for this. For example, it's tricky to set up an SFTP server that allows a backup client to make a new backup, or to restore an existing backup, but does not allow deleting a backup. It makes more sense to me to build a custom HTTP API for the backup server.
- It seems important to me that one can authorize one's various devices to make new backups automatically, but not allow them to delete old backups. This mitigates the situation where a device is compromised. A compromised client can't destroy data that has been backed up, even if it can make new backups with nonsense or corrupt data.
Feedback
I'm not looking for suggestions on what backup software to use. Please don't suggest solutions.
I would be happy to hear other people's thoughts about backup software implementation. Or what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.
Background
I am a twice failed backup software developer. I'm not currently intending to write new backup software, but I keep thinking about the technical problems in implementing backup software. This is the first in a series of blog posts about that.
In 2004 I set out to implement a new program for doing backups. This eventually became known as "Obnam version 1". (Its initial name was "backup script", or bs on the command line. I have been told I don't understand marketing.)
Obnam 1 was mildly successful, in that for some number of people it provided a backup solution reasonably well. I don't know how many people: I don't add tracking or surveillance to my software. Probably at least hundreds, based on Debian popcon data. Arguably, the biggest achievement of Obnam 1 is that it inspired better competitors.
Obnam 1 was implemented in Python. It did coarse de-duplication and client-side encryption, and could use an SSH server for storage using SFTP. I retired Obnam 1 in 2017, after it was no longer fun to work on as a hobby. The software was slow, and making changes was tedious.
In 2020, I realized I couldn't stop thinking about how to implement backup software, and I had recently learned the Rust language, so I started to build Obnam 2, in Rust. It was fun for a while. A couple of years later, I had lost momentum and energy for that, too. Like with Obnam 1, I had made software that kind of worked, but that was tedious to change. I've not officially retired Obnam 2, but the current code base does not seem like something I want to build on as a hobby.
In retrospect, I think the biggest thing that went wrong with both Obnam 1 and 2 is that I rushed to get to a state where I was able to use the software for my own needs, and that I made compromises that later turned out to be hard to undo or change. You might call this "technical debt", though I hope you don't, as I don't like that concept. I prefer to think of this as building a shaky foundation for my backup house, and picking the wrong kind of wood for the roof support beams. Changing any of that would require building a new house, by changing the old house one brick, plank, or nail at a time, while people were living in it. Doable, but not fun in a hobby project.
It's now 2024, and I still can't stop thinking about how to implement backup software. I'm beginning to suspect I may have a little bit of an obsession. At this point, it's probably best for me to concentrate on thinking about the problems, and their possible solutions, rather than actually building software. That's the funnest part of this.
I'm going to post my thoughts about this as a series of blog posts here on my personal blog. I don't know how long the series will be, nor how frequent. For each post, I'll start a fediverse thread, in case anyone has comments on what I've written. I will also tag each post in this series with backup-impl, and you can subscribe to the RSS or Atom feed for that tag if you want to.
What are backups, anyway?
Backups are actually not important to anyone. What matters is that you can recover your data after your primary copy of it is corrupted or lost. Restoring is important. Rather than try to get people to adopt new terminology, I'll stick to talking about backups, but I wanted to make this point early on.
I'll use the following terminology in this blog series.
- Primary copy of your data is the one you work with. It's on your laptop, desktop, server, phone, or other computing device. If you need to look up or modify a document, photo, or whatever, the primary copy is what you use.
- Backup copy is an independent snapshot of the primary copy at a given time, and is what you recover your data from in case of an emergency.
- Restore is the process of recovering your data from a backup copy.
It's important that the backup copy is independent from the primary copy. This means that, say, a database replica that gets updated automatically whenever the primary database is updated is not a backup. Likewise, a RAID array is not a backup. Both of these are good for other kinds of disaster recovery, but they don't let you recover data you've deleted or corrupted.
A copy of the data on the same hard drive can be considered a backup copy, for some disaster scenarios. It protects you against the primary copy being corrupted or deleted, but not against the hard drive failing. It's up to you to decide what threats you want your backups to protect against, and this will inform you of whether you need a backup copy on a different hard drive, in a different computer, on a different continent, or possibly in a different universe.
An important point about backups is that you don't know that you have a valid backup unless and until you have successfully verified that you can restore the data.
Table stakes for backups
When I think about backups, I have a bunch of assumptions that are usually unstated. Unstated assumptions lead to confusing discussion. Here are some of the assumptions I make, made explicit:
- The user has precious primary data on their device, stored in files in a file system.
- data that is not precious does not need to be backed up
- the user decides what is precious for them, the backup system assumes everything is, unless told otherwise
- I'm not currently concerned about data in memory, on other devices, in actively updated databases, or other such scenarios; they are not unimportant, but out of scope for me, at least for now
- The user can make a backup to local storage, or a remote server.
- local storage is anything that the backup software can access via the file system, and is probably something like a USB drive
- remote server is accessed over the network in some manner, probably using an HTTP API, with the API provided by some backup software component that also does access control
- I am not concerned with non-file system storage, such as tapes.
- Backups are encrypted and authenticated on the client.
- the backup software can verify, using cryptography, that the backup data it retrieves from backup storage is what was put into the storage and hasn't been modified in between
- a backup server, if one is used, does not have access to, or care about, the contents of the backed up data
- Users should not have to trust the backup server more than they have to.
- they have to inherently trust that the server doesn't delete or corrupt backups intentionally
- they should not have to trust that the server doesn't snoop on the users, because the backups are encrypted on the client
- Users who trust each other can share the backup storage in a way that allows them to share backed up data.
- if Alice and Bob both have a copy of the same large file, and Alice makes a backup of it first, Bob should not have to back it up again
- this is called "de-duplication", of which I will have more to say later
- of course, Alice and Bob might just be two devices owned by the same person, instead of being different people, but from the backup software point of view, this seems like an unimportant distinction
- this mutual de-duplication only applies to users who opt to trust each other
- it may be too difficult a problem to design a backup system that allows mutually hostile users to share the backup storage, and I'm not going to try to think about that; I'm not even sure it makes sense for mutually hostile people to share backed up data, even if it's an interesting technical problem of how to do that
- When a user is facing a disaster, their backup system should require them to have as few things as possible in order to recover. Ideally, all the user should need to know is where their backups are, their credentials to access their backup storage, and their encryption keys.
- I do not want to assume the user necessarily has a copy of an encryption key, or an encryption device. Ideally, if the user remembers their backup server, username, and passphrase, they should be able to recover their data.
- However, this is also something that different people have different needs about. The backup system should be able to cater to different needs here.
- I'm not interested in backup systems that assume a specific file system or storage technology, such as btrfs send. I don't want restores to be tied to using the same file system where the backup was made.
- I want backups that are independent snapshots, not merely a delta against the previous backup. Deltas become hard to manage and limit the run time performance of the backup system. Snapshots make it easy to remove specific backups, or to browse them.
These are all things I'm reluctant to change, or to make compromises on. Your assumptions may be different, and that's OK, but this is my blog, and my thought process, and my assumptions apply.
Feedback
This blog post and its possible follow-ups are just me thinking aloud. I make no promises about implementing any of this, ever. I'd very much like to, but I don't know if I will have the time and energy. If I do, I might build something only for myself. However, if you'd like to pay me to build a backup system for you, I'm happy to invoice for my time via my company. (The last sentence was blatant advertising that your ad blocker didn't detect.)
I'm not looking for suggestions on what backup software to use. Please don't suggest solutions.
I would be happy to hear other people's thoughts about backup software implementation. Or what needs and wants they have for backup solutions.
If you have any feedback on this post, please post it in the fediverse thread for this post.