I’m sometimes cruel to my tools to learn how they behave in unusual circumstances. For example, what is the fastest way to delete a very large number of files?
If the circumstances are just right, the answer may be `mkfs`, to create a new, empty file system. But that's not always an appropriate answer. The more obvious answer is `sudo rm -rf`. Or so you'd think. It turns out, however, that sometimes `find -delete` is a lot faster, or so I learned when I did a little speed testing some years ago. I've been told that `rsync --delete` is even faster than that. What is the current situation?
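For concreteness, the invocations I have in mind look roughly like this. `/mnt/files` is a placeholder for the directory tree to be removed, and rsync has no plain "delete this" mode, so the trick is to sync an empty directory over the target:

```sh
sudo rm -rf /mnt/files

sudo find /mnt/files -delete

# rsync deletes by syncing an empty directory over the target
# with --delete.
mkdir -p /tmp/empty
sudo rsync -a --delete /tmp/empty/ /mnt/files/
```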
How would you test that? I would start by creating a large number of files in a directory tree, and then deleting them. How many should I create? Performance tends to be interesting only for large values of N, so I created a file system with a billion files. Since I pay for my own storage, I created empty files.
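The scripts I actually used are linked at the end of this post. Purely as a sketch of the idea, assuming the freshly made file system is mounted at `/mnt`: spread the billion empty files over a two-level directory tree, so that no single directory becomes unmanageably large.

```sh
# 1000 directories x 1000 subdirectories x 1000 files = 10^9 files.
mkdir -p /mnt/files
for d in $(seq 0 999); do
    for dd in $(seq 0 999); do
        dir="/mnt/files/$d/$dd"
        mkdir -p "$dir"
        # One touch invocation creates 1000 empty files at once.
        (cd "$dir" && touch $(seq 0 999))
    done
done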
I did this twice: once with ext4, once with btrfs. Those are the file systems I currently care about. The disks were overwritten with zeroes first, and afterwards I stashed away a copy of each image, compressed with xz. The ext4 image is 11 GiB, the btrfs one 31 GiB.
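The preparation for each image went something like the sketch below; `/dev/vdb` is a placeholder for the benchmark disk, and zeroing it first is what makes the images compress so well:

```sh
# Fill the disk with zeroes; dd exits with an error when the disk
# is full, which is expected here.
sudo dd if=/dev/zero of=/dev/vdb bs=1M status=progress || true
sudo mkfs.ext4 /dev/vdb    # or: sudo mkfs.btrfs /dev/vdb
# ...mount, create the billion files, unmount...
sudo dd if=/dev/vdb bs=1M | xz -T0 > ext4-billion.img.xz
```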
I unpacked the disk images to an LV attached to a VM, then ran various commands to delete the files.
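A single measurement run then looks roughly like this; the LV path is a placeholder:

```sh
# Restore the image onto the LV, mount it, and time one deletion method.
xz -dc ext4-billion.img.xz | sudo dd of=/dev/vg0/bench bs=1M
sudo mount /dev/vg0/bench /mnt
time sudo rm -rf /mnt/files
sudo umount /mnt
```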
| command | ext4 (s) | btrfs (s) |
|---|---|---|
| `rm -rf` | 18647 | 19783 |
| `find -delete` | 18964 | 19702 |
| `rsync --del` | 24279 | 53516 |
Based on this entirely unscientific benchmark, the fastest way to delete files is either `rm` or `find`, with not much difference between them. `rsync` seems to be significantly slower, especially on btrfs.
I didn't measure memory use or other factors. If someone knows of a CS student who could do this more formally for a class, I'm sure it'd be a fascinating project. Have at it.
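If someone did want to measure memory, GNU time (the `/usr/bin/time` binary, not the shell builtin) reports peak memory use alongside the timings:

```sh
# -v makes GNU time print "Maximum resident set size (kbytes)"
# among other statistics.
sudo /usr/bin/time -v rm -rf /mnt/files
```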
Links:
- https://events.static.linuxfound.org/slides/2010/linuxcon2010_wheeler.pdf
  - slides of a talk from 2010 about creating file systems with a billion files: back then it was a challenge to do that, now it's not
- https://lwn.net/Articles/400629/
  - LWN article about the talk
- http://git.liw.fi/billion-files/
  - the scripts I used for this
  - they're stupid, but simple, so they can be easily reviewed
  - they run during my work day, or overnight