As part of my development of a backup application, I run benchmarks on it, and that means creating a test data set, running some backups, and then removing the test data set. One of the test data sets I use is 140 gibibytes of my real data. The benchmark first copies the data to a temporary location.
In other words, a fair bit of my current life is spent waiting for files to be copied and removed. The faster that goes, the better.
Overnight, I ran a little benchmark on those operations, to compare a couple of ways to do them. The results are below:
elapsed (s)  cmd
      107.2  rm -rf tmp/data
       98.4  find tmp/data -delete
      100.1  find tmp/data -exec rm -rf {} +
      116.2  find tmp/data -depth -print0 | xargs -0 rm -rf

elapsed (s)  cmd
     3567.5  cp -a tmp/data tmp/copy
     3219.5  cd tmp && mkdir copy && tar -C data -cf - . | tar -C copy -xf -
It is surprising, but it's clear that find -delete is noticeably faster than rm -rf at deleting files, by about eight percent. (The xargs variant, on the other hand, was the slowest of the four.) Since performance is a feature, this would indicate that that feature in rm is buggy.
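For anyone who wants to check the arithmetic, the relative differences can be worked out from the elapsed times in the results above. This is only a small sketch; pct_faster is a name made up for this example and is not part of any of the tools mentioned here.

```shell
#!/bin/sh
# pct_faster SLOW FAST
# Prints by how many percent FAST is quicker than SLOW, relative to SLOW.
# (Hypothetical helper, written just for this illustration.)
pct_faster() {
    awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.1f\n", (slow - fast) / slow * 100 }'
}

pct_faster 107.2 98.4      # rm -rf versus find -delete
pct_faster 3567.5 3219.5   # cp -a versus the tar pipe
```

With the numbers measured here, that comes to roughly 8.2 percent for deletion and 9.8 percent for copying.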
For file copying, the piping of two tars is a common trick, and it really is faster, again by almost ten percent.
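The tar pipe from the results can be wrapped in a small function, which makes it a bit easier to reuse. This is a sketch, not the code from the benchmark script: copy_with_tar is a name invented here, and it assumes a tar that understands -C, as GNU tar and bsdtar do.

```shell
#!/bin/sh
# copy_with_tar SRC DST
# Copies the contents of SRC into DST by piping one tar into another.
# (Hypothetical helper; mirrors the tar command line timed in the post.)
copy_with_tar() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # The first tar writes an archive of SRC to stdout,
    # the second unpacks it from stdin into DST.
    tar -C "$src" -cf - . | tar -C "$dst" -xf -
}
```

Usage would be something like copy_with_tar tmp/data tmp/copy. Note that in plain sh the exit status of the pipeline is that of the extracting tar, so a failure in the archiving tar can go unnoticed.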
Obviously, there might also be a problem with the benchmark. I attach
the script, which uses benchmark-cmd in
extrautils, which I wrote for this kind
of thing. If there is a problem with the benchmark, don't hesitate
to provide a patch to fix that.
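For readers who just want to try something similar without the attached script, here is a rough stand-in. It is not the attached script and does not try to mimic benchmark-cmd, whose interface is not shown in this post. make_tree and time_cmd are names made up for this sketch, the generated tree is tiny compared to the real test data, and date +%s (not strictly POSIX, but available on GNU and BSD systems) only gives whole-second resolution.

```shell
#!/bin/sh
# Sketch only: a hypothetical stand-in for a benchmark script.

# make_tree ROOT DIRS FILES
# Creates DIRS directories under ROOT, each holding FILES empty files.
make_tree() {
    root=$1
    dirs=$2
    files=$3
    d=1
    while [ "$d" -le "$dirs" ]; do
        mkdir -p "$root/dir$d"
        f=1
        while [ "$f" -le "$files" ]; do
            : > "$root/dir$d/file$f"
            f=$((f + 1))
        done
        d=$((d + 1))
    done
}

# time_cmd COMMAND-LINE
# Runs the command line via sh -c, discarding its output, and prints
# "<elapsed whole seconds><TAB><command line>".
time_cmd() {
    start=$(date +%s)
    sh -c "$1" >/dev/null 2>&1
    end=$(date +%s)
    printf '%s\t%s\n' "$((end - start))" "$1"
}
```

A run might look like make_tree tmp/data 100 1000 followed by time_cmd 'find tmp/data -delete'. The tree has to be recreated before each deletion command is timed, and the page cache is not flushed between runs, so the numbers are only roughly comparable.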
There may be other ways to remove or copy files that should be compared,
too. rsync? cpio? For file removal, a tool using Linux getdents directly
would probably be faster than the portable code in GNU coreutils and
findutils. Somebody should write that and compare.
In all cases above, the test data set to be removed or copied is 30 GiB. Copies happened to the same disk (that's what happens with my backup benchmarks too). The filesystem used was ext4.
A wishlist item (I cannot open a wishlist bug): make file sizes variable and allow several levels of directories, with options along these lines:
-t range-size (e.g. -t 10k-50k), -b range-size, --depth directory depth, --max-count-dir maximum number of directories per directory
Regards,
Have you considered testing via a tmpfs? You can then just umount when done, which will complete almost instantly.
As for copying, you might consider one of the many transparent overlay filesystems, which would again allow more-or-less instantaneous operation.
Javier, that would probably be a good addition to genbackupdata. However, I don't particularly need it myself, so I'm unlikely to add it anytime soon. Patch most welcome, though!
Josh, those would make the benchmarking setup faster, but also less realistic. I'd like the benchmarks not to be too far away from reality, and backing up from or to a tmpfs does not seem a particularly common use case. Overlay filesystems or LVM copy-on-write setups would likewise affect how the filesystem acts underneath obnam, and I'd like to avoid those complications when considering what to optimize next. But for other situations, they're definitely a good thing to consider.
Hello,
this is a somewhat late comment, but in case you still have your benchmarks ready, could you try
rsync -a --delete
i.e. sync an empty directory with the directory you want to delete.
According to this guy: http://linuxnote.net/jianingy/en/linux/a-fast-way-to-remove-huge-number-of-files.html this is a lot faster than traditional "rm" or "find" alternatives.
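Spelled out, the idea might look like the sketch below. The source and destination arguments are an assumption, since the command line above leaves them out, and rsync_empty is a name made up for this example. Whether this is really faster than rm or find has not been measured here.

```shell
#!/bin/sh
# rsync_empty TARGET
# Empties TARGET by syncing a freshly created empty directory over it.
# (Hypothetical helper; assumes rsync and mktemp are installed.)
rsync_empty() {
    target=$1
    empty=$(mktemp -d) || return 1
    # The trailing slashes make rsync sync directory contents, so
    # --delete removes everything in TARGET that is not in the empty dir.
    rsync -a --delete "$empty/" "$target/"
    status=$?
    rmdir "$empty"
    return "$status"
}
```

After rsync_empty tmp/data the directory tmp/data itself still exists, now empty, so a final rmdir tmp/data is needed to match what rm -rf does. Because of -a, the target directory also takes on the permissions and timestamp of the temporary empty directory.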