As part of my development of a backup application, I run benchmarks on it, and that means creating a test data set, running some backups, and then removing the test data set. One of the test data sets I use is 140 gibibytes of my real data. The benchmark first copies the data to a temporary location.

In other words, a fair bit of my current life is spent waiting for files to be copied and removed. The faster that goes, the better.

Overnight, I ran a little benchmark on those operations, to compare a couple of ways to do them. The results are below:

  elapsed  cmd
      (s)
    107.2  rm -rf tmp/data
     98.4  find tmp/data -delete
    100.1  find tmp/data -exec rm -rf {} +
    116.2  find tmp/data -depth -print0 | xargs -0 rm -rf

  elapsed  cmd
      (s)
   3567.5  cp -a tmp/data tmp/copy
   3219.5  cd tmp && mkdir copy && tar -C data -cf - . | tar -C copy -xf -

It is surprising, but it's clear that find with -delete is faster than rm -rf at deleting files, by roughly eight percent (98.4 s against 107.2 s). Since performance is a feature, this would indicate that the feature is buggy in rm.
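
For anyone who wants to try the same comparison on a small scale, here is a minimal sketch. It is not the attached benchmark script and does not use benchmark-cmd; the `timeit` helper and the tree shape (10 directories of 100 files) are made up for illustration, and the `%N` format assumes GNU date. On a tree this small the timings are mostly noise, so it only shows the method, not the result.

```shell
# Print elapsed wall-clock milliseconds for a command (assumes GNU date's %N).
timeit() {
    start=$(date +%s%N)
    "$@"
    end=$(date +%s%N)
    echo "$(( (end - start) / 1000000 )) ms  $*"
}

# Create a small tree: 10 directories with 100 tiny files each (arbitrary shape).
make_tree() {
    for d in $(seq 1 10); do
        mkdir -p "$1/dir$d"
        for f in $(seq 1 100); do
            echo "x" > "$1/dir$d/file$f"
        done
    done
}

scratch=$(mktemp -d)
make_tree "$scratch/a"
make_tree "$scratch/b"

# Remove two identical trees with the two fastest methods from the table.
timeit rm -rf "$scratch/a"
timeit find "$scratch/b" -delete
```

For numbers that mean anything, the tree needs to be large and the page cache state comparable between runs.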

For file copying, the piping of two tars is a common trick, and it really is faster, again by almost ten percent.
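
For readers who have not seen the tar trick, here is a small sketch of it on a throwaway directory. The scratch location and file names are invented for the example; the tar invocation itself is the one from the table.

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/data/sub"
echo "hello" > "$scratch/data/sub/file.txt"

# The first tar writes an archive of data/ to stdout; the second reads the
# archive from stdin and unpacks it into copy/. No archive file hits the disk.
(cd "$scratch" && mkdir copy && tar -C data -cf - . | tar -C copy -xf -)

# diff -r exits non-zero if the two trees differ in content.
diff -r "$scratch/data" "$scratch/copy" && echo "copy matches"
```

The `-C` options make each tar change directory first, so the archive holds paths relative to data/ and they land directly under copy/.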

Obviously, there might also be a problem with the benchmark. I attach the script, which uses benchmark-cmd in extrautils, which I wrote for this kind of thing. If there is a problem with the benchmark, don't hesitate to provide a patch to fix that.

There may be other ways to remove or copy files that should be compared, too. rsync? cpio? For file removal, a tool using Linux getdents directly would probably be faster than the portable code in GNU coreutils and findutils. Somebody should write that and compare.

In all cases above, the test data set to be removed or copied is 30 GiB. Copies happened to the same disk (that's what happens with my backup benchmarks too). The filesystem used was ext4.

You mention in your post that the data set is 30 GiB in size, but how is that distributed? How deep is the directory hierarchy? How many files per directory? Etc.
Comment by Juan Antonio Thu Sep 1 11:45:57 2011
That's a good point about the nature of the test data. The genbackupdata tool produces the data. By default it creates a directory hierarchy that is three levels deep, with 128 files per leaf directory. Files are 16 KiB. See the source for details.
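
As a rough illustration of that layout (this is not genbackupdata, just a sketch of the shape described): the fan-out of two directories per level is an assumption, since only the depth, the files per leaf, and the file size are stated here. With these values it creates 8 leaf directories and 16 MiB of random data.

```shell
root=$(mktemp -d)
fanout=2            # assumed: directories per level is not stated above
files_per_leaf=128  # from the description
file_size=16384     # 16 KiB, from the description

# Three nested loops give a hierarchy that is three levels deep.
for a in $(seq 1 "$fanout"); do
    for b in $(seq 1 "$fanout"); do
        for c in $(seq 1 "$fanout"); do
            leaf="$root/$a/$b/$c"
            mkdir -p "$leaf"
            for n in $(seq 1 "$files_per_leaf"); do
                head -c "$file_size" /dev/urandom > "$leaf/file$n"
            done
        done
    done
done

find "$root" -type f | wc -l   # number of files created
du -sh "$root"                 # total size on disk
```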
Comment by Lars Wirzenius Thu Sep 1 11:51:33 2011
Have you tried using 'cp -al' for your testing purposes? It is much faster than actually copying the data around.
Comment by Aigars Thu Sep 1 12:41:03 2011
Hardlinks aren't useful when files are being modified in place. So yeah, I've thought of that obvious optimization, but it's not going to work.
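
A small sketch of the problem, using invented file names and assuming GNU cp for the -l option: after 'cp -al', both names point at the same inode, so an in-place change made through the copy also shows up in the original.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/data"
echo "original" > "$scratch/data/file.txt"

# -a copies the tree recursively; -l hard-links files instead of copying them.
cp -al "$scratch/data" "$scratch/copy"

# Appending writes to the existing inode, which both names share.
echo "modified" >> "$scratch/copy/file.txt"

# The "original" file now contains the appended line as well.
cat "$scratch/data/file.txt"
```

A tool that replaces files by writing a new file and renaming it over the old one would not have this problem, but a benchmark cannot assume that of everything that touches the data.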
Comment by Lars Wirzenius Thu Sep 1 13:00:20 2011

A wishlist item (I cannot open a wishlist bug): make the file sizes variable and allow varying directory depths.

-t range-size (-t 10k-50k) -b range-size --depth number-of-dirs depth --max-count-dir max number of directories by directory

Regards,

Comment by Javier Thu Sep 1 14:58:45 2011

Have you considered testing via a tmpfs? You can then just umount when done, which will complete almost instantly.

As for copying, you might consider one of the many transparent overlay filesystems, which would again allow more-or-less instantaneous operation.

Comment by JoshTriplett Thu Sep 1 21:50:09 2011

Javier, that would probably be a good addition to genbackupdata. However, I don't particularly need it myself, so I'm unlikely to add it anytime soon. Patch most welcome, though!

Josh, those would make the benchmarking setup faster, but also less realistic. I'd like the benchmarks not to be too far away from reality, and backing up from or to a tmpfs does not seem a particularly common use case. Overlay filesystems or lvm copy-on-write setups would likewise affect how the filesystem acts underneath obnam, and I'd like to avoid those complications when considering what to optimize next. But for other situations, they're definitely a good thing to consider.

Comment by Lars Wirzenius Thu Sep 1 23:02:27 2011
Joey found http://linux.die.net/man/1/fastrm, part of inn, which may be faster than calling rm.
Comment by Lars Wirzenius Mon Sep 5 07:50:04 2011

Hello,

this is a somewhat late comment, but in case you still have your benchmarks ready, could you try

rsync -a --delete

i.e. sync an empty directory with the directory you want to delete.
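
Spelled out on a throwaway directory, that could look like the sketch below. The directory names are invented for the example, and rsync must be installed. Note that this empties the target directory but leaves the directory itself in place, so a final rmdir is still needed.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/empty" "$scratch/target"
touch "$scratch/target/a" "$scratch/target/b"

# The trailing slashes make rsync sync the contents of empty/ into target/.
# --delete removes everything in target/ that is not in empty/, i.e. all of it.
rsync -a --delete "$scratch/empty/" "$scratch/target/"

ls -A "$scratch/target"   # lists what is left in target/
rmdir "$scratch/target" "$scratch/empty"
```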

According to this guy: http://linuxnote.net/jianingy/en/linux/a-fast-way-to-remove-huge-number-of-files.html this is a lot faster than traditional "rm" or "find" alternatives.

Comment by guestguest Tue Jul 10 13:39:47 2012