During lunch the other day, I discussed the shortcomings of the tar file format with friend and co-worker Daniel. The tar file format has a lot of legacy by now, and it's not quite up to date with the latest developments in file systems, such as extended attributes. This makes tar badly suited for things such as backups and other situations where precise reproduction of the input data matters.

There are several variants of the tar file format, and various more or less standard extensions to it. GNU tar, for example, added support for pathnames longer than 100 bytes many years ago, and it is now commonly supported.
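
As a rough illustration of how the variants differ on long pathnames, here is a small sketch using Python's standard `tarfile` module (not GNU tar itself). The helper `header_size` and the 305-character path are made up for the example. The POSIX ustar variant can split a name into a 155-byte prefix and a 100-byte name, so it copes with moderately long paths, but it rejects this one, while the GNU variant stores the full name in an extra header block.

```python
import tarfile

def header_size(name, fmt):
    """Return how many header bytes tarfile emits for `name` in format `fmt`,
    or None if that format cannot represent the name at all."""
    info = tarfile.TarInfo(name=name)
    try:
        return len(info.tobuf(format=fmt))
    except ValueError:
        # tarfile raises ValueError when the name does not fit the format
        return None

# Hypothetical path: 150 nested one-letter directories plus a file name.
LONG_NAME = "d/" * 150 + "f.txt"

if __name__ == "__main__":
    print("ustar:", header_size(LONG_NAME, tarfile.USTAR_FORMAT))
    print("gnu:  ", header_size(LONG_NAME, tarfile.GNU_FORMAT))
```

The GNU result is larger than one 512-byte block because the long name travels in its own pseudo-entry ahead of the real header, which is the kind of bolt-on extension the format has accumulated.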

Other problems in the tar file format:

  • It has no native support for compression. The Unix Way is to use an external compressor, which is nice, but it makes it necessary to decompress the entire file to get a list of its contents. For large archives, this is very time-consuming.
  • Even when uncompressed, the file format works badly for some kinds of operations, such as deleting files from the archive, or updating them with new versions.
  • The file format is entirely linear. When creating a tar file, it would sometimes be possible to write data from multiple sources at the same time, perhaps compressing them separately, maybe with file type specific compressors. With a linear format, this is not possible without spooling some files into temporary files. An interleaved format, similar to multimedia files, which mix audio and video data into a single stream, would make it possible to be more efficient at writing.
  • The supported metadata for files is limited, and it's hard to extend the support without breaking the file format.
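
The first point in the list can be sketched with Python's standard `tarfile` module. This is only an illustration under some assumptions: `CountingReader`, `build_archive` and `list_names` are names invented for the example, the archive is built in memory, and the members are filled with random bytes so that gzip cannot shrink them. The sketch counts how many compressed bytes are consumed just to list the member names.

```python
import io
import os
import tarfile

class CountingReader(io.BytesIO):
    """In-memory file that records how many bytes have been read from it."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk

def build_archive(n_files=20, size=50_000):
    """Build a gzip-compressed tar archive in memory and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for i in range(n_files):
            data = os.urandom(size)  # random data, so effectively incompressible
            info = tarfile.TarInfo(name=f"file{i:02d}.bin")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def list_names(archive_bytes):
    """List member names; also return how many compressed bytes were read."""
    reader = CountingReader(archive_bytes)
    with tarfile.open(fileobj=reader, mode="r:gz") as tar:
        names = tar.getnames()
    return names, reader.bytes_read

if __name__ == "__main__":
    archive = build_archive()
    names, consumed = list_names(archive)
    print(f"{len(names)} members, read {consumed} of {len(archive)} bytes")
```

Because the member headers are scattered through the compressed stream, the listing has to decompress past every file's data to reach the next header, so nearly the whole archive gets read. A format with a separate index would avoid that.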

This led us to discuss the possibility of a new file format. We had a bit of fun exploring the solution space for a while.

However, almost all use of tar these days is for distributing sets of files, where the filename and a basic set of file permissions are enough. In other words, for things such as source code, tar is just fine. The archives are small enough, and the other limitations are rarely a problem, but the pain of switching to a new format would be great. Thus, with some reluctance, we concluded that a new format would be a waste of time.

But I thought I'd write this up anyway, in case one of my readers wants to start working on this.

http://duplicity.nongnu.org/new_format.html may propose interesting options?
Comment by obergix Mon Sep 3 20:27:13 2012
Zip is making the occasional comeback in this space. What is your evaluation of that?
Comment by Peter Mon Sep 3 21:03:32 2012
It strikes me that a better approach (though certainly much harder) is to remove the need for archive files in the first place, making everything able to handle directories seamlessly.
Comment by eythian Mon Sep 3 21:12:00 2012

Although I agree that a new format should come out sooner or later, I can see the other side, too. Tar is Tape ARchive. It is designed to create sequential archives to be written on a tape, and as far as I can see, tapes are still widely used for backup in large organizations, and they are huge in size (hundreds of gigabytes per cassette or so. Maybe I'm a bit out of date, or I'm too optimistic). It's not my task to decide if this is a good thing at all, but as long as this is the case, tar has its place in the world.

For everything else, there should be numerous alternatives. Some archivers/compressors exist that provide a native Linux interface (e.g. p7zip, arj, rar). They should start supporting the features you call for, although some of these features are UNIX-only, or mostly Linux-only. So why should we reinvent the wheel? It's already out there, we just need to make it a bit better.

Comment by Gergely Mon Sep 3 21:16:14 2012

How about just using a compressed filesystem? Like btrfs?

If it could chomp off all the unused space at the end of the archive, then you'd have the best of everything and you could access files in an archive without even having to extract it first.

Comment by Jonathan Mon Sep 3 22:59:30 2012


It's very good! At least you can extract any file without reading the whole archive.

Comment by Dmitrijs Tue Sep 4 08:59:38 2012
The Dar (Disk ARchive) backup program defines its own file format.
Comment by Nicola Thu Oct 25 22:07:43 2012