ETA, 2022-05-19: I'm happy this blog post has gathered a fair bit of interest. However, this post is as much effort as I'm prepared to put into the topic. I think it would be a good idea to write an essay, article, or even a book, on how syntax of the Unix command line has varied over the years, and in different subcultures. Something semi-scholarly with cited sources for claims, and everything. I'd be happy to see this post be used as a basis: the CC license makes that easy. However, such a project would be quite a bit of work that I'm not interested in doing, I'm afraid.
This blog post documents my understanding of how the conventions for Unix command line syntax have evolved over time. It's not properly sourced, and may well be quite wrong. I didn't start using Unix until 1989, so I wasn't there for the early years. Maybe someone has written a proper essay on this, with citations. I'm too lazy to dig them up.
Early 1970s
In the beginning, in the first year or so of Unix, an ideal was formed for what a Unix program would be like: it would be given some number of filenames as command line arguments, and it would read those. If no filenames were given, it would read the standard input. It would write its output to the standard output. There might be a small number of other, fixed, command line arguments. Options didn't exist. This allowed programs to be easily combined: one program's output could be the input of another.
There were, of course, variations. The echo command didn't read anything. The cp, mv, and rm commands didn't output anything. However, the "filter" was the ideal.
$ cat *.txt | wc
In the example above, the cat program reads all files whose names end with the .txt suffix and writes their contents to its standard output. That output is piped to the wc program, which reads its standard input (it wasn't given any filenames) to count words. In short, the pipeline above counts words in all text files.
This was quite powerful. It was also very simple.
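To make the filter ideal concrete, here is a minimal sketch in Python. The language and the names count_words and run_filter are my own choices for illustration, not anything from early Unix. It reads the named files, or standard input if no files are named, and writes its result to standard output.

```python
import sys


def count_words(stream):
    """Count whitespace-separated words in one text stream."""
    return sum(len(line.split()) for line in stream)


def run_filter(filenames, stdin=sys.stdin, stdout=sys.stdout):
    """Count words in the named files, or in stdin if no files are named."""
    total = 0
    if filenames:
        # Filenames were given: read each of them in turn.
        for name in filenames:
            with open(name) as f:
                total += count_words(f)
    else:
        # No filenames: act as a filter and read standard input.
        total = count_words(stdin)
    # The result always goes to standard output, so it can be piped onward.
    stdout.write(f"{total}\n")
```

Calling run_filter(sys.argv[1:]) at the bottom of a script would give a rough, word-count-only imitation of wc that can sit at either end of a pipeline.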
Options
Fairly quickly, the developers of Unix found that many programs would be more useful if the user could choose between minor variations of function. For example, the sort program could provide the option to order input lines without regard to upper and lower case.
The command line option was added. This seems to have resulted in a bit of a philosophical discussion among the developers. Some were adamantly against options, fearing the complexity they would bring, and others really liked them, for the convenience. The side favoring options won.
To make command line parsing easy to implement, options always started with a single dash, and consisted of a single character. Multiple options could be packed after one dash, so that foo -a -b -c could be shortened to foo -abc.
If not immediately, then soon after, an additional twist was added: some options required a value. For example, the sort program could be given the -kN option, where N is an integer specifying which word in a line would be used for sorting. The syntax for values was a little complicated: the value could follow the option letter as part of the same command line argument, or be the next argument. The following two commands thus mean the same thing:
$ sort -k1
$ sort -k 1
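As a rough illustration of these rules, here is a sketch using the getopt module from Python's standard library, which follows the same conventions as the Unix getopt function. The option letters -a, -b, and -k are made up for the example.

```python
import getopt

# "abk:" declares -a and -b as plain flags, and -k as an option that
# takes a value (that is what the trailing colon means).
SHORT_OPTIONS = "abk:"


def parse(args):
    """Return (options, remaining arguments) for a list of arguments."""
    return getopt.getopt(args, SHORT_OPTIONS)


# Clumped flags and separate flags parse to the same thing.
print(parse(["-ab", "file.txt"]))
print(parse(["-a", "-b", "file.txt"]))

# An option value may be attached to the letter or be the next argument.
print(parse(["-k1", "file.txt"]))
print(parse(["-k", "1", "file.txt"]))
```

Each call returns a list of (option, value) pairs followed by the list of arguments that were not options.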
At this point, command line parsing became more than just iterating over the command line arguments. The dominant language for Unix was C, and a lot of programs implemented the command line parsing themselves. This was unfortunate, but at this stage the parsing was still sufficiently simple that most of them did it in sufficiently similar ways that it didn't cause any serious problems. However, it was now the case that one often needed to check the manual, or experiment, to find out how a specific program was to be used.
Later on (Wikipedia says 1980), the C library function getopt was written. It became part of the Unix C standard library. It implemented the command line parsing described above. It was written in C, which at that time was quite a primitive programming language, and this resulted in a simplistic API. Part of that API is that if the user used an unknown option on the command line, the getopt function would return a question mark (?) as its value. Some programs would respond by writing out a short usage blurb. This led to -? being sometimes used to tell a program to show a help text.
Long options
In the late 1970s Unix spread from its birthplace, Bell Labs, to other places, mostly universities. Much experimentation followed. During the 1980s some changes to command line syntax happened. The biggest change here was long options: options whose name wasn't just a single character. For example, in the new X Window System, the -display option would be used to select which display to use for a GUI program.
Note the single dash. This clashed with the "clumping together" of single-character options. Does -display mean which display to use, or the options -d -i -s -p -l -a -y clumped together? This depended on the program and how it decided to parse the options.
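To make the ambiguity concrete, here is a small Python sketch with two hypothetical helper functions, one for each way a program might read the same argument. Neither is taken from any real parser.

```python
def as_clumped_short_options(arg):
    """Read "-xyz" as the short options -x -y -z clumped together."""
    return ["-" + letter for letter in arg[1:]]


def as_single_dash_long_option(arg):
    """Read "-xyz" as one long option named "xyz"."""
    return [arg]


arg = "-display"
print(as_clumped_short_options(arg))    # seven separate options
print(as_single_dash_long_option(arg))  # one option
```

Both readings are valid for the same string, so only the program's own choice of parser decides which one the user gets.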
A further complication to parsing the command line was that single-dash long options that took values couldn't allow the value to be part of the same command line argument. Thus, -display :0 (two words) was correct, but it could not be written as -display:0, because a simple C command line parser would have difficulty figuring out what was the option name and what was the option's value. Thus, what previously might have been written as a single argument -d:0 now became two arguments.
The world did not end, but a little more complexity had landed in the world of Unix command line syntax.
The GNU project
The GNU project was first announced in 1983. It was to be an operating system similar to Unix. One of the changes it made was to command line syntax. GNU introduced another long option syntax, I believe to disambiguate the single-dash long option confusion with clumped single-character options.
Initially, GNU used the plus (+) to indicate a long option, but quickly changed to a double dash (--). This made it unambiguous whether a long option or clumped short options were being used.
I believe it was also GNU that introduced using the equals sign (=) to optionally add a value to a long option. Values to options could be optional: --color could mean the same as --color=auto, but you could also say --color=never if you didn't like the default value.
GNU further allowed options to occur anywhere on the command line, not just at the beginning. This made things more convenient for the user.
GNU also wrote a C function, getopt_long, to unify command line parsing across the software produced by the project. I believe single-dash long options were supported from the start (in today's GNU C library that is the job of the companion function getopt_long_only). Some GNU programs, such as the C compiler, used those.
Thus, the following was acceptable:
$ grep -xi *.txt --regexp=foo --regexp bar
The example above clumps the short options -x and -i into one argument, and provides grep with two regular expression patterns, one with an equals sign, and one without.
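The grep example above can be approximated with gnu_getopt from Python's standard getopt module, which accepts GNU-style long options and lets options follow non-option arguments. This is only a sketch: the option table covers just the three options used in the example, and a.txt stands in for whatever the shell would expand *.txt to.

```python
import getopt


def parse_gnu(args):
    """Parse -x, -i and --regexp=VALUE in GNU style."""
    # "regexp=" (with the equals sign) declares a long option that
    # requires a value.
    return getopt.gnu_getopt(args, "xi", ["regexp="])


options, operands = parse_gnu(
    ["-xi", "a.txt", "--regexp=foo", "--regexp", "bar"]
)
print(options)
print(operands)
```

Both spellings of --regexp end up as the same kind of (option, value) pair, and a.txt is returned as an ordinary argument even though options come after it.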
The GNU changes have largely been adopted by other Unix variants. I'm sure those have had their own changes, but I've not followed them enough to know.
GNU also added standard options: almost every GNU program supports the options --help, --version, and --mail=ADDR.[1]
Double dash
Edited to add: Apparently the double-dash was supported already in about 1980 in the first version of getopt in Unix System III. Thank you to Chris Siebenmann.
Around this time, a further convention was added: an argument of two dashes only (--) as a way to say that no further options to the command being invoked would follow. I believe this was another GNU change, but I have no evidence.
This is useful to, say, be able to remove a file with a name that starts with a dash:
$ rm -- -f
For rm, it was always possible to provide a fully qualified path, starting from the root directory, or to prefix the filename with a directory, as in rm ./-f, and so this convention is not necessary for removing files. However, given all GNU programs use the same function for command line parsing, rm gets it for free. Other Unix variants may not have that support, though, so users need to be careful.
The double dash is more useful for other situations, such as when invoking a program that invokes another program. An example is the cargo tool for the Rust language. To build and run a program and tell it to report its version, you would use the following command:
$ cargo run -- --version
Without the double dash, you would be telling cargo to report its version.
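The effect of the double dash on parsing can be shown with a short sketch using Python's standard getopt module. The single -f flag is just an assumption to mirror the rm example.

```python
import getopt


def parse(args):
    """Parse the -f flag; stop looking for options at a bare "--"."""
    return getopt.getopt(args, "f")


# Without the double dash, -f is taken as an option.
print(parse(["-f"]))

# After the double dash, -f is an ordinary argument, such as a filename.
print(parse(["--", "-f"]))
```

The parser drops the -- itself and passes everything after it through untouched, which is what lets a wrapper like cargo hand the rest of the command line to another program.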
Subcommands
I think that around the late 1980s, subcommands were added to the Unix command line syntax conventions. Subcommands were a response to many Unix programs gaining a large number of "options" that were in fact not optional at all, and were really commands. Thus a program might have "options" --decrypt and --encrypt, and the user was required to use one of them, but not both. This turned out to be a little hard for many people to deal with, and subcommands were a simplification. Instead of using option syntax for commands, just require commands instead.
I believe the oldest program that uses subcommands is the version control system SCCS, from 1972, but I haven't been able to find out which version added subcommands. Another version control system, CVS, from 1990, seems to have had them from the beginning. CVS was built on top of yet another version control system, RCS, which had programs such as ci for "check in", and co for "check out". CVS had a single program, with subcommands:
$ cvs ci ...
$ cvs co ...
Later version control systems, such as Subversion, Arch, and Git, follow the subcommand pattern. Version control systems seem to inherently require the user to do a number of distinct operations, which fits the subcommand style well, and also avoids adding large numbers of individual programs (commands) to the shell, reducing name collisions.
Subcommands add further complications to command line syntax, though, when inevitably combined with options. The main command may have options (often called "global options"), but so can subcommands. When options can occur anywhere on the command line, is --version a global option, or specific to a subcommand? Worse, how does a program parse a command line? If an option is specific to a subcommand, the parsing needs to know which subcommand, if only so it knows whether the option requires a value or not.
To solve this, some programs require global options to be before the subcommand, which is easy to implement. Others allow them anywhere. Everything seems to require per-subcommand options to come after the subcommand.
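Here is a sketch of the "global options before the subcommand" approach, using Python's standard argparse module. The program name vcs and the options --verbose and --message are hypothetical; only the ci and co subcommand names are borrowed from the CVS example.

```python
import argparse


def build_parser():
    """Build a parser with one global option and two subcommands."""
    parser = argparse.ArgumentParser(prog="vcs")
    # A global option: declared on the main parser, so it is written
    # before the subcommand.
    parser.add_argument("--verbose", action="store_true")
    subcommands = parser.add_subparsers(dest="subcommand", required=True)

    ci = subcommands.add_parser("ci")
    # A per-subcommand option: only valid after "ci".
    ci.add_argument("--message")
    ci.add_argument("files", nargs="*")

    co = subcommands.add_parser("co")
    co.add_argument("module")
    return parser


args = build_parser().parse_args(
    ["--verbose", "ci", "--message", "fix typo", "README"]
)
print(args.subcommand, args.verbose, args.message, args.files)
```

Once the parser sees ci, it hands the rest of the command line to that subcommand's own parser, so it always knows which options are valid and which of them take values. With this setup, writing --verbose after ci is rejected as an unrecognized argument.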
Summary
The early Unix developers who feared complexity were right, but also wrong. It would be intolerable to have a separate program for every combination of a program with options. To be fair, I don't think that's what they would've advocated: instead, I think, they would've advocated tools that can be combined, and to simplify things so that fewer tools are needed.
That's not what happened, alas, and we live in a world with a bit more complexity than is strictly speaking needed. If we were re-designing Unix from scratch, and didn't need to be backwards compatible, we could introduce a completely new syntax that is systematic, easy to remember, easy to use, and easy to implement. Alas.
None of this explains dd.
[1] The --mail bit is a joke.