Simple things that work well together.
We're going to start combining these building blocks into power tools. To see how much you can do, we need a real problem to work on. Let's get some data!
Download the data into your bootcamp directory.
You can use your desktop tools to unzip it, or try out the shell command unzip.
unzip shelldata.zip
Once you have the data, take a look at the files with cd, ls and cat...
Ack. Sometimes we want to look at files that are long without them scrolling off the screen. Try out one or more of these. Discuss with your neighbors.
head -5 <filename>
head -50 <filename>
tail -5 <filename>
tail -20 <filename>
more <filename> # use q to exit
less <filename> # use q to exit
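To see what head and tail actually select, here is a minimal sketch on a made-up file. The scratch directory and the file name long.txt are assumptions for illustration, not part of the workshop data. It uses the spelled-out form head -n 5, which does the same thing as head -5.

```shell
# Hypothetical scratch file: 100 lines containing the numbers 1 to 100.
demo=$(mktemp -d)
seq 1 100 > "$demo/long.txt"

# head keeps lines from the top of the file, tail keeps lines from the bottom.
first_five=$(head -n 5 "$demo/long.txt")
last_five=$(tail -n 5 "$demo/long.txt")

echo "First five lines:"
echo "$first_five"
echo "Last five lines:"
echo "$last_five"
```

Neither command changes the file; each only chooses which lines to print.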
We'll start with the data in data/molecules: structure data for some organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule. We're going to use the wc command ('word count') to count the lines in each file.
Use cd to go into data/molecules and run the command wc *.pdb.
cd molecules
wc *.pdb
20 156 1158 cubane.pdb
12 84 622 ethane.pdb
9 57 422 methane.pdb
30 246 1828 octane.pdb
21 165 1226 pentane.pdb
15 111 825 propane.pdb
107 819 6081 total
Wildcards
* is a wildcard. It matches any set of characters, or none.
? is also a wildcard, but it matches a single character of any kind.
[a-c] matches a single character: a or b or c.
Commands like wc and ls never see the wildcard characters, just what those wildcards matched. This is another example of orthogonal design.
When we run wc *.pdb, the shell first expands *.pdb into a complete list of the .pdb files. We can see the list by typing echo *.pdb. The shell then runs wc with that list of files as its arguments, and wc prints out the result: the number of lines, words and characters in each file.
*.pdb # ends with .pdb
p*.pdb # starts with p, ends with .pdb
p*. # starts with p, ends with .
[pe]*.pdb # starts with p or e, ends with .pdb
p?.pdb # pi.pdb, p5.pdb...
p*.p?* # p(any).p(one)(any) -- matches plum.pi or p.print, but not p.list or plum.p
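A quick way to test a pattern is to hand it to echo, since echo simply prints whatever the shell expanded the pattern into. The sketch below assumes a scratch directory of empty files named after the molecules above, plus a hypothetical notes.txt, so it can be run anywhere without touching the real data.

```shell
# Hypothetical scratch directory with empty files; only the names matter here.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb notes.txt

# The shell expands each pattern first; echo only ever sees the file names.
all_pdb=$(echo *.pdb)        # every name ending in .pdb (notes.txt is left out)
p_files=$(echo p*.pdb)       # names starting with p
pe_files=$(echo [pe]*.pdb)   # names starting with p or e
one_char=$(echo ?thane.pdb)  # exactly one character before "thane.pdb"

echo "$all_pdb"
echo "$p_files"    # pentane.pdb propane.pdb
echo "$pe_files"   # ethane.pdb pentane.pdb propane.pdb
echo "$one_char"   # ethane.pdb (methane.pdb has two characters before "thane")
```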
Use wildcards to list files that
We still want to know the number of lines in the files. wc takes the option -l to show just the number of lines, but the list is clearly too long to scan for the smallest number. How can we make the computer do this?
Our first step toward a solution is to run the command:
$ wc -l *.pdb > lengths
The > tells the shell to redirect the command's output to a file instead of printing it to the screen. The shell will create the file if it doesn't exist, or overwrite it if it does.
This is why there is no screen output: the wc output has gone into the file lengths instead.
$ ls lengths
lengths
$ head lengths
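The create-or-overwrite behaviour of > is easy to check for yourself. This is a small sketch using two made-up files (two.txt and three.txt are assumptions, not part of the workshop data). It also shows >>, which appends to a file instead of replacing it.

```shell
# Hypothetical scratch directory and input files.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'a\nb\n' > two.txt
printf 'a\nb\nc\n' > three.txt

wc -l two.txt > lengths      # lengths does not exist yet, so the shell creates it
wc -l three.txt > lengths    # lengths exists now, so its old content is replaced
after_overwrite=$(wc -l < lengths | tr -d ' ')

wc -l two.txt >> lengths     # >> adds to the end of the file instead
after_append=$(wc -l < lengths | tr -d ' ')

echo "lines in lengths after the second >: $after_overwrite"
echo "lines in lengths after >>: $after_append"
cat lengths
```

The tr -d ' ' step is only there because some versions of wc pad their output with spaces.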
Great! Now that the lengths are in a file, we can sort them:
$ sort lengths
Oh. The output went to the screen, and the lengths file hasn't changed. Let's capture the output in another file:
$ sort lengths > sorted-lengths
Then we can use head -1 to get the shortest file:
$ head -1 sorted-lengths
Fortunately, the shell gives us a tool for combining these commands. It's |, and it's called a pipe. It allows the shell to redirect the output of a command to the input of the next one, effectively creating a pipeline that our data can flow through, with different processing at each stage. We can use it to combine the three commands wc, sort, and head in one line:
$ wc -l *.pdb | sort | head -1
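To convince yourself that the pipeline does the same job as the intermediate files, here is a sketch that runs both versions on stand-in data. The three files are made up with seq so that their line counts are known in advance. The sketch also uses sort -n, an assumption beyond the commands above: -n asks sort to compare the counts as numbers, which is the safer choice when they have different numbers of digits.

```shell
# Hypothetical stand-ins for the molecule files, with 9, 12 and 20 lines.
demo=$(mktemp -d)
cd "$demo" || exit 1
seq 1 9  > methane.pdb
seq 1 12 > ethane.pdb
seq 1 20 > cubane.pdb

# Step by step, with a temporary file between each stage.
wc -l *.pdb > lengths
sort -n lengths > sorted-lengths
via_files=$(head -1 sorted-lengths)

# The same three stages joined by pipes, with no temporary files.
via_pipe=$(wc -l *.pdb | sort -n | head -1)

echo "with files: $via_files"
echo "with pipes: $via_pipe"
```

Both lines should name methane.pdb, the shortest of the three stand-in files.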
Every time we run a command, the computer creates a process in memory to do the work. Every process has an input channel called standard input, stdin, and also an output channel called standard output, stdout. Normally, the input channel for the shell is the keyboard, and the output channel is the screen. The >, >>, < and | characters on the command line tell the shell to get the input from (<, |) or send the output to (>, >>, |) someplace else.
When we run wc -l *.pdb > lengths, the computer creates a new process. Then wc reads from the input files (given in the command). We've also used >, so instead of going to the default stdout (the screen), the output goes to the file we specify.
When we run a pipe, like wc -l *.pdb | sort, the shell creates two processes (one for each command in the pipe) so that wc and sort run simultaneously. The standard output of the first process (wc) is fed directly to the standard input of the next (sort). This can continue down a series of pipes. With the command wc -l *.pdb | sort | head -1, we get three processes with data flowing from the files, through wc to sort, and from sort through head to the screen.
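One way to see the difference between a file named on the command line and data arriving on standard input is to give wc the same content three ways. This is a sketch on a hypothetical three-line file, atoms.txt. When wc opens the file itself it knows the name and prints it; when the shell connects the file or another command to stdin, wc only sees a stream of lines.

```shell
# Hypothetical scratch directory and three-line input file.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'H\nC\nO\n' > atoms.txt

from_arg=$(wc -l atoms.txt)          # wc opens the file itself, so it prints the name
from_stdin=$(wc -l < atoms.txt)      # the shell attaches the file to wc's stdin
from_pipe=$(cat atoms.txt | wc -l)   # cat's stdout becomes wc's stdin

echo "file as argument: $from_arg"
echo "file via <:       $from_stdin"
echo "file via |:       $from_pipe"
```

The count is the same each time; only the first form includes the file name in its output.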
Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read from standard input, do something with what they've read, and write to standard output.
This simple idea is why Unix has been so successful. The key is that any program that reads lines of text from standard input and writes lines of text to standard output can be combined with every other program that behaves this way. You can and should write your programs this way so that you and other people can put those programs into pipes to multiply their power.
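As an illustration of writing your own pipe-friendly program, here is a minimal sketch of a filter written as a shell function. The name count_below and its behaviour are made up for this example: it reads lines of the form "count name" from standard input and writes only the lines whose count is below a limit to standard output. Because it sticks to stdin and stdout, it drops into a pipeline next to sort and head.

```shell
# Hypothetical filter: pass through only lines whose first field is below a limit.
count_below() {
    limit=$1
    # read splits each input line into the count and the rest of the line.
    while read -r count name; do
        if [ "$count" -lt "$limit" ]; then
            printf '%s %s\n' "$count" "$name"
        fi
    done
}

# Made-up input in the same shape as the output of wc -l.
result=$(printf '20 cubane.pdb\n12 ethane.pdb\n9 methane.pdb\n' \
    | count_below 15 | sort -n | head -1)

echo "$result"   # 9 methane.pdb
```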