Pipes and Filters

Simple things that work well together.

Objectives

Get the data

We're going to start turning these building blocks into power tools. To see how much you can do, we need a real problem to work on. Let's get some data!

Download the data into your bootcamp directory.

You can use your desktop tools to unzip it, or try out the shell command unzip.

    unzip shelldata.zip

Once you have the data, take a look at the files with cd, ls and cat...

Ack. Sometimes we want to look at files that are long without them scrolling off the screen. Try out one or more of these. Discuss with your neighbors.

head -5 <filename>
head -50 <filename>
tail -5 <filename>
tail -20 <filename>
more <filename>   # use q to exit
less <filename>   # use q to exit
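If you'd like to try head and tail on a file whose contents you already know, here is a small sketch. It assumes the seq command is available to generate the numbers, and numbers.txt is just a made-up file name in a throwaway directory. head -n 5 is the longer spelling of head -5.

```shell
# Work in a throwaway directory so we don't clutter the bootcamp directory.
scratch=$(mktemp -d)
cd "$scratch"

# Build a 100-line practice file: one number per line, 1 through 100.
seq 1 100 > numbers.txt

head -n 5 numbers.txt    # the first five lines: 1 to 5
tail -n 5 numbers.txt    # the last five lines: 96 to 100
```

Because each line is its own line number, it's easy to check that head and tail show the part of the file you expected.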

We'll start with the data in data/molecules: structure data for some organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule. We're going to use the wc command ('word count') to count the lines in each file.

Use cd to go into data/molecules and run the command wc *.pdb.

cd molecules
wc *.pdb
  20  156 1158 cubane.pdb
  12   84  622 ethane.pdb
   9   57  422 methane.pdb
  30  246 1828 octane.pdb
  21  165 1226 pentane.pdb
  15  111  825 propane.pdb
 107  819 6081 total

Wildcards

* is a wildcard. It matches any set of characters, or none.

? is also a wildcard, but it matches a single character of any kind.

[a-c] matches a single character: a or b or c.

Commands like wc and ls never see the wildcard characters, just what those wildcards matched. This is another example of orthogonal design.

When we run wc *.pdb, the shell first expands *.pdb into a complete list of the .pdb files. We can see the list by typing echo *.pdb. The shell then runs wc once, passing it that list of file names, and wc prints the result: the lines, words and characters in each file.
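One way to convince yourself that the shell does the expanding is to compare a bare pattern with a quoted one. This is only a sketch: the file names are made up, and it uses single quotes, which tell the shell to leave the text inside them alone.

```shell
# Work in a throwaway directory with a few empty, made-up files.
scratch=$(mktemp -d)
cd "$scratch"
touch ethane.pdb methane.pdb notes.txt

echo *.pdb      # the shell swaps in the matching names, so echo prints them
echo '*.pdb'    # quoted: no expansion, so echo receives and prints the literal *.pdb
```

In both cases echo simply prints whatever arguments it was handed. The only thing that changes is what the shell did to the command line before echo started.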

Wildcard practice

*.pdb         # ends with .pdb
p*.pdb        # starts with p, ends with .pdb
p*.           # starts with p, ends with .
[pe]*.pdb     # starts with p or e, ends with .pdb
p?.pdb        # pi.pdb, p5.pdb...
p*.p?*        # p(any).p(one)(any) -- matches plum.pi or p.print, but not p.list or plum.p
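A safe way to practice is to create empty files with the names you want to test and see which ones a pattern picks up. The sketch below uses the four example names from the last pattern above, in a throwaway directory; touch is used here only to create the empty files.

```shell
# Work in a throwaway directory holding just the four example names.
scratch=$(mktemp -d)
cd "$scratch"
touch plum.pi p.print p.list plum.p

# Only names with ".p" followed by at least one more character match.
echo p*.p?*
ls p*.p?*       # ls is handed the same expanded list
```

Swap in your own file names and patterns to check your answers before trusting a wildcard with real data.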

Challenge:

Use wildcards to list files that

  1. begin with e
  2. begin with s or t
  3. contain 'ane' somewhere

We still want to know the number of lines in the files, and in particular which file is the shortest. wc takes the option -l to show just the number of lines, but a long list would be tedious to scan by eye for the smallest number. How can we make the computer do this?

Our first step toward a solution is to run the command:

$ wc -l *.pdb > lengths

The > tells the shell to redirect the command's output to a file instead of printing it to the screen. The shell will create the file if it doesn't exist, or overwrite it if it does.

This is why there is no screen output: the wc output has gone into the file lengths instead.
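Since > overwrites without asking, it's worth seeing that happen once on a file that doesn't matter. A minimal sketch, where note.txt is a made-up file in a throwaway directory:

```shell
# Work in a throwaway directory so nothing real gets overwritten.
scratch=$(mktemp -d)
cd "$scratch"

echo "first" > note.txt     # note.txt doesn't exist yet, so the shell creates it
echo "second" > note.txt    # note.txt exists now, so its old contents are replaced
cat note.txt                # only the second line is left
```

Neither echo command printed anything to the screen: both times the output went into the file.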

$ ls lengths
lengths
$ head lengths

Great! Now that the lengths are in a file, we can sort them:

$ sort lengths

Oh. The output went to the screen, and the lengths file hasn't changed. Let's capture the output in another file:

$ sort lengths > sorted-lengths

Then we can use head -1 to get the shortest file:

$ head -1 sorted-lengths

That works, but it took three commands and left two intermediate files behind. Fortunately, the shell gives us a tool for combining these commands. It's |, and it's called a pipe. It tells the shell to connect the output of one command to the input of the next, creating a pipeline that our data can flow through, with different processing at each stage. We can use it to combine the three commands wc, sort, and head in one line:

$ wc -l *.pdb | sort | head -1
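To check that the pipeline really does the same job as the step-by-step version, we can run both on files whose lengths we know. This is a sketch with assumptions: the three .pdb files are made-up stand-ins, not the real molecule data, and it uses sort -n and head -n 1 in place of sort and head -1. The -n option makes sort compare the counts as numbers rather than as text, which doesn't depend on how wc pads its output.

```shell
# Work in a throwaway directory with three made-up files of 3, 1 and 2 lines.
scratch=$(mktemp -d)
cd "$scratch"
printf '1\n2\n3\n' > long.pdb
printf '1\n'       > short.pdb
printf '1\n2\n'    > medium.pdb

# Step by step, leaving intermediate files behind:
wc -l *.pdb > lengths
sort -n lengths > sorted-lengths
head -n 1 sorted-lengths > stepwise-result

# The same work as one pipeline, with no intermediate files:
wc -l *.pdb | sort -n | head -n 1 > piped-result

cat piped-result    # the line for short.pdb
```

The two result files hold the same line, but the pipeline didn't need lengths or sorted-lengths to get there.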

How pipes work

Every time we run a command, the computer creates a process in memory to do the work. Every process has an input channel called standard input (stdin) and an output channel called standard output (stdout). Normally, the input channel is the keyboard, and the output channel is the screen. The >, >>, < and | characters on the command line tell the shell to get the input from (<, |) or send the output to (>, >>, |) someplace else.
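Here is a small sketch of < and >> in action, using a made-up file called letters.txt in a throwaway directory. Notice the difference between giving wc a file name and feeding it the same file through standard input.

```shell
# Work in a throwaway directory with a three-line practice file.
scratch=$(mktemp -d)
cd "$scratch"
printf 'a\nb\nc\n' > letters.txt

wc -l letters.txt         # wc opens the file itself: prints the count and the file name
wc -l < letters.txt       # the shell connects the file to stdin: wc prints only the count

echo "d" >> letters.txt   # >> adds to the end of the file instead of overwriting it
wc -l < letters.txt       # the count has gone up by one
```

In the second form wc never learns the file name. It just reads whatever arrives on standard input, which is exactly what lets it sit in the middle of a pipe.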

[Figure: a shell process]

When we run wc -l *.pdb > lengths, the computer creates a new process. Then wc reads from the input files (given in the command). We've also used >, so instead of sending the output to the default stdout (the screen), the shell sends it to the file we specify.

When we run a pipe, like wc -l *.pdb | sort, the shell creates two processes (one for each command in the pipe) so that wc and sort run simultaneously. The standard output of the first process (wc) is fed directly to the standard input of the next (sort). This can continue down a series of pipes. With the command wc -l *.pdb | sort | head -1, we get three processes with data flowing from the files, through wc to sort, and from sort through head to the screen.

[Figure: a shell pipe]

Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read from standard input, do something with what they've read, and write to standard output.

This simple idea is a large part of why Unix has been so successful. The key is that any program that reads lines of text from standard input and writes lines of text to standard output can be combined with every other program that behaves this way. You can and should write your programs this way so that you and other people can put those programs into pipes to multiply their power.
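As a sketch of what writing your programs this way can look like, here is a tiny filter written as a shell function. The name shortest is made up for this example, and the two .pdb files are stand-ins created on the spot. The function reads lines from standard input and writes one line to standard output, so it drops into a pipe like any built-in tool.

```shell
# shortest: a hypothetical filter that keeps only the numerically smallest line.
# It reads standard input and writes standard output, nothing else.
shortest() {
    sort -n | head -n 1
}

# Try it in a throwaway directory with two made-up files of 3 lines and 1 line.
scratch=$(mktemp -d)
cd "$scratch"
printf '1\n2\n3\n' > long.pdb
printf '1\n'       > short.pdb

wc -l *.pdb | shortest    # the line for short.pdb
```

Because shortest knows nothing about wc or .pdb files, it works on any numeric lines you send it, whichever command produced them.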
