Nelle Nemo, a marine biologist, has just returned from a six-month survey of the North Pacific Gyre, where she has been sampling gelatinous marine life in the Great Pacific Garbage Patch. She has 1520 samples in all, and now needs to:

1. Run each sample through an assay machine that will measure the abundance of 300 different proteins. The machine's output for a single sample is a file with one line for each protein.
2. Calculate statistics for each of the proteins separately, using a program her supervisor wrote called goostats.
3. Compare the statistics for each protein with the corresponding statistics for every other protein, using a second program called goodiff.
4. Write up her results before her paper deadline.
It takes about half an hour for the assay machine to process each sample. The good news is that it only takes two minutes to set each one up. Since her lab has eight assay machines that she can use in parallel, this step will "only" take about two weeks of working days (1520 samples at roughly half an hour each, spread across eight machines, is about a hundred hours of machine time).
The bad news is that if she has to run goostats and goodiff by hand, she'll have to enter filenames and click "OK" 46,370 times (1520 runs of goostats, plus 300×299/2 runs of goodiff to compare every protein against every other). At 30 seconds each, that will take more than two weeks. Not only would she miss her paper deadline, the chances of her typing all of those commands right are practically zero.
The next few lessons will explore what she should do instead. More specifically, they explain how she can use a command shell to automate the repetitive steps in her processing pipeline so that her computer can work 24 hours a day while she writes her paper. As a bonus, once she has put a processing pipeline together, she will be able to use it again whenever she collects more data.
Knowing just this much about files and directories, Nelle is ready to organize the files that the protein assay machine will create. First, she creates a directory called north-pacific-gyre (to remind herself where the data came from). Inside that, she creates a directory called 2012-07-03, which is the date she started processing the samples. She used to use names like conference-paper and revised-results, but she found them hard to understand after a couple of years. (The final straw was when she found herself creating a directory called revised-revised-results-3.)
Nelle names her directories "year-month-day", with leading zeroes for months and days, because the shell displays file and directory names in alphabetical order. If she used month names, December would come before July; if she didn't use leading zeroes, November ('11') would come before July ('7').
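For example, given three hypothetical sampling-date directories named this way, ls lists them in chronological order, because alphabetical order and chronological order coincide:

$ ls
2012-07-03  2012-11-01  2012-12-12

Without the leading zeroes, alphabetical order breaks the chronology, since '1' sorts before '7':

$ ls
2012-11-1  2012-12-12  2012-7-3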
Each of her physical samples is labelled according to her lab's convention with a unique ten-character ID, such as "NENE01729A". This is what she used in her collection log to record the location, time, depth, and other characteristics of the sample, so she decides to use it as part of each data file's name. Since the assay machine's output is plain text, she will call her files NENE01729A.txt, NENE01812A.txt, and so on. All 1520 files will go into the same directory.
If she is in her home directory, Nelle can see what files she has using the command:
$ ls north-pacific-gyre/2012-07-03/
She can use the tab key to cut down on typing: when she starts typing a file or directory name and presses tab, the shell fills in the rest of the name if there is only one matching possibility.
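For example (assuming nothing else in her home directory starts with "nor"), if she types:

$ ls nor

and then presses tab, the shell completes the directory name for her:

$ ls north-pacific-gyre/

Pressing tab again adds 2012-07-03/ as well, since so far that is the only entry inside north-pacific-gyre.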
Nelle has run her samples through the assay machines and created 1520 files in the north-pacific-gyre/2012-07-03 directory described earlier. As a quick sanity check, she types:

$ cd north-pacific-gyre/2012-07-03
$ wc -l *.txt
The output is 1520 lines that look like this:
300 NENE01729A.txt
300 NENE01729B.txt
300 NENE01736A.txt
300 NENE01751A.txt
300 NENE01751B.txt
300 NENE01812A.txt
... ...
Now she types this:
$ wc -l *.txt | sort -n | head -5
240 NENE02018B.txt
300 NENE01729A.txt
300 NENE01729B.txt
300 NENE01736A.txt
300 NENE01751A.txt
Whoops: one of the files is 60 lines shorter than the others. When she goes back and checks it, she sees that she ran that assay at 8:00 on a Monday morning; someone was probably using the machine over the weekend, and she forgot to reset it. Before re-running that sample, she checks to see if any files have too much data:
$ wc -l *.txt | sort -n | tail -5
300 NENE02040A.txt
300 NENE02040B.txt
300 NENE02040Z.txt
300 NENE02043A.txt
300 NENE02043B.txt
Those numbers look good—but what's that 'Z' doing there in the third-to-last line? All of her samples should be marked 'A' or 'B'; by convention, her lab uses 'Z' to indicate samples with missing information. To find others like it, she does this:
$ ls *Z.txt
NENE01971Z.txt NENE02040Z.txt
Sure enough, when she checks the log on her laptop, there's no depth recorded for either of those samples. Since it's too late to get the information any other way, she must exclude those two files from her analysis. She could just delete them using rm, but there are actually some analyses she might do later where depth doesn't matter, so instead she'll just be careful later on to select files using the wildcard expression *[AB].txt. As always, the '*' matches any number of characters; the expression [AB] matches either an 'A' or a 'B', so this matches all the valid data files she has.
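As a quick check that the wildcard does what she intends, she could count the files it matches; since only the two 'Z' files are invalid, the count should be 1518 rather than 1520:

$ ls *[AB].txt | wc -l
1518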
Nelle is now ready to process her data files. Since she's still learning how to use the shell, she decides to build up the required commands in stages.
She needs to tell the shell to do something over and over again with a different file each time. Wildcards and tab completion won't do this, so she decides to develop a script with a loop. She's seen an example that displays the first three lines of each file in turn:
$ for filename in *.dat
> do
> head -3 $filename
> done
She decides to build on it. Her first step is to make sure that she can select the right files—remember, these are ones whose names end in 'A' or 'B', rather than 'Z':
$ cd north-pacific-gyre/2012-07-03
$ for datafile in *[AB].txt
do
echo $datafile
done
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
...
NENE02043A.txt
NENE02043B.txt
Her next step is to decide what to call the files that the goostats analysis program will create. Prefixing each input file's name with "stats" seems simple, so she modifies her loop to do that:
$ for datafile in *[AB].txt
do
echo $datafile stats-$datafile
done
NENE01729A.txt stats-NENE01729A.txt
NENE01729B.txt stats-NENE01729B.txt
NENE01736A.txt stats-NENE01736A.txt
...
NENE02043A.txt stats-NENE02043A.txt
NENE02043B.txt stats-NENE02043B.txt
She hasn't actually run goostats yet, but now she's sure she can select the right files and generate the right output filenames.
Typing in commands over and over again is becoming tedious, though, and Nelle is worried about making mistakes, so instead of re-entering her loop, she presses the up arrow. In response, the shell redisplays the whole loop on one line (using semi-colons to separate the pieces):
$ for datafile in *[AB].txt; do echo $datafile stats-$datafile; done
Using the left arrow key, Nelle backs up and changes the command echo to bash goostats:

$ for datafile in *[AB].txt; do bash goostats $datafile stats-$datafile; done
When she presses enter, the shell runs the modified command. However, nothing appears to happen—there is no output. After a moment, Nelle realizes that since her script doesn't print anything to the screen any longer, she has no idea whether it is running, much less how quickly. She kills the job by typing Control-C, uses up-arrow to repeat the command, and edits it to read:
$ for datafile in *[AB].txt; do echo $datafile; bash goostats $datafile stats-$datafile; done
Beginning and End
We can move to the beginning of a line in the shell by typing ^A (which means Control-A) and to the end using ^E.
When she runs her program now, it produces one line of output every five seconds or so:
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
...
1518 files (all 1520, minus the two 'Z' files) times 5 seconds, divided by 60, tells her that her script will take about two hours to run; the shell could even do that arithmetic for her, as the aside below shows. As a final check, she opens another terminal window, goes into north-pacific-gyre/2012-07-03, and uses cat stats-NENE01729B.txt to examine one of the output files. It looks good, so she decides to get some coffee and catch up on her reading.
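The aside promised above: the shell's arithmetic expansion, $((...)), can evaluate that estimate directly (it uses integer division, so the result is rounded down):

$ echo $((1518 * 5 / 60)) minutes
126 minutes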
Those Who Know History Can Choose to Repeat It
Another way to repeat previous work is to use the history command to get a list of the last few hundred commands that have been executed, and then to use !123 (where "123" is replaced by the command number) to repeat one of those commands. For example, if Nelle types this:

$ history | tail -5
  456  ls -l NENE0*.txt
  457  rm stats-NENE01729B.txt.txt
  458  bash goostats NENE01729B.txt stats-NENE01729B.txt
  459  ls -l NENE0*.txt
  460  history

then she can re-run goostats on NENE01729B.txt simply by typing !458.
An off-hand comment from her supervisor has made Nelle realize that she should have provided a couple of extra parameters to goostats when she processed her files. This might have been a disaster if she had done all the analysis by hand, but thanks to for loops, it will only take a couple of hours to re-do.
But experience has taught her that if something needs to be done twice, it will probably need to be done a third or fourth time as well. She runs the editor and writes the following:
# Calculate reduced stats for data files at J = 100 c/bp.
for datafile in $*
do
    echo $datafile
    bash goostats -J 100 -r $datafile stats-$datafile
done
(The parameters -J 100 and -r are the ones her supervisor said she should have used.) She adds a comment line starting with # at the top to help her remember what this script does.
She saves this in a file called do-stats.sh so that she can now re-do the first stage of her analysis by typing:
$ bash do-stats.sh *[AB].txt
She can also do this:
$ bash do-stats.sh *[AB].txt | wc -l
so that the output is just the number of files processed rather than the names of the files that were processed.
One thing to note about Nelle's script is that it lets the person running it decide what files to process. She could have written it as:
# Calculate reduced stats for Site A and Site B data files at J = 100 c/bp.
for datafile in *[AB].txt
do
    echo $datafile
    bash goostats -J 100 -r $datafile stats-$datafile
done
The advantage is that this always selects the right files: she doesn't have to remember to exclude the 'Z' files. The disadvantage is that it always selects just those files—she can't run it on all files (including the 'Z' files), or on the 'G' or 'H' files her colleagues in Antarctica are producing, without editing the script. If she wanted to be more adventurous, she could modify her script to check for command-line parameters, and use *[AB].txt if none were provided. Of course, this introduces another tradeoff between flexibility and complexity.
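A minimal sketch of that more adventurous version, reusing the goostats invocation from her script above: if no filenames are supplied, set -- replaces the empty parameter list with the files matching *[AB].txt before the loop runs.

# Calculate reduced stats for data files at J = 100 c/bp.
# Default to the Site A and Site B files when no filenames are given.
if [ $# -eq 0 ]
then
    set -- *[AB].txt
fi
for datafile in "$@"    # "$@" is a safer, quoted form of $*
do
    echo $datafile
    bash goostats -J 100 -r $datafile stats-$datafile
done

Run as bash do-stats.sh with no arguments, this behaves like the version that hard-codes *[AB].txt; given explicit filenames, it processes just those.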