This version of the Etherpad has been reorganized slightly so that notes are in the order of the workshop (day 1 first, then day 2). If you're confused and would like to revert to the way the Etherpad was at the end of the workshop, click here: 
https://etherpad.mozilla.org/ep/pad/view/2015-02-05-toronto/SKn551h3Iz


General notes:
 

Workshop page:
https://swcarpentry.github.io/2015-02-05-toronto/

http://exple.tive.org/blarg/2013/10/22/citation-needed/

Copy-pasting in Git Bash

If you're using Windows and are having trouble copy-pasting stuff in Git Bash, try the options listed here: https://stackoverflow.com/questions/2304372/how-do-you-copy-and-paste-into-git-bash


 


Day 1

The Unix Shell

Files for "automating with the Unix Shell" session: https://bit.ly/swcbash

- command line still useful for repetitive tasks (easier and faster), and can reproduce/reuse the scripts more easily than using a GUI (graphical user interface)
- shell = a program that interprets what we tell the computer to do and what the computer sends back, e.g. whoami - the shell interpreted the command, sent it to the computer, and interpreted the reply
- the Windows cmd-line interface doesn't share many of the same commands as bash, and bash is more universal, so we're using bash
- echo $SHELL - displays which shell you're currently using
- case matters!
- if command doesn't exist, get a response "command not found"

- if you want to act on files or folders, you have to move to where they are
- with a GUI you can see more easily where you are, but in bash use pwd, which means print working directory
- cd = change directory
- cd .. = return to previous directory
- ls = list contents of directory
    flags:
        -F = marks directories (adds a trailing / to directory names)
        -a = shows all files, including hidden ones
        -l = long format (one entry per line, with details)

(Windows tip: can embiggen the window and font by left-clicking on the upper-left corner and choosing properties)

- whenever you give a command, you can give it options - known as "flags" in bash, e.g. ls -F, which shows which are directories and which are files

- on Mac & Linux, can use man __ where __ is the command you want to know about to get details on it. On Windows, have to google OR do __ --help where ___ is the command you want to know about (sadly, not all commands have a --help option)

- mkdir = create directory. Won't get a response from bash; it just does it
- nano *.txt = either open the file with any legal name and the extension .txt in nano, or if the file doesn't exist, create a file with that name
- cat = shows you what's in the file
- rm ___ = delete that file. rm * = deletes all files in the current directory (but not directories). rm *.txt = deletes only the .txt files. Be careful: no undo! rm only works on files; for directories you have to use rmdir ___ where ___ is the name of the directory, and that only works if the directory is empty (protection to make sure you don't accidentally delete files). See the example session at the end of this list.
- if you have multiple flags, you can run them together, e.g. -ir, or separately, e.g. -i -r
rm -r = removes a directory and everything nested inside it (unlike rmdir, which insists the directory be empty)
    -i = prompt before each removal
mv = to rename a file or move it

. = here (the current directory)
e.g. cd ../molecules/. moves you up one directory, then into the molecules directory there
cp = to create a copy
    -r copy nested directories too
    * copy everything
wc = to get the number of lines, words and characters in a file
wc * = to get the number of lines, words and characters for multiple files
    -l = just the number of lines
    > ___.txt = redirect that output into a file named ___.txt
sort -r reverses the sort order
sort lengths.txt > sorted-lengths.txt = sort lengths and save it to a new file called sorted-lengths.txt
head -1 *.* = print the first line of file *.*
sort lengths.txt | head -1 = sort lengths and then return the first line. | "pipes" the results of one function to another
wc -l * | sort | head -1 = does a bunch of the above commands "on the fly" without creating a file
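Putting some of these commands together - a hypothetical session (filenames made up for illustration; exact error messages vary by system):

$ mkdir thesis
$ nano thesis/draft.txt          # create and edit a file
$ rm thesis                      # rm alone refuses to remove a directory
rm: cannot remove 'thesis': Is a directory
$ rm -i thesis/draft.txt         # -i asks before deleting
rm: remove regular file 'thesis/draft.txt'? y
$ rmdir thesis                   # works now that the directory is empty
$ mv old.txt new.txt             # rename a file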
BREAK

If you lose the $ prompt, you can use CTRL-C to kill the running process
cp *.dat backup-*.dat = gives you an error (wildcards don't work like that in the target name; you need a loop instead)
- want to know what's happening? Use the echo command
- refer to variables by preceding them with a $
- if you type for filename in * and press enter, you see the > symbol, which indicates bash is still expecting input
- syntax for...in...do e.g. 
        for filename in * 
        do 
        echo $filename 
        done
            OR 
        for filename in *; do echo $filename; done
- using a variable:
    for filename in basilisk.dat unicorn.dat
    do
    cp $filename backup-$filename
    done
- rm backup-*
- another example:

    for filename in *
    do
    echo $filename
    wc -l $filename > $filename-length.txt
    echo "done processing"
    done


To get the first three lines of each file:

 

for filename in *
do 
echo $filename
head -3 $filename
echo "done processing"
done


- can create ___.sh files in nano and run them by typing bash ___.sh where ___ is the filename
- can't run it on an alternate directory, e.g. bash gethead.sh ../molecules/, since the script only acts on whatever's hard-coded in the .sh file. Within the .sh file, you'd need something like for filename in $1, which replaces $1 with the first filename you give it, e.g. bash gethead.sh basilisk.dat. To do it with EVERY file you pass in, replace the $1 with $* (see the sketch below)
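A sketch of what gethead.sh might look like (the actual workshop file wasn't pasted into the pad), using $* so it loops over every filename passed on the command line:

# gethead.sh: print the name and the first three lines of every file given as an argument
for filename in $*
do
  echo $filename
  head -3 $filename
  echo "done processing"
done

# usage, e.g.:
# bash gethead.sh *.dat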

comments in bash begin with # 
- history gives you your history; could do something like history | tail -3 > history_stuff.sh
- !___ where ___ is the history line number -> reruns that line

LUNCH

A short reference of bash commands: http://software-carpentry.org/v5/novice/ref/01-shell.html

You can also review today's material by going over the whole lesson: http://software-carpentry.org/v5/novice/shell/index.html


 


Programming with R


R is a programming language designed to do statistics

----------------------------------------------------------------------
Data files: http://files.software-carpentry.org/r-novice-bio_data.zip
----------------------------------------------------------------------

getwd() -> prints the current working directory (like pwd in bash)

setwd('some/path') -> changes the current working directory (like cd in bash)

list.files() -> prints a list of files in the current working directory (like ls in bash)

read.csv(file='filename')  -> loads a CSV file (comma-separated values)

head(read.csv(file='filename')) -> only prints the first few lines of that file

To get help on a function:
?functionname, e.g. ?head


Assign a value to a variable:
weight_kg <- 55

Access the value of a variable:
weight_kg       ----> you don't need to put a $ in front like in bash

Operations on variables:
weight_lb = weight_kg * 2.2

Warning: if you update the value of weight_kg after the above operation, it will not update weight_lb unless you rerun weight_lb = weight_kg * 2.2. It's not like in Excel with cell references.
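A minimal sketch of that warning in action:

weight_kg <- 55
weight_lb <- weight_kg * 2.2   # weight_lb is now 121
weight_kg <- 100
weight_lb                      # still 121 - it did not update
weight_lb <- weight_kg * 2.2   # rerun the operation...
weight_lb                      # ...now it's 220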

Assigning the contents of a file to a variable:
dat <- read.csv(file='filename')
---> dat will be in the format of a data frame

A data frame organises data much like an Excel spreadsheet, in lines and columns (2 dimensions)

names(dat) -> prints the names of the columns (R uses the header or the first line of the CSV to populate those names)
dat$ColumnName -> accesses the data in the column 'ColumnName'

Get a specific line from the data frame:
dat$ColumnName[linenumber]

To get the value of a cell on the 35th line and the 3rd column:
dat[35,3]

c operator -> concatenate

c(value1, value2, value3) -> creates a 1-dimensional array (a vector) of those values


Logical operations on data sets:
Example
x <- c(5,4,7,10,15,13,17)

x > 10
prints out
FALSE FALSE FALSE FALSE TRUE TRUE TRUE

To get only the values that satisfy the operation:
x[x>10] prints out 15, 13, 17

CHALLENGE:
element <- c('o','x','y','g','e','n')
#first 3 characters
    Solution:
    element[c(1,2,3)]
    or
    element[1:3]
    or
    head(element, -3)
    or
    element[c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)]

#Last 3 characters
    element[4:6]

what is element[-1]? -> everything except the first element (negative indices drop elements)

You can omit one index:
dat[,3] -> all the rows in the 3rd column


The way R stores data in a data frame is a bit special. It uses *factors*:
- a factor lists the possible values that the data can take (called *levels*)
- each entry is stored as a reference to one of those levels

levels(dat$Column) -> lists the possible values in that column
e.g.
levels(dat$Gender)
[1] "F" "M" "f" "m"

food <-factor(c("low", "high", "medium", "medium", "high"))

I can specify the order of the factors:
food <-factor(c("low", "high", "medium", "medium", "high"),ordered=TRUE,levels=c("low", "medium", "high"))

min(food)
    -> prints "low"
    
Warning! If you are dealing with factors and want to treat them as numbers, strange stuff will happen, because factors are stored internally as integer codes (see the sketch below).
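A minimal sketch of the pitfall (values made up for illustration), for when a column of numbers has been read in as a factor:

f <- factor(c("10", "20", "20", "40"))
as.numeric(f)                 # prints 1 2 2 3 - the internal level codes, not the numbers!
as.numeric(as.character(f))   # prints 10 20 20 40 - convert to character first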


table(dat$Column) prints the number of times each factor level is present
barplot(table(dat$Column)) displays a bar plot using those numbers

Assignments can be mixed with indexing operators:
x[x>10] <- 20
        ---> will replace all values of x that are greater than 10 with the value 20


Challenge

Combine indexing and assignment to correct the Gender column so all values are uppercase
dat$Gender[dat$Gender == 'f'] <- 'F'
dat$Gender[dat$Gender == 'm'] <- 'M'

Problem: R has not updated the levels; it still thinks the Gender values could be either F, f, M or m.
If you try barplot(table(dat$Gender))
--> you get a plot with empty "f" and "m" bars

To update the levels:
dat$Gender <- factor(dat$Gender)

Other approach, update the levels directly:
levels(dat$Gender) [2] <- 'F'
                                 ^--- this indexes the second item in the levels, which is 'f'
This approach has the advantage of updating the levels at the same time.


Get a quick summary of a series of columns:
summary(dat[,6:9])    ---> only selects the columns 6 to 9 of our data

To repeat an operation on a series of columns, we use *apply*

?apply to get more info

apply(dat[,6:9], 2, mean)
# dat[,6:9] - apply the function to columns 6 to 9 of our data
# 2         - apply the function column-wise (1 would be row-wise)
# mean      - the name of the function I want to run

max_aneurism_by_patient <- apply(dat[,6:9],1,max)

To plot this data: plot(max_aneurism_by_patient) --> scatterplot by default

We can define what data to be plotted on each axis. Eg. to plot blood pressure in relation to the max aneurism by patient:
plot(x<-dat$BloodPressure, y<-max_aneurism_by_patients)

R will attempt to determine what's the best plot to display based on the nature of the data.
For example, if we want to plot the max aneurism by patient in relation to the patients' group (control, treatment 1, treatment 2):

plot(x<-dat$Group, y<-max_aneurism_by_patients)

R will display a box plot


To define a function:


fahr_to_kelvin <- function(temp)
{
  # Converts temp in Fahrenheit to Kelvin  <-- remember to add comments!!
  kelvin <- ((temp - 32) * (5/9)) + 273.15
  return(kelvin)
}


To call a function:

fahr_to_kelvin(32)


Trying to call a function without the brackets will display the code for this function. Try inputting

apply

without the brackets --> it will display the code of the apply function.


CHALLENGE
Answer:


fence <- function(wrapper,origin)
{
  # don't forget to comment it
  output <- c(wrapper, origin, wrapper)
  return(output)
}


A function can either be called by passing the arguments in the order they are expected
> fence("***", "My text")
[1] "***"     "My text" "***"  
or by explicitly setting them by name:
> fence(origin="my text", wrapper="**")
[1] "**"      "my text" "**"     


A default value can be specified when defining the function. e.g.
center <- function(data, desired=0)
{
    new_data<- (data-mean(data)) + desired
    return(new_data)
}

---> if the value for "desired" is omitted when calling the center function, the default value of 0 will be used.

> center(c(1,2,3))
[1] -1  0  1
> center(c(1,2,3),100)
[1]  99 100 101


analyse <- function(fname)
{
  dat<-read.csv(fname,header=TRUE)
  aneurisms_per_patient<-apply(dat[,6:9],1,max)
  plot(x=dat$Group, y=aneurisms_per_patient)
}


For loops in R:


for(value in c('Site-01.csv', 'Site-02.csv', 'Site-03.csv'))
{
    analyse(value)
}


If you want to keep learning, I suggest this free course from Code School: http://tryr.codeschool.com/ Have Fun!!

 



Day 2

REGULAR EXPRESSIONS

Please download http://files.software-carpentry.org/bibliography.csv

Please go to http://regexpal.com/ and copy the lines below into the bottom text area.

http://www.asciitable.com/

Baker 1 2009-11-17      1223.0
Baker 1 2010-06-24      1122.7
Baker 2 2010-08-25      2971.6
Baker 1 2011-01-05      1410.0
Baker 2 2010-09-04      4671.6
Davison/May 23, 2010/1724.7
Pertwee/May 24, 2010/2103.8
Davison/June 19, 2010/1731.9
Davison/July 6, 2010/2010.7
Pertwee/Sept 3, 2010/4981.0

A regular expression is a pattern that matches some text
[] = match any one of the characters inside the brackets, e.g. [0123456789] would match any digit, as would [0-9]
If you're matching, you need to know what language you're using (problem with languages like Turkish - do research)
e = a plain character matches itself, each occurrence separately
__+ = match one or more of __, e.g. e+ matches e and ee, and [0-9]+ matches 1 as well as 122354

Regular expressions are used EVERYWHERE. Google basically does really sophisticated regex searches and sells us the results
Regex are (a) a lot faster and (b) portable between programming languages and platforms. You can even use it in MS Word to do a search-and-replace
[0-9]+-[0-9]+-[0-9]+ would give you ISO-formatted dates but also phone numbers
{n} where n is a number = exactly n repetitions of what precedes it, e.g. [0-9]{4}-[0-9]{2}-[0-9]{2} would identify 2009-11-17 but not 416-999-6647
    {n,} means n repetitions or more
Pull all dates in a May 23, 2010 format: [A-Za-z]+ [0-9]+, [0-9]{4}
Regular expressions can't understand semantics, so they can't tell that May 99, 9999 is an invalid date. You can only pull the patterns and then validate them in a programming language (see the Python sketch after these notes)
Must use [A-Za-z] instead of [A-z], since there are a number of symbols between the upper- and lower-case alphabets in ASCII which would otherwise get included in the match: http://www.asciitable.com/
| = OR
    ak would only pull those two letters together, in order. a|k lets you look for a OR k
    May|June|July will look for May OR June OR July. Letters together are evaluated together before OR
Get both kinds of dates: 
([0-9]{4}-[0-9]{2}-[0-9]{2})|([A-Z][a-z]+ [0-9]+, [0-9]{4})
For readability, at less than a gig of data, you'd break this into two separate regular expressions. Beyond that, an optimized (i.e. one-line) expression is worth the loss of readability
? = whatever's right in front of it is optional e.g. ak? means must have a but may have k
^ matches the place where a line starts (in the test tool, check the box that makes ^$ match at line breaks)
^[A-Z][a-z]+( [0-9][0-9])? would give you the names plus the optional numbers
^ *[A-Z][a-z]+( [0-9][0-9])? - the initial space-asterisk combo also tolerates zero or more leading spaces
\d means not just [0-9] but "my language's characters for digits", i.e. it will work with any language set
\w = word characters (letters, digits and the underscore)
. matches any single character. If you want to match an actual period, you need to write \.
\ means "take whatever follows me literally", e.g. \[ means actually look for a left square bracket
The more general we make them, the slower they will be to run
Note to self: this didn't work? (\d+\.\d+*)|(\d+*\.\d+)| - the +* is an invalid double quantifier, and the trailing | matches the empty string
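The promised sketch of "pull the patterns, then validate in a programming language" (Python; not from the workshop, with sample lines made up to match the exercise data above):

import re
from datetime import datetime

text = '''Baker 1 2009-11-17      1223.0
Baker 1 2010-06-24      1122.7
Bogus 9 2010-99-99      0.0'''

# the regex happily matches 2010-99-99; datetime then rejects it
for match in re.finditer(r'[0-9]{4}-[0-9]{2}-[0-9]{2}', text):
    try:
        datetime.strptime(match.group(0), '%Y-%m-%d')
        print match.group(0), 'is a real date'
    except ValueError:
        print match.group(0), 'matches the pattern but is not a real date'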

BREAK

grep = print the lines that match a pattern, e.g. grep -E 'e{2,}' filename in bash
    -E = what follows is an extended regular expression (within single quotes)
Data tip: when designing/choosing how to name or organize data that will need to be searched or analyzed, all things being equal choose a format that will work with regex
sed -e 's/Baker/Capaldi/g' data.txt
sed = stream editor
    -e = what follows is a regular expression (within single quotes)
    s = substitute
    g = globally  (everywhere) - if there are multiple occurrences on each line
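A hypothetical run, assuming data.txt contains the five Baker lines from the exercise above (sed prints the edited stream; the file itself is untouched unless you redirect the output to a new file):

$ sed -e 's/Baker/Capaldi/g' data.txt
Capaldi 1 2009-11-17      1223.0
Capaldi 1 2010-06-24      1122.7
Capaldi 2 2010-08-25      2971.6
Capaldi 1 2011-01-05      1410.0
Capaldi 2 2010-09-04      4671.6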

Memory has been growing faster than our ability to process it, so tools that let us parse pieces of it can be useful (see Amdahl's ratio)
Aside: University of Quebec's supercomputer is used to heat the entire university in the winter (!!!)


 

PYTHON

Goal: count how many papers each author has written; output name, number of papers

cat = "Debbie" creates a variable named cat that points to the value "Debbie"
cat = "Henri" now points the variable to the value "Henri"
print len(cat) will give you the length of the value in the variable
MUST indent the body of a loop
 
 exit the Python interpreter with CTRL-D (or CTRL-Z and RETURN on Windows)
 Opening a file in a program creates a connection to it; it doesn't actually load the whole file into memory
 
 Note that python starts to count from 0 (not 1), i.e. the first element in 
  >> cat = 'Henry'
 can be addressed with
 >> print cat[0]
 H
 
 (bunch of missed stuff)
 
 reader = open('filename.*', 'r') where 'r' means that we're reading from (not writing to) the file
     to close, reader.close()


If you want to run and rerun a bunch of python code, put it in a file instead of retyping it every time.

Save the following into a file called bib.py:

   
reader = open('bibliography.csv', 'r') # open the file for reading
for line in reader: # cycle through the lines
  parts = line.split(',') # break the line into pieces where there are commas
  print parts[2] # print the third column (index 2, because programmers count from zero)


The above will open the file and print the 4-digit year. Printing parts[3] instead would give you the fourth column, i.e. the author.

Then run the bit of code by typing
python bib.py
in a shell
-> it should execute the contents of bib.py using python and print out the results.
If it doesn't work, check that bib.py is in the same directory as the csv file and that you are running python inside the same directory

You can use the output of python in a pipeline (in the shell), e.g.
python bib.py | sort | uniq -c
    uniq = gets rid of adjacent duplicate lines. If you sort first, it will get rid of all duplicate years (see the example below)
     -c = prefix each line with a count of how many duplicates there were
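A tiny illustration of why the sort matters (made-up years; output spacing approximate):

$ printf '2010\n2011\n2010\n' | uniq -c
      1 2010
      1 2011
      1 2010
$ printf '2010\n2011\n2010\n' | sort | uniq -c
      2 2010
      1 2011

The two 2010s weren't adjacent in the first run, so uniq kept both; sorting first makes all duplicates adjacent.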

what if a field has a comma inside it? That's a problem in a comma-delimited file! Fortunately there's a Python module (or library), csv, whose reader can parse such fields properly. To use it, you must import it and then apply it to the data

In Python, bits of code that can be used and reused within your code are called "modules" or "libraries"

 


import csv
raw = open('bibliography.csv', 'r')
cooked = csv.reader(raw)
for line in cooked:
        print line[3]



Note that he went to Google to check the csv problem. Very useful for finding existing libraries (bits of code) that others have written to handle a problem you may have.
 


# Get author names from bibliography file, one per line.

import csv

raw = open('bibliography.csv', 'r')
cooked = csv.reader(raw)

for line in cooked:
        all_authors = line[3]  # this part of the loop reads the 4th column in each line
        authors = all_authors.split(';')
        for a in authors:  # this part of the loop prints the individual authors
                print a.strip() # the .strip removes any spaces at the beginning or the end of a

                

To further process the output of this script, you can use regular bash commands using pipes:

python bib.py | sort | uniq -c


If you often have to handle large spreadsheets of messy data, you can play around with the scripts discussed this morning, or try http://openrefine.org (used to be called Google Refine). A very powerful tool for handling datasets, with many more options than Excel for filtering.

LUNCH


Playing with open data (using Python)


www.zooniverse.org - crowdsourcing science

http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/AUS.csv
This actually has a bunch of parameters (selection criteria in this case) to query their database in a way that we can bookmark it e.g. country = by country, tas = temperature at surface, AUS = for Australia

The three-letter code for each country can be found here:
https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3


From: http://data.worldbank.org/developers/climate-data-api


 

# First version of reading data from the web

import urllib

url = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/AUS.csv'

reader = urllib.urlopen(url)
for line in reader:
    print line


 

If you're using Python 3, the syntax to use urllib is a bit different, use

import urllib.request
reader = urllib.request.urlopen(url)

instead

 


There is a way in R to open info from the web as well, though the syntax will be a bit different

# First version, variation 1, of reading data from the web: this turns the CSV-formatted text into Python lists of character strings

import urllib
import csv #added

url = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/AUS.csv'

raw = urllib.urlopen(url)
cooked = csv.reader(raw)   #added
for line in cooked:
    print line

 


# Version 2

import urllib
import csv

url = 'http://climatedataapi.worldbank.org/climateweb' + \
      '/rest/v1/country/cru/tas/year/AUS.csv'

raw = urllib.urlopen(url)
cooked = csv.reader(raw)
for line in cooked:
    year = int(line[0])
    temp = float(line[1])
    print year, ':', temp

    
...but this version doesn't take into account the first row which consists of text headers, sooo...

import urllib
import csv
url = 'http://climatedataapi.worldbank.org/climateweb' + \
      '/rest/v1/country/cru/tas/year/AUS.csv'
raw = urllib.urlopen(url)
header = raw.readline()  # read the first line off raw (getting the header out of the way)
cooked = csv.reader(raw)  # process the rest of the file
for line in cooked:
    year = int(line[0])
    temp = float(line[1])
    print year, ':', temp

 

# Version 3 - get the average temperature

import urllib
import csv
url = 'http://climatedataapi.worldbank.org/climateweb' + \
      '/rest/v1/country/cru/tas/year/BRA.csv'

raw = urllib.urlopen(url)
header = raw.readline()

cooked = csv.reader(raw)
total = 0.0
num_seen = 0
for line in cooked:
    year = int(line[0])
    temp = float(line[1])
    total = total + temp
    num_seen = num_seen + 1

print 'number of records processed', num_seen
print 'the average temperature is', total / num_seen

 

 

# Version 4: can more easily replace the country name and print it as part of the results 

import urllib
import csv

country = 'NER'

url = 'http://climatedataapi.worldbank.org/climateweb' + \
      '/rest/v1/country/cru/tas/year/{0}.csv'

actual = url.format(country)

raw = urllib.urlopen(actual)
header = raw.readline()

cooked = csv.reader(raw)
total = 0.0
num_seen = 0
for line in cooked:
    year = int(line[0])
    temp = float(line[1])
    total = total + temp
    num_seen = num_seen + 1

print 'country is', country
print 'number of records processed', num_seen
print 'the average temperature is', total / num_seen


sys = the module that gets things off the command line and into the code

What do you count as a particular region? It affects the data (e.g. Russia).

# Version 5 (getting country code from the command line)

import urllib
import csv
import sys

country = sys.argv[1]

url = 'http://climatedataapi.worldbank.org/climateweb' + \
      '/rest/v1/country/cru/tas/year/{0}.csv'

actual = url.format(country)

raw = urllib.urlopen(actual)
header = raw.readline()

cooked = csv.reader(raw)
total = 0.0
num_seen = 0
for line in cooked:
    year = int(line[0])
    temp = float(line[1])
    total = total + temp
    num_seen = num_seen + 1

print 'country is', country
print 'number of records processed', num_seen
print 'the average temperature is', total / num_seen


The above script, however, only works with a single country, e.g. by invoking the program with
python weather.py CAN

What if we want to be able to process multiple countries?

# Version 6: process multiple countries from the command line + uses a function

# Note: result for CAN on 2015-02-06 was -7.42467926655

import urllib
import csv
import sys

URL = 'http://climatedataapi.worldbank.org/climateweb' + \
      '/rest/v1/country/cru/tas/year/{0}.csv'

def get_data(where):
    actual = URL.format(where)
    raw = urllib.urlopen(actual)
    header = raw.readline()
    cooked = csv.reader(raw)
    total = 0.0
    num_seen = 0
    for line in cooked:
        year = int(line[0])
        temp = float(line[1])
        total = total + temp
        num_seen = num_seen + 1
    ave_temp = total / num_seen
    print '{0},{1}'.format(where, ave_temp)

# Now loop over all the countries.                                                                
countries = sys.argv[1:]
print 'country,ave surface temp (C)'
for country in countries:
    get_data(country)


Note that this version of the script wraps the code that grabs the info from the web and computes the average temperature inside a function called get_data(where).

The second part of the script then goes through all countries passed on through the command line (sys.argv[1:] takes all arguments from the 2nd one on - the 1st one is the name of the script itself) and calls the get_data() function for each country.


can then export the output into a file using > ___.csv where ___ is whatever you want to call the file the data will be written to (see the example below)
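For example (the output file name is just a suggestion):

python weather.py CAN FRA AUS > average_temperatures.csv

which would leave a CSV along the lines of:

country,ave surface temp (C)
CAN,-7.42467926655
...

The CAN value matches the note in the script above (result recorded 2015-02-06); the other rows depend on what the API returns.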

RECAP:
1. Seven plus or minus two: the reason for not writing overly long stretches of code. Human working memory holds about seven items, so it's better to divide code into chunks and name (and remember) each chunk instead of all the individual bits
2. Code for 45-50 minutes, take a break before revisiting
3. Get lots of sleep
4. Work with tools that are already there and work best for your needs (e.g. R for stats, Python for text)

THINGS TO LOOK AT:
Bad Data
Open Refine: openrefine.org - data cleanup tool

Random tip: give out three sticky notes in each meeting. Each person uses up a sticky note when they speak, and once they run out they can't speak again until everyone else has used at least one sticky note. (A tool for making sure everyone gets to speak.)



Thomas made a version of this script that goes through all countries, using a CSV file found here:
https://gis.stackexchange.com/questions/1047/full-list-of-iso-alpha-2-and-iso-alpha-3-country-codes
The file is available here: http://geohack.net/gis/wikipedia-iso-country-codes.csv

You can run my code by executing it in the directory where you downloaded the above file:
python weather.py wikipedia-iso-country-codes.csv > average_temperatures.csv

import urllib
import csv
import sys

URL = 'http://climatedataapi.worldbank.org/climateweb' + \
      '/rest/v1/country/cru/tas/year/{0}.csv'

def loop_countries(file):
    raw = open(file, 'r')
    cooked = csv.reader(raw)
    for line in cooked:
        country=line[2]
        get_data(country)
    


def get_data(where):
    actual = URL.format(where)
    raw = urllib.urlopen(actual)
    header = raw.readline()
    cooked = csv.reader(raw)
    total = 0.0
    num_seen = 0
    for line in cooked:
        year = int(line[0])
        temp = float(line[1])
        total = total + temp
        num_seen = num_seen + 1
    if num_seen > 0:
        ave_temp = total / num_seen
        print '{0},{1}'.format(where, ave_temp)

# Load the countries data file                                                             
countryfile = sys.argv[1]

print 'country,ave surface temp (C)'
loop_countries(countryfile)