Programming with R

Key Points

Analyzing Patient Data
  • Use variable <- value to assign a value to a variable in order to record it in memory.

  • Objects are created on demand whenever a value is assigned to them.

  • The function dim gives the dimensions of a data frame.

  • Use object[x, y] to select a single element from a data frame.

  • Use from:to to specify a sequence of indices that starts at from and ends at to (both included).

  • All the indexing and subsetting that works on data frames also works on vectors.

  • Use # to add comments to programs.

  • Use mean, max, min and sd to calculate simple statistics.

  • Use apply to calculate statistics across the rows or columns of a data frame.

  • Use plot to create simple visualizations.
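
The points above can be sketched on a small made-up data frame. The name dat and its values are illustrative assumptions, not the lesson's patient data:

```r
# Made-up stand-in data: 3 patients (rows) by 4 days (columns).
dat <- data.frame(day1 = c(0, 1, 2),
                  day2 = c(1, 2, 3),
                  day3 = c(2, 3, 4),
                  day4 = c(3, 4, 5))

dims <- dim(dat)                      # number of rows, then number of columns
first_value <- dat[1, 1]              # single element: row 1, column 1
block <- dat[1:2, 2:3]                # rows 1 to 2, columns 2 to 3
day1_sd <- sd(dat[, 1])               # standard deviation of the first column
patient_means <- apply(dat, 1, mean)  # MARGIN = 1: one mean per row
day_maxima <- apply(dat, 2, max)      # MARGIN = 2: one maximum per column
```

Printing any of these objects at the console (for example, typing patient_means) shows the computed values.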

Creating Functions
  • Define a function using name <- function(...args...) {...body...}.

  • Call a function using name(...values...).

  • R looks for variables in the current stack frame before looking for them at the top level.

  • Use help(thing) to view help for something.

  • Put comments at the beginning of functions to provide help for that function.

  • Annotate your code!

  • Specify default values for arguments when defining a function using name = value in the argument list.

  • Arguments can be passed by matching based on name, by position, or by omitting them (in which case the default value is used).
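
As a minimal sketch of defining and calling a function with a default argument, consider the following. The function name center and its arguments are hypothetical, chosen only for illustration:

```r
center <- function(data, desired = 0) {
  # Return a copy of data shifted so that its mean equals desired.
  # Example: center(c(1, 2, 3), 10)
  data - mean(data) + desired
}

values <- c(1, 2, 3)

centered_default <- center(values)                      # desired omitted, so 0 is used
centered_by_position <- center(values, 10)              # matched by position
centered_by_name <- center(desired = 10, data = values) # matched by name, any order
```

Matching by name makes the call order-independent, which helps when a function has many optional arguments.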

Analyzing Multiple Data Sets
  • Use for (variable in collection) to process the elements of a collection one at a time.

  • The body of a for loop is surrounded by curly braces ({}).

  • Use length(thing) to determine the length of something that contains other values.

  • Use list.files(path = "path", pattern = "pattern", full.names = TRUE) to create a list of files whose names match a pattern.
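
The loop-over-files pattern might look like the sketch below. It creates its own temporary directory and file names (all assumptions made so the example runs anywhere) rather than relying on real data files:

```r
# Set up a throwaway directory with two matching files and one non-matching file.
tmp_dir <- tempfile("demo-")
dir.create(tmp_dir)
for (name in c("inflammation-01.csv", "inflammation-02.csv", "notes.txt")) {
  writeLines("0,1,2", file.path(tmp_dir, name))
}

# The pattern is a regular expression; full.names = TRUE keeps the directory part.
filenames <- list.files(path = tmp_dir,
                        pattern = "inflammation-.*\\.csv$",
                        full.names = TRUE)
n_files <- length(filenames)

# Process the matching files one at a time.
for (f in filenames) {
  print(basename(f))
}

unlink(tmp_dir, recursive = TRUE)  # clean up the temporary directory
```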

Making Choices
  • Save a plot in a PDF file using pdf("name.pdf") and stop writing to the PDF file with dev.off().

  • Use if (condition) to start a conditional statement, else if (condition) to provide additional tests, and else to provide a default.

  • The bodies of conditional statements should be surrounded by curly braces { }; braces are required whenever a body contains more than one statement.

  • Use == to test for equality.

  • X & Y is only true if both X and Y are true.

  • X | Y is true if either X or Y, or both, are true.
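
A short sketch of saving a plot and combining tests follows. The output path and the variable names are illustrative assumptions:

```r
plot_file <- tempfile(fileext = ".pdf")  # hypothetical output location

pdf(plot_file)        # start sending plots to the PDF file
plot(1:10)            # this plot is written to the file, not shown on screen
invisible(dev.off())  # stop writing and close the file

x <- 5
if (x > 0 & x < 10) {            # both tests must be true
  size <- "single digit"
} else if (x == 10 | x == 100) { # either test may be true
  size <- "round number"
} else {
  size <- "other"
}
```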

Command-Line Programs
  • Use commandArgs(trailingOnly = TRUE) to obtain a vector of the command-line arguments that a program was run with.

  • Avoid silent failures.

  • Use file("stdin") to connect to a program’s standard input.

  • Use cat(vec, sep = "\n") to write the elements of vec to standard output, one per line.
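
One way to structure a command-line program so that it fails loudly is sketched below. The function main and the script name double.R are hypothetical; the arguments are passed in directly so the example runs outside Rscript as well:

```r
# Hypothetical helper: doubles each numeric command-line argument.
main <- function(args) {
  if (length(args) == 0) {
    # Fail loudly instead of silently doing nothing.
    stop("usage: Rscript double.R number [number ...]")
  }
  values <- suppressWarnings(as.numeric(args))
  if (anyNA(values)) {
    stop("all arguments must be numbers")
  }
  doubled <- values * 2
  cat(doubled, sep = "\n")  # one value per line on standard output
  invisible(doubled)
}

# In a script run with Rscript, the call would be:
#   main(commandArgs(trailingOnly = TRUE))
doubled <- main(c("1", "2.5", "3"))
```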

Best Practices for Writing R
  • Start each program with a description of what it does.

  • Then load all required packages.

  • Consider what working directory you are in when sourcing a script.

  • Use comments to mark off sections of code.

  • Put function definitions at the top of your file, or in a separate file if there are many.

  • Name and style code consistently.

  • Break code into small, discrete pieces.

  • Factor out common operations rather than repeating them.

  • Keep all of the source files for a project in one directory and use relative paths to access them.

  • Keep track of the memory used by your program.

  • Always start with a clean environment instead of saving the workspace.

  • Keep track of session information in your project folder.

  • Have someone else review your code.

  • Use version control.
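
Several of these practices can be seen together in a small script skeleton. This is only a sketch: the function name, the data, and the output path are made up for illustration.

```r
# Summarize a small vector of readings (hypothetical example script).

# Load required packages first; only the bundled stats package is needed here.
library(stats)

# ---- Functions ----
summarize_readings <- function(readings) {
  # Return the mean and standard deviation of readings as a named vector.
  c(mean = mean(readings), sd = sd(readings))
}

# ---- Analysis ----
result <- summarize_readings(c(2, 4, 6))

# ---- Record session information ----
# A temporary directory is used here; a real project would use a relative
# path inside the project folder.
session_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```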

Dynamic Reports with knitr
  • Use knitr to generate reports that combine text, code, and results.

  • Use Markdown to format text.

  • Put code in blocks delimited by triple back quotes followed by {r}.

Making Packages in R
  • A package is the basic unit of reusability in R.

  • Every package must have a DESCRIPTION file and an R directory containing code.

Introduction to RStudio
  • Using RStudio can make programming in R much more productive.

Addressing Data
  • Data in data frames can be addressed by index (slicing), by logical vector, or by name (columns only).

  • Use the $ operator to address a column by name.
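
The addressing styles can be compared side by side on a made-up data frame (the column names and values are assumptions for illustration):

```r
dat <- data.frame(id = c("a", "b", "c"), score = c(10, 25, 40))

by_index <- dat[2, 2]                # row 2, column 2
by_slice <- dat[1:2, ]               # rows 1 to 2, all columns
by_logical <- dat[dat$score > 20, ]  # rows where the logical vector is TRUE
by_name <- dat[, "score"]            # column selected by name
by_dollar <- dat$score               # same column using the $ operator
```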

Reading and Writing CSV Files
  • Import data from a .csv file using the read.csv(...) function.

  • Understand some of the key arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.

  • Write data to a new .csv file using the write.csv(...) function.

  • Understand some of the key arguments available for exporting the data properly, such as row.names and na. (write.csv ignores col.names with a warning; use write.table to change it.)
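
A round trip through a .csv file might look like this sketch. The data frame and the temporary file path are illustrative assumptions:

```r
csv_path <- tempfile(fileext = ".csv")  # hypothetical file location

cars <- data.frame(make = c("Ford", "Honda"), speed = c(65, NA))

# row.names = FALSE omits the row-number column; na = "" writes missing values as blanks.
write.csv(cars, file = csv_path, row.names = FALSE, na = "")

# header = TRUE treats the first line as column names;
# stringsAsFactors = FALSE keeps text columns as character vectors;
# strip.white = TRUE trims whitespace around unquoted fields.
back <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE,
                 strip.white = TRUE)
```

Blank numeric fields are read back in as NA, so the missing speed survives the round trip.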

Understanding Factors
  • Factors are used to represent categorical data.

  • Factors can be ordered or unordered.

  • Some R functions have special methods for handling factors.
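
As a brief sketch of an ordered factor (the category names are made up for illustration):

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

size_levels <- levels(sizes)       # the categories, in their stated order
size_counts <- summary(sizes)      # summary() on a factor counts each level
is_smaller <- sizes[1] < sizes[2]  # comparisons are defined for ordered factors
size_codes <- as.integer(sizes)    # the underlying integer codes
```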

Data Types and Structures
  • R’s basic data types are character, numeric, integer, complex, and logical.

  • R’s basic data structures include the vector, list, matrix, data frame, and factor.

  • Objects may have attributes, such as name, dimension, and class.
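
These types, structures, and attributes can be inspected directly. The following is a small sketch with made-up values:

```r
# class() reports the type of each basic value.
type_names <- c(class("a"), class(1.5), class(1L), class(1 + 2i), class(TRUE))

scores <- c(first = 1, second = 2)  # a vector with a names attribute
score_names <- names(scores)

m <- matrix(1:6, nrow = 2)          # a matrix has a dimension attribute
m_dim <- dim(m)

df <- data.frame(x = 1:2)           # a data frame has a class attribute
df_class <- class(df)

mixed <- list(1, "a", TRUE)         # a list may hold values of different types
mixed_length <- length(mixed)
```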

The Call Stack
  • R keeps track of active function calls using a call stack composed of stack frames.

  • Only global variables and variables in the current stack frame can be accessed directly.
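
The effect of stack frames on variable lookup can be sketched as follows. All names here are hypothetical:

```r
total <- 10  # a global variable

add_locals <- function() {
  total <- 1       # a new variable in this call's frame; the global is untouched
  local_only <- 2  # exists only while this call is active
  total + local_only
}

read_global <- function() {
  total + 1  # no local total here, so R finds the global one
}

result <- add_locals()
global_after <- total
seen_from_function <- read_global()
local_still_exists <- exists("local_only")
```

After add_locals() returns, its frame is discarded, so local_only is no longer accessible and the global total keeps its original value.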

Loops in R
  • Where possible, use vectorized operations instead of for loops to make code faster and more concise.

  • Use functions such as apply instead of for loops to operate on the values in a data structure.
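
To compare the three approaches, here is a minimal sketch that squares each element of a made-up vector:

```r
values <- c(1, 2, 3, 4)

# Explicit for loop: preallocate the result, then fill it element by element.
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Vectorized operation: the arithmetic applies to every element at once.
squares_vectorized <- values^2

# sapply, a member of the apply family, calls the function on each element.
squares_sapply <- sapply(values, function(v) v^2)
```

All three produce the same result; the vectorized form is the shortest and typically the fastest.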

Basic Operation

List objects in current environment: ls()

Remove objects in current environment: rm(x)

Remove all objects from current environment: rm(list = ls())

Control Flow

if (x > 0) {
	print("value is positive")
} else if (x < 0) {
	print("value is negative")
} else {
	print("value is neither positive nor negative")
}

for (i in 1:5) {
	print(i)
}

The for loop will print:

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Functions

is_positive <- function(integer_value) {
	if (integer_value > 0) {
		TRUE
	} else {
		FALSE
	}
}

In R, the value of the last evaluated expression in a function is returned automatically.

increment_me <- function(value_to_increment, value_to_increment_by = 1){
	value_to_increment + value_to_increment_by
}

increment_me(4) will return 5

increment_me(4, 6) will return 10

apply(dat, MARGIN = 2, mean) will return the average (mean) of each column in dat

Packages

Glossary

argument
A value given to a function or program when it runs. The term is often used interchangeably (and inconsistently) with parameter.
call stack
A data structure inside a running program that keeps track of active function calls. Each call’s variables are stored in a stack frame; a new stack frame is put on top of the stack for each call, and discarded when the call is finished.
comma-separated values (CSV)
A common textual representation for tables in which the values in each row are separated by commas.
comment
A remark in a program that is intended to help human readers understand what is going on, but is ignored by the computer. Comments in Python, R, and the Unix shell start with a # character and run to the end of the line; comments in SQL start with --, and other languages have other conventions.
conditional statement
A statement in a program that might or might not be executed depending on whether a test is true or false.
dimensions (of an array)
An array’s extent, represented as a vector. For example, an array with 5 rows and 3 columns has dimensions (5,3).
documentation
Human-language text written to explain what software does, how it works, or how to use it.
encapsulation
The practice of hiding something’s implementation details so that the rest of a program can worry about what it does rather than how it does it.
for loop
A loop that is executed once for each value in some kind of set, list, or range. See also: while loop.
function body
The statements that are executed inside a function.
function call
A use of a function in another piece of software.
function composition
The immediate application of one function to the result of another, such as f(g(x)).
index
A subscript that specifies the location of a single value in a collection, such as a single pixel in an image.
loop variable
The variable that keeps track of the progress of the loop.
notional machine
An abstraction of a computer used to think about what it can and will do.
parameter
A variable named in the function’s declaration that is used to hold a value passed into the call. The term is often used interchangeably (and inconsistently) with argument.
pipe
A connection from the output of one program to the input of another. When two or more programs are connected in this way, they are called a “pipeline”.
return statement
A statement that causes a function to stop executing and return a value to its caller immediately.
silent failure
Failing without producing any warning messages. Silent failures are hard to detect and debug.
slice
A regular subsequence of a larger sequence, such as the first five elements or every second element.
stack frame
A data structure that provides storage for a function’s local variables. Each time a function is called, a new stack frame is created and put on the top of the call stack. When the function returns, the stack frame is discarded.
standard input (stdin)
A process’s default input stream. In interactive command-line applications, it is typically connected to the keyboard; in a pipe, it receives data from the standard output of the preceding process.
standard output (stdout)
A process’s default output stream. In interactive command-line applications, data sent to standard output is displayed on the screen; in a pipe, it is passed to the standard input of the next process.
string
Short for “character string”, a sequence of zero or more characters.
while loop
A loop that keeps executing as long as some condition is true. See also: for loop.
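
A few of the glossary terms (while loop, return statement, function composition) can be illustrated in R. The helper function below is hypothetical and serves only as a sketch:

```r
# Hypothetical helper: smallest power of two that is at least n.
next_power_of_two <- function(n) {
  if (n <= 1) {
    return(1)  # return statement: stop here and hand 1 back to the caller
  }
  power <- 1
  while (power < n) {  # while loop: repeat as long as the condition is true
    power <- power * 2
  }
  power  # value of the last expression is returned
}

small <- next_power_of_two(1)
medium <- next_power_of_two(5)
exact <- next_power_of_two(16)

# Function composition: apply sqrt to the result of abs.
composed <- sqrt(abs(-16))
```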