Content from Analyzing Patient Data
Last updated on 2025-10-07
Estimated time: 55 minutes
Overview
Questions
- How do I read data into R?
 - How do I assign variables?
 - What is a data frame?
 - How do I access subsets of a data frame?
 - How do I calculate simple statistics like mean and median?
 - Where can I get help?
 - How can I plot my data?
 
Objectives
- Read tabular data from a file into a program.
 - Assign values to variables.
 - Select individual values and subsections from data.
 - Perform operations on a data frame of data.
 - Display simple graphs.
 
We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets (here is a link to download the data for this lesson). The data sets are stored in comma-separated values (CSV) format. Each row holds the observations for just one patient. Each column holds the inflammation measured on a single day, so we have a set of values across successive days. The first few rows of our first file look like this:
OUTPUT
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
We want to:
- Load data into memory,
 - Calculate the average value of inflammation per day across all patients, and
 - Plot the results.
 
To do all that, we’ll have to learn a little bit about programming.
Getting started
Since we want to import the file called
inflammation-01.csv into our R environment, we need to be
able to tell our computer where the file is. To do this, we will create
a “Project” with RStudio that contains the data we want to work with.
The “Projects” interface in RStudio not only creates a working directory
for you, but also remembers its location (allowing you to quickly
navigate to it). The interface also (optionally) preserves custom
settings and open files to make it easier to resume work after a
break.
Create a new project
- Under the File menu in RStudio, click on New project, choose New directory, then New project
 - Enter a name for this new folder (or "directory") and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., ~/Desktop/r-novice-inflammation).
 - Click on Create project
 - Create a new file where we will type our scripts. Go to File > New File > R script. Click the save icon on your toolbar and save your script as "script.R".
 - Make sure you copy the data for the lesson into this folder, if they're not there already.
 
Loading Data
Now that we are set up with an RStudio project, we can be sure that the
data and scripts we are using are all in our working directory. The data
files should be located in the directory data, inside the
working directory. Now we can load the data into R using
read.csv:
R
read.csv(file = "data/inflammation-01.csv", header = FALSE)
The expression read.csv(...) is a function call that asks R to run
the function read.csv.
read.csv has two arguments: the name of the file we
want to read, and whether the first line of the file contains names for
the columns of data. The filename needs to be a character string (or string for short), so we put it in
quotes. Assigning the second argument, header, to be
FALSE indicates that the data file does not have column
headers. We’ll talk more about the value FALSE, and its
converse TRUE, in lesson 04. In the case of our
inflammation-01.csv example, R auto-generates column names
in the sequence V1 (for “variable 1”), V2, and
so on, until V40.
Other Options for Reading CSV Files
read.csv actually has many more arguments that you may
find useful when importing your own data in the future. You can learn
more about these options in this supplementary lesson.
Loading Data with Headers
What happens if you forget to put header = FALSE? The
default value is header = TRUE, which you can check with
?read.csv or help(read.csv). What do you
expect will happen if you leave the default value of
header? Before you run any code, think about what will
happen to the first few rows of your data frame, and its overall size.
Then run the following code and see if your expectations agree:
R
read.csv(file = "data/inflammation-01.csv")
R will construct column headers from values in your first row of
data, resulting in X0 X0.1 X1 X3 X1.1 X2 ....
Note that the character X is prepended: a standalone
number would not be a valid variable name. Because column headers are
variables, the same naming rules apply. Appending .1,
.2 etc. is necessary to avoid duplicate column headers.
Reading Different Decimal Point Formats
Depending on the country you live in, the local standard may use either a dot
or a comma as the decimal mark. Also, different devices or software can
generate data with different kinds of decimal marks. Take a look at
?read.csv and write the code to load a file called
commadec.txt that has numeric values with commas as decimal
mark, separated by semicolons.
R
read.csv(file = "data/commadec.txt", sep = ";", dec = ",")
or the built-in shortcut:
R
read.csv2(file = "data/commadec.txt")
A function will perform its given action on whatever value is passed
to the argument(s). For example, in this case if we provided the name of
a different file to the argument file,
read.csv would read that instead. We’ll learn more about
the details of functions and their arguments in the next lesson.
Since we didn’t tell it to do anything else with the function’s
output, the console will display the full contents of the file
inflammation-01.csv. Try it out.
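If we give read.csv the name of a different file, it displays that file's contents instead. For instance, assuming the second data file from the lesson download is also in the data folder:
R
# same call pattern, different file argument
read.csv(file = "data/inflammation-02.csv", header = FALSE)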
read.csv reads the file, but we can’t use the data
unless we assign it to a variable. We can think of a variable as a
container with a name, such as x,
current_temperature, or subject_id that
contains one or more values. We can create a new variable and assign a
value to it using <-.
R
weight_kg <- 55
Once a variable is created, we can use the variable name to refer to
the value it was assigned. The variable name now acts as a tag. Whenever
R reads that tag (weight_kg), it substitutes the value
(55).
To see the value of a variable, we can print it by typing the name of the
variable and hitting CTRL + Return (or CTRL +
Enter) while working in the "script.R" file in the
editor, which is the recommended approach. If we are working in the console directly,
we need to hit Return (or Enter). In general, R
will print to the console any object returned by a function or operation
unless we assign it to a variable.
R
weight_kg
OUTPUT
[1] 55
We can treat our variable like a regular number, and do arithmetic with it:
R
# weight in pounds:
2.2 * weight_kg
OUTPUT
[1] 121
Commenting
We can add comments to our code using the # character.
It is useful to document our code in this way so that others (and us the
next time we read it) have an easier time following what the code is
doing.
We can also change a variable’s value by assigning it a new value:
R
weight_kg <- 57.5
# weight in kilograms is now
weight_kg
OUTPUT
[1] 57.5
Variable Naming Conventions
Historically, R programmers have used a variety of conventions for
naming variables. The . character in R can be a valid part
of a variable name; thus the above assignment could have easily been
weight.kg <- 57.5. This is often confusing to R
newcomers who have programmed in languages where . has a
more significant meaning. Today, most R programmers 1) start variable
names with lower case letters, 2) separate words in variable names with
underscores, and 3) use only lowercase letters, underscores, and numbers
in variable names. The Tidyverse Style Guide includes a section on this and
other style considerations.
Assigning a new value to a variable breaks the connection with the old value; R forgets that number and applies the variable name to the new value.
When you assign a value to a variable, R only stores the value, not the calculation you used to create it. This is an important point if you’re used to the way a spreadsheet program automatically updates linked cells. Let’s look at an example.
First, we’ll convert weight_kg into pounds, and store
the new value in the variable weight_lb:
R
weight_lb <- 2.2 * weight_kg
# weight in kg...
weight_kg
OUTPUT
[1] 57.5
R
# ...and in pounds
weight_lb
OUTPUT
[1] 126.5
In words, we’re asking R to look up the value we tagged
weight_kg, multiply it by 2.2, and tag the result with the
name weight_lb:
If we now change the value of weight_kg:
R
weight_kg <- 100.0
# weight in kg now...
weight_kg
OUTPUT
[1] 100
R
# ...and weight in pounds still
weight_lb
OUTPUT
[1] 126.5
Since weight_lb doesn’t “remember” where its value came
from, it isn’t automatically updated when weight_kg
changes. This is different from the way spreadsheets work.
Printing with Parentheses
An alternative way to print the value of a variable is to wrap the assignment statement in parentheses. As an example, (total_weight <- weight_kg * 2.2 + weight_lb) adds the values of weight_kg * 2.2 and weight_lb, assigns the result to total_weight, and finally prints the assigned value of the variable total_weight.
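Assuming the assignments above have been run in order (weight_kg is 100 and weight_lb is 126.5), this assigns and prints the value in one step:
R
(total_weight <- weight_kg * 2.2 + weight_lb)
OUTPUT
[1] 346.5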
Now that we know how to assign things to variables, let’s re-run
read.csv and save its result into a variable called
dat:
R
dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
This statement doesn’t produce any output because the assignment
doesn’t display anything. If we want to check if our data has been
loaded, we can print the variable’s value by typing the name of the
variable dat. However, for large data sets it is convenient
to use the function head to display only the first few rows
of data.
R
head(dat)
OUTPUT
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1  0  0  1  3  1  2  4  7  8   3   3   3  10   5   7   4   7   7  12  18   6
2  0  1  2  1  2  1  3  2  2   6  10  11   5   9   4   4   7  16   8   6  18
3  0  1  1  3  3  2  6  2  5   9   5   7   4   5   4  15   5  11   9  10  19
4  0  0  2  0  4  2  2  1  6   7  10   7   9  13   8   8  15  10  10   7  17
5  0  1  1  3  3  1  3  5  2   4   4   7   6   5   3  10   8  10   6  17   9
6  0  0  1  2  2  4  2  1  6   4   7   6   6   9   9  15   4  16  18  12  12
  V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
1  13  11  11   7   7   4   6   8   8   4   4   5   7   3   4   2   3   0   0
2   4  12   5  12   7  11   5  11   3   3   5   4   4   5   5   1   1   0   1
3  14  12  17   7  12  11   7   4   2  10   5   4   2   2   3   2   2   1   1
4   4   4   7   6  15   6   4   9  11   3   5   6   3   3   4   2   3   2   1
5  14   9   7  13   9  12   6   7   7   9   6   3   2   2   4   2   0   1   1
6   5  18   9   5   3  10   3  12   7   8   4   7   3   5   4   4   3   2   1
Assigning Values to Variables
Draw diagrams showing what variables refer to what values after each statement in the following program:
R
mass <- 47.5
age <- 122
mass <- mass * 2.0
age <- age - 20
R
mass <- 47.5
age <- 122
R
mass <- mass * 2.0
age <- age - 20
Manipulating Data
Now that our data are loaded into R, we can start doing things with
them. First, let’s ask what type of thing dat is:
R
class(dat)
OUTPUT
[1] "data.frame"
The output tells us that it’s a data frame. Think of this structure as a spreadsheet in MS Excel that many of us are familiar with. Data frames are very useful for storing data and you will use them frequently when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns.
We can see the shape, or dimensions, of the data
frame with the function dim:
R
dim(dat)
OUTPUT
[1] 60 40
This tells us that our data frame, dat, has 60 rows and
40 columns.
If we want to get a single value from the data frame, we can provide an index in square brackets. The first number specifies the row and the second the column:
R
# first value in dat, row 1, column 1
dat[1, 1]
OUTPUT
[1] 0
R
# middle value in dat, row 30, column 20
dat[30, 20]
OUTPUT
[1] 16
The first value in a data frame index is the row, the second value is
the column. If we want to select more than one row or column, we can use
the function c, which combines the values
you give it into one vector or list. For example, to pick columns 10 and
20 from rows 1, 3, and 5, we can do this:
R
dat[c(1, 3, 5), c(10, 20)]
OUTPUT
  V10 V20
1   3  18
3   9  10
5   4  17
We frequently want to select contiguous rows or columns, such as the
first ten rows, or columns 3 through 7. You can use c for
this, but it’s more convenient to use the : operator. This
special function generates sequences of numbers:
R
1:5
OUTPUT
[1] 1 2 3 4 5
R
3:12
OUTPUT
 [1]  3  4  5  6  7  8  9 10 11 12
For example, we can select the first ten columns of values for the first four rows like this:
R
dat[1:4, 1:10]
OUTPUT
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1  0  0  1  3  1  2  4  7  8   3
2  0  1  2  1  2  1  3  2  2   6
3  0  1  1  3  3  2  6  2  5   9
4  0  0  2  0  4  2  2  1  6   7
or the first ten columns of rows 5 to 10 like this:
R
dat[5:10, 1:10]
OUTPUT
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
5   0  1  1  3  3  1  3  5  2   4
6   0  0  1  2  2  4  2  1  6   4
7   0  0  2  2  4  2  2  5  5   8
8   0  0  1  2  3  1  2  3  5   3
9   0  0  0  3  1  5  6  5  5   8
10  0  1  1  2  1  3  5  3  5   8
If you want to select all rows or all columns, leave that index value empty.
R
# All columns from row 5
dat[5, ]
OUTPUT
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
5  0  1  1  3  3  1  3  5  2   4   4   7   6   5   3  10   8  10   6  17   9
  V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
5  14   9   7  13   9  12   6   7   7   9   6   3   2   2   4   2   0   1   1
R
# All rows from column 16-18
dat[, 16:18]
OUTPUT
   V16 V17 V18
1    4   7   7
2    4   7  16
3   15   5  11
4    8  15  10
5   10   8  10
6   15   4  16
7   13   5  12
8    9  15  11
9   11   9  10
10   6  13   8
11   3   7  13
12   8  14  11
13  12   4  17
14   3  10  13
15   5   7  17
16  10   7   8
17  11  12   5
18   4  14   7
19  11  15  17
20  13   6   5
21  15  13   6
22   5  12  12
23  14   5   5
24  13   7  14
25   4  12   9
26   9   5  16
27  13   4  13
28   6  15   6
29   7   6  11
30   6   8   7
31  14  12   8
32   3   8  10
33  15  15  10
34   4  12   9
35  15   9  17
36  11   5   7
37   7   4   7
38  10   6   7
39  15  12  13
40   6   8  15
41   5   7   5
42   6  10  13
43  15  11  12
44  11   6  10
45  15  12  15
46   6   7  11
47  11  16  12
48  15   5  15
49  14   4   6
50   4   7   9
51  10  13   6
52  15  15  12
53  11  15  13
54   6  11  12
55  13   8   9
56   8   8  16
57   4  16  11
58  13  13   9
59  12  15   5
60   9  14  11
If you leave both index values empty (i.e., dat[,]), you
get the entire data frame.
Addressing Columns by Name
Columns can also be addressed by name, with either the $
operator (i.e., dat$V16) or square brackets (i.e.,
dat[, 'V16']). You can learn more about subsetting by
column name in this supplementary lesson.
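As a quick sketch using the auto-generated column names from above, both forms return the same vector:
R
# first few values of column 16, selected by name
head(dat$V16)
OUTPUT
[1]  4  4 15  8 10 15
The bracket form head(dat[, "V16"]) gives the same result.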
Now let’s perform some common mathematical operations to learn more about our inflammation data. When analyzing data we often want to calculate statistics for specific subsets or groups of the data frame, such as the maximum value per patient or the average value per day. One way to do this is to select the data we want into a new temporary data frame, and then perform the calculation on this subset:
R
# first row, all of the columns
patient_1 <- dat[1, ]
# max inflammation for patient 1
max(patient_1)
OUTPUT
[1] 18
We don’t actually need to store the row in a variable of its own. Instead, we can combine the selection and the function call:
R
# max inflammation for patient 2
max(dat[2, ])
OUTPUT
[1] 18
R also has functions for other common calculations, e.g. finding the minimum, mean, median, and standard deviation of the data:
R
# minimum inflammation on day 7
min(dat[, 7])
OUTPUT
[1] 1
R
# mean inflammation on day 7
mean(dat[, 7])
OUTPUT
[1] 3.8
R
# median inflammation on day 7
median(dat[, 7])
OUTPUT
[1] 4
R
# standard deviation of inflammation on day 7
sd(dat[, 7])
OUTPUT
[1] 1.725187
Forcing Conversion
Note that R may return an error when you attempt to perform similar
calculations on subsetted rows of data frames. This is because
some functions in R automatically convert the object type to a numeric
vector, while others do not (e.g. max(dat[1, ]) works as
expected, while mean(dat[1, ]) returns NA and
a warning). You get the expected output by including an explicit call to
as.numeric(), e.g. mean(as.numeric(dat[1, ])).
By contrast, calculations on subsetted columns always work as
expected, since columns of data frames are already defined as
vectors.
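A minimal sketch of that workaround, using patient 1's row from dat:
R
# mean inflammation for patient 1; convert the row to a numeric vector first
mean(as.numeric(dat[1, ]))
OUTPUT
[1] 5.45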
R also has a function that summarizes the previous common calculations:
R
# Summarize function
summary(dat[, 1:4])
OUTPUT
       V1          V2             V3              V4
 Min.   :0   Min.   :0.00   Min.   :0.000   Min.   :0.00
 1st Qu.:0   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:1.00
 Median :0   Median :0.00   Median :1.000   Median :2.00
 Mean   :0   Mean   :0.45   Mean   :1.117   Mean   :1.75
 3rd Qu.:0   3rd Qu.:1.00   3rd Qu.:2.000   3rd Qu.:3.00
 Max.   :0   Max.   :1.00   Max.   :2.000   Max.   :3.00  
For every column in the data frame, the function “summary” calculates: the minimum value, the first quartile, the median, the mean, the third quartile and the max value, giving helpful details about the sample distribution.
What if we need the maximum inflammation for all patients, or the average for each day? As the diagram below shows, we want to perform the operation across a margin of the data frame:
To support this, we can use the apply function.
Getting Help
To learn about a function in R, e.g. apply, we can read
its help documentation by running help(apply) or
?apply.
apply allows us to repeat a function on all of the rows
(MARGIN = 1) or columns (MARGIN = 2) of a data
frame.
Thus, to obtain the average inflammation of each patient we will need
to calculate the mean of all of the rows (MARGIN = 1) of
the data frame.
R
avg_patient_inflammation <- apply(dat, 1, mean)
And to obtain the average inflammation of each day we will need to
calculate the mean of all of the columns (MARGIN = 2) of
the data frame.
R
avg_day_inflammation <- apply(dat, 2, mean)
Since the second argument to apply is
MARGIN, the above command is equivalent to
apply(dat, MARGIN = 2, mean). We’ll learn why this is so in
the next lesson.
Efficient Alternatives
Some common operations have more efficient alternatives. For example,
you can calculate the row-wise or column-wise means with
rowMeans and colMeans, respectively.
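A quick sketch comparing the two approaches (the _2 names are only for comparison and are not used elsewhere in the lesson):
R
avg_patient_inflammation_2 <- rowMeans(dat)  # same values as apply(dat, 1, mean)
avg_day_inflammation_2     <- colMeans(dat)  # same values as apply(dat, 2, mean)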
Subsetting Data
We can take subsets of character vectors as well:
R
animal <- c("m", "o", "n", "k", "e", "y")
# first three characters
animal[1:3]
OUTPUT
[1] "m" "o" "n"
R
# last three characters
animal[4:6]
OUTPUT
[1] "k" "e" "y"
- If the first four characters are selected using the subset animal[1:4], how can we obtain the first four characters in reverse order?
 - What is animal[-1]? What is animal[-4]? Given those answers, explain what animal[-1:-4] does.
 - Use a subset of animal to create a new character vector that spells the word “eon”, i.e. c("e", "o", "n").
- animal[4:1] gives the first four characters in reverse order.
 - animal[-1] returns "o" "n" "k" "e" "y" and animal[-4] returns "m" "o" "n" "e" "y", which means that a single - removes the element at the given index position. animal[-1:-4] removes that whole subset, returning "e" "y", which is equivalent to animal[5:6].
 - animal[c(5, 2, 3)] combines indexing with the c function to spell "eon".
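These answers can be checked directly at the console; a quick sketch:
R
animal[4:1]        # "k" "n" "o" "m" -- first four characters, reversed
animal[-1]         # "o" "n" "k" "e" "y"
animal[-4]         # "m" "o" "n" "e" "y"
animal[-1:-4]      # "e" "y"
animal[c(5, 2, 3)] # "e" "o" "n"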
Subsetting More Data
Suppose you want to determine the maximum inflammation for patient 5 across days three to seven. To do this you would extract the relevant subset from the data frame and calculate the maximum value. Which of the following lines of R code gives the correct answer?
1. max(dat[5, ])
2. max(dat[3:7, 5])
3. max(dat[5, 3:7])
4. max(dat[5, 3, 7])
Answer: 3
Explanation: You want to extract the part of the dataframe
representing data for patient 5 from days three to seven. In this
dataframe, patient data is organised in rows and the days are
represented by the columns. Subscripting in R follows the
[i, j] principle, where i = rows and
j = columns. Thus, answer 3 is correct since the patient is
represented by the value for i (5) and the days are represented by the
values in j, which is a subset spanning day 3 to 7.
Subsetting and Re-Assignment
Using the inflammation data frame dat from above: Let’s
pretend there was something wrong with the instrument on the first five
days for every second patient (#2, 4, 6, etc.), which resulted in the
measurements being twice as large as they should be.
- Write a vector containing each affected patient (hint: ?seq)
 - Create a new data frame in which you halve the first five days’ values in only those patients
 - Print out the corrected data frame to check that your code has fixed the problem
 
R
whichPatients <- seq(2, 60, 2) # i.e., which rows
whichDays <- seq(1, 5)         # i.e., which columns
dat2 <- dat
# check the size of your subset: returns `30 5`, that is 30 [rows=patients] by 5 [columns=days]
dim(dat2[whichPatients, whichDays])
dat2[whichPatients, whichDays] <- dat2[whichPatients, whichDays] / 2
dat2
Using the Apply Function on Patient Data
Challenge: the apply function can be used to summarize datasets and subsets of data across rows and columns using the MARGIN argument. Suppose you want to calculate the mean inflammation for specific days and patients in the patient dataset (i.e. 60 patients across 40 days).
Please use a combination of the apply function and indexing to:
- calculate the mean inflammation for patients 1 to 5 over the whole 40 days
 - calculate the mean inflammation for days 1 to 10 (across all patients).
 - calculate the mean inflammation for every second day (across all patients).
 
Think about the number of rows and columns you would expect as the result before each apply call and check your intuition by applying the mean function.
R
# 1.
apply(dat[1:5, ], 1, mean)
# 2.
apply(dat[, 1:10], 2, mean)
# 3.
apply(dat[, seq(2, 40, by = 2)], 2, mean)
Plotting
The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few of R’s plotting features.
Let’s take a look at the average inflammation over time. Recall that
we already calculated these values above using
apply(dat, 2, mean) and saved them in the variable
avg_day_inflammation. Plotting the values is done with the
function plot.
R
plot(avg_day_inflammation)
Above, we gave the function plot a vector of numbers
corresponding to the average inflammation per day across all patients.
plot created a scatter plot where the y-axis is the average
inflammation level and the x-axis is the order, or index, of the values
in the vector, which in this case correspond to the 40 days of
treatment. The result is roughly a linear rise and fall, which is
suspicious: based on other studies, we expect a sharper rise and slower
fall. Let’s have a look at two other statistics: the maximum and minimum
inflammation per day.
R
max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation)
R
min_day_inflammation <- apply(dat, 2, min)
plot(min_day_inflammation)
The maximum value rises and falls perfectly smoothly, while the minimum seems to be a step function. Neither result seems particularly likely, so either there’s a mistake in our calculations or something is wrong with our data.
Plotting Data
Create a plot showing the standard deviation of the inflammation data for each day across all patients.
R
sd_day_inflammation <- apply(dat, 2, sd)
plot(sd_day_inflammation)
Generative AI Help
The section on generative AI is intended to be concise but Instructors may choose to devote more time to the topic in a workshop. Depending on your own level of experience and comfort with talking about and using these tools, you could choose to do any of the following:
- Explain how large language models work and are trained, and/or the difference between generative AI, other forms of AI that currently exist, and the limits of what LLMs can do (e.g., they can’t “reason”).
 - Demonstrate how you recommend that learners use generative AI.
 - Discuss the ethical concerns listed below, as well as others that you are aware of, to help learners make an informed choice about whether or not to use generative AI tools.
 
This is a fast-moving technology. If you are preparing to teach this section and you feel it has become outdated, please open an issue on the lesson repository to let the Maintainers know and/or a pull request to suggest updates and improvements.
It is increasingly common for people to use generative AI chatbots such as ChatGPT to get help while coding. You will probably receive some useful guidance by presenting your error message to the chatbot and asking it what went wrong. However, the way this help is provided by the chatbot is different. Answers on StackOverflow have (probably) been given by a human as a direct response to the question asked. But generative AI chatbots, which are based on an advanced statistical model, respond by generating the most likely sequence of text that would follow the prompt they are given.
While responses from generative AI tools can often be helpful, they are not always reliable. These tools sometimes generate plausible but incorrect or misleading information, so (just as with an answer found on the internet) it is essential to verify their accuracy. You need the knowledge and skills to be able to understand these responses, to judge whether or not they are accurate, and to fix any errors in the code it offers you.
In addition to asking for help, programmers can use generative AI tools to generate code from scratch; extend, improve and reorganise existing code; translate code between programming languages; figure out what terms to use in a search of the internet; and more. However, there are drawbacks that you should be aware of.
The models used by these tools have been “trained” on very large volumes of data, much of it taken from the internet, and the responses they produce reflect that training data, and may recapitulate its inaccuracies or biases. The environmental costs (energy and water use) of LLMs are a lot higher than other technologies, both during development (known as training) and when an individual user uses one (also called inference). For more information see the AI Environmental Impact Primer developed by researchers at HuggingFace, an AI hosting platform. Concerns also exist about the way the data for this training was obtained, with questions raised about whether the people developing the LLMs had permission to use it. Other ethical concerns have also been raised, such as reports that workers were exploited during the training process.
We recommend that you avoid getting help from generative AI during the workshop for several reasons:
- For most problems you will encounter at this stage, help and answers can be found among the first results returned by searching the internet.
 - The foundational knowledge and skills you will learn in this lesson by writing and fixing your own programs are essential to be able to evaluate the correctness and safety of any code you receive from online help or a generative AI chatbot. If you choose to use these tools in the future, the expertise you gain from learning and practising these fundamentals on your own will help you use them more effectively.
 - As you start out with programming, the mistakes you make will be the kinds that have also been made – and overcome! – by everybody else who learned to program before you. Since these mistakes and the questions you are likely to have at this stage are common, they are also better represented than other, more specialised problems and tasks in the data that was used to train generative AI tools. This means that a generative AI chatbot is more likely to produce accurate responses to questions that novices ask, which could give you a false impression of how reliable they will be when you are ready to do things that are more advanced.
 
- Use variable <- value to assign a value to a variable in order to record it in memory.
 - Objects are created on demand whenever a value is assigned to them.
 - The function dim gives the dimensions of a data frame.
 - Use object[x, y] to select a single element from a data frame.
 - Use from:to to specify a sequence that includes the indices from from to to.
 - All the indexing and subsetting that works on data frames also works on vectors.
 - Use # to add comments to programs.
 - Use mean, max, min and sd to calculate simple statistics.
 - Use apply to calculate statistics across the rows or columns of a data frame.
 - Use plot to create simple visualizations.
Content from Creating Functions
Last updated on 2025-10-07
Estimated time: 40 minutes
Overview
Questions
- How do I make a function?
 - How can I test my functions?
 - How should I document my code?
 
Objectives
- Define a function that takes arguments.
 - Return a value from a function.
 - Test a function.
 - Set default values for function arguments.
 - Explain why we should divide programs into small, single-purpose functions.
 
If we only had one data set to analyze, it would probably be faster to load the file into a spreadsheet and use that to plot some simple statistics. But we have twelve files to check, and may have more in the future. In this lesson, we’ll learn how to write a function so that we can repeat several operations with a single command.
Defining a Function
You can write your own functions in order to perform repetitive operations with a single command. Let’s start by defining your function “my_function” and the input parameter(s) that the user will feed to the function. Afterwards you will define the operation that you desire to program in the body of the function within curly braces ({}). Finally, you return the result (or output) of your function with the return statement.
Now let’s see this process with an example. We are going to define a
function fahrenheit_to_celsius that converts temperatures
from Fahrenheit
to Celsius:
R
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}
We define fahrenheit_to_celsius by assigning it to the
output of function. The list of argument names is
contained within parentheses. Next, the body of the function–the
statements that are executed when it runs–is contained within curly
braces ({}). The statements in the body are indented by two
spaces, which makes the code easier to read but does not affect how the
code operates.
When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.
Automatic Returns
In R, it is not necessary to include the return statement. R automatically returns whichever variable is on the last line of the body of the function. While in the learning phase, we will explicitly define the return statement.
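For illustration only, a version without an explicit return might look like the sketch below; the name f_to_c_implicit is hypothetical, used so the lesson's fahrenheit_to_celsius stays unchanged:
R
f_to_c_implicit <- function(temp_F) {
  (temp_F - 32) * 5 / 9  # the value of the last evaluated expression is returned
}
f_to_c_implicit(212)     # returns 100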
Let’s try running our function. Calling our own function is no different from calling any other function:
R
# freezing point of water
fahrenheit_to_celsius(32)
OUTPUT
[1] 0
R
# boiling point of water
fahrenheit_to_celsius(212)
OUTPUT
[1] 100
We’ve successfully called the function that we defined, and we have access to the value that we returned.
Composing Functions
Now that we’ve seen how to turn Fahrenheit into Celsius, we can also turn Celsius into Kelvin:
R
celsius_to_kelvin <- function(temp_C) {
  temp_K <- temp_C + 273.15
  return(temp_K)
}
# freezing point of water in Kelvin
celsius_to_kelvin(0)
OUTPUT
[1] 273.15
What about converting Fahrenheit to Kelvin? We could write out the formula, but we don’t need to. Instead, we can compose the two functions we have already created:
R
fahrenheit_to_kelvin <- function(temp_F) {
  temp_C <- fahrenheit_to_celsius(temp_F)
  temp_K <- celsius_to_kelvin(temp_C)
  return(temp_K)
}
# freezing point of water in Kelvin
fahrenheit_to_kelvin(32.0)
OUTPUT
[1] 273.15
This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here–typically half a dozen to a few dozen lines–but they shouldn’t ever be much longer than that, or the next person who reads it won’t be able to understand what’s going on.
Nesting Function Calls
This example showed the output of fahrenheit_to_celsius
assigned to temp_C, which is then passed to
celsius_to_kelvin to get the final result. It is also
possible to perform this calculation in one line of code, by “nesting”
one function call inside another, like so:
R
# freezing point of water in Kelvin, in one line
celsius_to_kelvin(fahrenheit_to_celsius(32.0))
OUTPUT
[1] 273.15
Here, we call fahrenheit_to_celsius to convert
32.0 from Fahrenheit to Celsius, and immediately pass the
value returned from fahrenheit_to_celsius to
celsius_to_kelvin to convert from Celsius to Kelvin. Our
conversion from Fahrenheit to Kelvin is done, all in one go!
This is convenient, but you should be careful not to nest too many function calls at once - it can become confusing and difficult to read!
Create a Function
In the last lesson, we learned to combine elements
into a vector using the c function,
e.g. x <- c("A", "B", "C") creates a vector
x with three elements. Furthermore, we can extend that
vector again using c, e.g. y <- c(x, "D")
creates a vector y with four elements. Write a function
called highlight that takes two vectors as arguments,
called content and wrapper, and returns a new
vector that has the wrapper vector at the beginning and end of the
content:
R
best_practice <- c("Write", "programs", "for", "people", "not", "computers")
asterisk <- "***"  # R interprets a variable with a single value as a vector
                   # with one element.
highlight(best_practice, asterisk)
OUTPUT
[1] "***"       "Write"     "programs"  "for"       "people"    "not"
[7] "computers" "***"      
R
highlight <- function(content, wrapper) {
  answer <- c(wrapper, content, wrapper)
  return(answer)
}
Create a Function (continued)
If the variable v refers to a vector, then
v[1] is the vector’s first element and
v[length(v)] is its last (the function length
returns the number of elements in a vector). Write a function called
edges that returns a vector made up of just the first and
last elements of its input:
R
dry_principle <- c("Don't", "repeat", "yourself", "or", "others")
edges(dry_principle)
OUTPUT
[1] "Don't"  "others"
R
edges <- function(v) {
   first <- v[1]
   last <- v[length(v)]
   answer <- c(first, last)
   return(answer)
}
The Call Stack
For a deeper understanding of how functions work, you’ll need to learn how they create their own environments and call other functions. Function calls are managed via the call stack. For more details on the call stack, have a look at the supplementary material.
Named Variables and the Scope of Variables
Functions can accept arguments explicitly assigned to a variable name
in the function call functionName(variable = value), as
well as arguments by order:
R
input_1 <- 20
mySum <- function(input_1, input_2 = 10) {
  output <- input_1 + input_2
  return(output)
}
- Given the above code was run, which value does mySum(input_1 = 1, 3) produce?
- 4
 - 11
 - 23
 - 30
 
- If mySum(3) returns 13, why does mySum(input_2 = 3) return an error?
The solution is the first answer, 4: input_1 is assigned 1 by name, and 3 is matched to input_2 by position, so the function returns 1 + 3 = 4.
Read the error message: argument "input_1" is missing, with no default means that no value for input_1 is provided in the function call, and there is no default in the function’s definition. Thus, the addition in the function body cannot be completed.
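A quick sketch to check both answers at the console (assuming mySum has been defined as above):
R
mySum(input_1 = 1, 3)   # 1 is matched to input_1 by name, 3 to input_2 by position: 4
mySum(3)                # 3 goes to input_1; input_2 falls back to its default of 10: 13
# mySum(input_2 = 3)    # error: input_1 has no value and no default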
Testing, Error Handling, and Documenting
Once we start putting things in functions so that we can re-use them, we need to start testing that those functions are working correctly. To see how to do this, let’s write a function to center a dataset around a particular midpoint:
R
center <- function(data, midpoint) {
  new_data <- (data - mean(data)) + midpoint
  return(new_data)
}
We could test this on our actual data, but since we don’t know what the values ought to be, it will be hard to tell if the result was correct. Instead, let’s create a vector of 0s and then center that around 3. This will make it simple to see if our function is working as expected:
R
z <- c(0, 0, 0, 0)
z
OUTPUT
[1] 0 0 0 0
R
center(z, 3)
OUTPUT
[1] 3 3 3 3
That looks right, so let’s try center on our real data. We’ll center the inflammation data from day 4 around 0:
R
dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
centered <- center(dat[, 4], 0)
head(centered)
OUTPUT
[1]  1.25 -0.75  1.25 -1.75  1.25  0.25
It’s hard to tell from the default output whether the result is correct, but there are a few simple tests that will reassure us:
R
# original mean
mean(dat[, 4])
OUTPUT
[1] 1.75
R
# centered mean
mean(centered)
OUTPUT
[1] 0
That seems right: the original mean was about 1.75 and the mean of the centered data is 0. We can even go further and check that the standard deviation hasn’t changed:
R
# original standard deviation
sd(dat[, 4])
OUTPUT
[1] 1.067628
R
# centered standard deviation
sd(centered)
OUTPUT
[1] 1.067628
Those values look the same, but we probably wouldn’t notice if they were different in the sixth decimal place. Let’s do this instead:
R
# difference in standard deviations before and after
sd(dat[, 4]) - sd(centered)
OUTPUT
[1] 0
Sometimes, a very small difference can be detected due to rounding at
very low decimal places. R has a useful function for comparing two
objects allowing for rounding errors, all.equal:
R
all.equal(sd(dat[, 4]), sd(centered))
OUTPUT
[1] TRUE
It’s still possible that our function is wrong, but it seems unlikely enough that we should probably get back to doing our analysis. However, there are two other important tasks to consider: 1) we should ensure our function can provide informative errors when needed, and 2) we should write some documentation for our function to remind ourselves later what it’s for and how to use it.
Error Handling
What happens if we have missing data (NA values) in the
data argument we provide to center?
R
# new data object and set one value in column 4 to NA
datNA <- dat
datNA[10,4] <- NA
# returns all NA values
center(datNA[,4], 0)
OUTPUT
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA
This is likely not the behavior we want, and is caused by the
mean function returning NA when the na.rm = TRUE
argument is not provided. We may wish not to consider NA values in our
center function. We can provide the na.rm = TRUE
argument to solve this issue.
R
center <- function(data, midpoint) {
  new_data <- (data - mean(data, na.rm=TRUE)) + midpoint
  return(new_data)
}
center(datNA[,4], 0)
OUTPUT
 [1]  1.2542373 -0.7457627  1.2542373 -1.7457627  1.2542373  0.2542373
 [7]  0.2542373  0.2542373  1.2542373         NA -1.7457627 -1.7457627
[13] -0.7457627 -1.7457627 -0.7457627 -1.7457627 -1.7457627 -0.7457627
[19] -0.7457627 -1.7457627  1.2542373  1.2542373  1.2542373 -0.7457627
[25] -0.7457627 -0.7457627  0.2542373 -0.7457627  0.2542373 -0.7457627
[31] -1.7457627  1.2542373  0.2542373 -0.7457627  0.2542373  1.2542373
[37]  0.2542373  0.2542373  1.2542373  1.2542373  0.2542373  1.2542373
[43]  1.2542373  1.2542373  1.2542373  0.2542373  1.2542373  1.2542373
[49]  1.2542373  0.2542373 -0.7457627  0.2542373  0.2542373 -0.7457627
[55] -0.7457627  1.2542373  0.2542373 -0.7457627 -0.7457627 -1.7457627
However, what happens if the user were to accidentally hand this
function a factor or character vector?
R
datNA[,1] <- as.factor(datNA[,1])
datNA[,2] <- as.character(datNA[,2])
center(datNA[,1], 0)
WARNING
Warning in mean.default(data, na.rm = TRUE): argument is not numeric or
logical: returning NA
WARNING
Warning in Ops.factor(data, mean(data, na.rm = TRUE)): '-' not meaningful for
factors
OUTPUT
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA
R
center(datNA[,2], 0)
WARNING
Warning in mean.default(data, na.rm = TRUE): argument is not numeric or
logical: returning NA
ERROR
Error in data - mean(data, na.rm = TRUE): non-numeric argument to binary operator
Both of these attempts result in warnings or errors. Luckily, the messages are
quite informative. In other cases, we may need to add in error handling
using the warning and stop functions.
For instance, the center function only works on numeric
vectors. Recognizing this and adding warnings and errors provides
feedback to the user and makes sure the output of the function is what
the user wanted.
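One possible sketch of such a check (an illustration, not the lesson's final version of center) uses stop to reject non-numeric input before any calculation happens:
R
center <- function(data, midpoint) {
  if (!is.numeric(data)) {
    stop("center() expects a numeric vector as its first argument")
  }
  new_data <- (data - mean(data, na.rm = TRUE)) + midpoint
  return(new_data)
}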
Documentation
A common way to put documentation in software is to add comments like this:
R
center <- function(data, midpoint) {
  # return a new vector containing the original data centered around the
  # midpoint.
  # Example: center(c(1, 2, 3), 0) => c(-1, 0, 1)
  new_data <- (data - mean(data)) + midpoint
  return(new_data)
}
Writing Documentation
Formal documentation for R functions is written in separate
.Rd files using a markup language similar to LaTeX. You see the result of
this documentation when you look at the help file for a given function,
e.g. ?read.csv. The roxygen2
package allows R coders to write documentation alongside the function
code and then process it into the appropriate .Rd files.
You will want to switch to this more formal method of writing
documentation when you start writing more complicated R projects.
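As a rough sketch, roxygen2 comments (lines beginning with #') sit directly above the function and use tags such as @param, @return, and @examples; the wording below is illustrative rather than part of this lesson:
R
#' Center a vector around a midpoint
#'
#' @param data A numeric vector.
#' @param midpoint The value around which the data should be centered.
#' @return A numeric vector of the same length as data, centered around midpoint.
#' @examples
#' center(c(1, 2, 3), 0)
center <- function(data, midpoint) {
  new_data <- (data - mean(data)) + midpoint
  return(new_data)
}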
Functions to Create Graphs
Write a function called analyze that takes a filename as
an argument and displays the three graphs produced in the previous lesson (average, min and
max inflammation over time).
analyze("data/inflammation-01.csv") should produce the
graphs already shown, while
analyze("data/inflammation-02.csv") should produce
corresponding graphs for the second data set. Be sure to document your
function with comments.
R
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
Rescaling
Write a function rescale that takes a vector as input
and returns a corresponding vector of values scaled to lie in the range
0 to 1. (If L and H are the lowest and highest
values in the original vector, then the replacement for a value
v should be (v-L) / (H-L).) Be sure to
document your function with comments.
Test that your rescale function is working properly
using min, max, and plot.
R
rescale <- function(v) {
  # Rescales a vector, v, to lie in the range 0 to 1.
  L <- min(v)
  H <- max(v)
  result <- (v - L) / (H - L)
  return(result)
}
Defining Defaults
We have passed arguments to functions in two ways: directly, as in
dim(dat), and by name, as in
read.csv(file = "data/inflammation-01.csv", header = FALSE).
In fact, we can pass the arguments to read.csv without
naming them:
R
dat <- read.csv("data/inflammation-01.csv", FALSE)
However, the position of the arguments matters if they are not named.
R
dat <- read.csv(header = FALSE, file = "data/inflammation-01.csv")
dat <- read.csv(FALSE, "data/inflammation-01.csv")
ERROR
Error in read.table(file = file, header = header, sep = sep, quote = quote, : 'file' must be a character string or connection
To understand what’s going on, and make our own functions easier to
use, let’s re-define our center function like this:
R
center <- function(data, midpoint = 0) {
  # return a new vector containing the original data centered around the
  # midpoint (0 by default).
  # Example: center(c(1, 2, 3), 0) => c(-1, 0, 1)
  new_data <- (data - mean(data)) + midpoint
  return(new_data)
}
The key change is that the second argument is now written
midpoint = 0 instead of just midpoint. If we
call the function with two arguments, it works as it did before:
R
test_data <- c(0, 0, 0, 0)
center(test_data, 3)
OUTPUT
[1] 3 3 3 3
But we can also now call center() with just one
argument, in which case midpoint is automatically assigned
the default value of 0:
R
more_data <- 5 + test_data
more_data
OUTPUT
[1] 5 5 5 5
R
center(more_data)
OUTPUT
[1] 0 0 0 0
This is handy: if we usually want a function to work one way, but occasionally need it to do something else, we can allow people to pass an argument when they need to but provide a default to make the normal case easier.
The example below shows how R matches values to arguments.
R
display <- function(a = 1, b = 2, c = 3) {
  result <- c(a, b, c)
  names(result) <- c("a", "b", "c")  # This names each element of the vector
  return(result)
}
# no arguments
display()
OUTPUT
a b c
1 2 3 
R
# one argument
display(55)
OUTPUT
 a  b  c
55  2  3 
R
# two arguments
display(55, 66)
OUTPUT
 a  b  c
55 66  3 
R
# three arguments
display(55, 66, 77)
OUTPUT
 a  b  c
55 66 77 
As this example shows, arguments are matched from left to right, and any that haven’t been given a value explicitly get their default value. We can override this behavior by naming the value as we pass it in:
R
# only setting the value of c
display(c = 77)
OUTPUT
 a  b  c
 1  2 77 
Matching Arguments
To be precise, R has three ways that arguments supplied by you are matched to the formal arguments of the function definition:
- by complete name,
 - by partial name (matching on initial n characters of the argument name), and
 - by position.
 
Arguments are matched in the manner outlined above in that order: by complete name, then by partial matching of names, and finally by position.
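A small sketch of these rules in action; the function describe and its argument names are hypothetical, chosen so that partial matching is easy to see:
R
describe <- function(first = 1, second = 2, third = 3) {
  c(first = first, second = second, third = third)
}
describe(99, th = 33)  # "th" partially matches "third"; 99 then fills "first" by position
                       # result: first = 99, second = 2, third = 33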
With that in hand, let’s look at the help for
read.csv():
R
?read.csv
There’s a lot of information there, but the most important part is the first couple of lines:
R
read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)
This tells us that read.csv() has one argument,
file, that doesn’t have a default value, and six others
that do. Now we understand why the following gives an error:
R
dat <- read.csv(FALSE, "data/inflammation-01.csv")
ERROR
Error in read.table(file = file, header = header, sep = sep, quote = quote, : 'file' must be a character string or connection
It fails because FALSE is assigned to file
and the filename is assigned to the argument header.
A Function with Default Argument Values
Rewrite the rescale function so that it scales a vector
to lie between 0 and 1 by default, but will allow the caller to specify
lower and upper bounds if they want. Compare your implementation to your
neighbor’s: Do your two implementations produce the same results when
both are given the same input vector and parameters?
R
rescale <- function(v, lower = 0, upper = 1) {
  # Rescales a vector, v, to lie in the range lower to upper.
  L <- min(v)
  H <- max(v)
  result <- (v - L) / (H - L) * (upper - lower) + lower
  return(result)
}
- Define a function using name <- function(...args...) {...body...}.
 - Call a function using name(...values...).
 - R looks for variables in the current stack frame before looking for them at the top level.
 - Use help(thing) to view help for something.
 - Put comments at the beginning of functions to provide help for that function.
 - Annotate your code!
 - Specify default values for arguments when defining a function using name = value in the argument list.
 - Arguments can be passed by matching based on name, by position, or by omitting them (in which case the default value is used).
Content from Analyzing Multiple Data Sets
Last updated on 2025-10-07
Estimated time: 40 minutes
Overview
Questions
- How can I do the same thing to multiple data sets?
 - How do I write a for loop?
Objectives
- Explain what a for loop does.
 - Correctly write for loops to repeat simple calculations.
 - Trace changes to a loop variable as the loop runs.
 - Trace changes to other variables as they are updated by a for loop.
 - Use a function to get a list of filenames that match a simple pattern.
 - Use a for loop to process multiple files.
We have created a function called analyze that creates
graphs of the minimum, average, and maximum daily inflammation rates for
a single data set:
R
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # The argument or input is a character string representing the name and location of a CSV file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
analyze("data/inflammation-01.csv")
We can use it to analyze other data sets one by one:
R
analyze("data/inflammation-02.csv")
but we have a dozen data sets right now and more on the way. We want to create plots for all our data sets with a single statement. To do that, we’ll have to teach the computer how to repeat things.
For Loops
Suppose we want to print each word in a sentence. One way is to use
six print statements:
R
best_practice <- c("Let", "the", "computer", "do", "the", "work")
print_words <- function(sentence) {
  print(sentence[1])
  print(sentence[2])
  print(sentence[3])
  print(sentence[4])
  print(sentence[5])
  print(sentence[6])
}
print_words(best_practice)
OUTPUT
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] "work"
but that’s a bad approach for two reasons:
It doesn’t scale: if we want to print the elements in a vector that’s hundreds long, we’d be better off just typing them in.
It’s fragile: if we give it a longer vector, it only prints part of the data, and if we give it a shorter input, it returns
NA values because we’re asking for elements that don’t exist!
R
best_practice[-6]
OUTPUT
[1] "Let"      "the"      "computer" "do"       "the"     
R
print_words(best_practice[-6])
OUTPUT
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] NA
Not Available
R has a special variable, NA, for designating
missing values that are Not Available
in a data set. See ?NA and An
Introduction to R for more details.
Here’s a better approach:
R
print_words <- function(sentence) {
  for (word in sentence) {
    print(word)
  }
}
print_words(best_practice)
OUTPUT
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] "work"
This is shorter - certainly shorter than something that prints every character in a hundred-letter string - and more robust as well:
R
print_words(best_practice[-6])
OUTPUT
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
The improved version of print_words uses a for loop to repeat an operation—in
this case, printing—once for each thing in a collection. The general
form of a loop is:
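R
for (variable in collection) {
  # do things with each value of variable here
}
Here, variable and collection are placeholders rather than actual R objects.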
We can name the loop
variable anything we like (with a few restrictions,
e.g. the name of the variable cannot start with a digit).
in is part of the for syntax. Note that the
condition (variable in collection) is enclosed in
parentheses, and the body of the loop is enclosed in curly braces
{ }. For a single-line loop body, as here, the braces
aren’t needed, but it is good practice to include them as we did.
Here’s another loop that repeatedly updates a variable:
R
len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
  len <- len + 1
}
# Number of vowels
len
OUTPUT
[1] 5
It’s worth tracing the execution of this little program step by step.
Since there are five elements in the vector vowels, the
statement inside the loop will be executed five times. The first time
around, len is zero (the value assigned to it on line 1)
and v is "a". The statement adds 1 to the old
value of len, producing 1, and updates len to
refer to that new value. The next time around, v is
"e" and len is 1, so len is
updated to be 2. After three more updates, len is 5; since
there is nothing left in the vector vowels for R to
process, the loop finishes.
Note that a loop variable is just a variable that’s being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:
R
letter <- "z"
for (letter in c("a", "b", "c")) {
  print(letter)
}
OUTPUT
[1] "a"
[1] "b"
[1] "c"
R
# after the loop, letter is
letter
OUTPUT
[1] "c"
Note also that finding the length of a vector is such a common
operation that R actually has a built-in function to do it called
length:
R
length(vowels)
OUTPUT
[1] 5
length is much faster than any R function we could write
ourselves, and much easier to read than a two-line loop; it will also
give us the length of many other things that we haven’t met yet, so we
should always use it when we can (see this lesson to learn more about the
different ways to store data in R).
Printing Numbers
R has a built-in function called seq that creates a list
of numbers:
R
seq(3)
OUTPUT
[1] 1 2 3
Using seq, write a function that prints the first
N natural numbers, one per line:
R
print_N(3)
OUTPUT
[1] 1
[1] 2
[1] 3
R
print_N <- function(N) {
  nseq <- seq(N)
  for (num in nseq) {
    print(num)
  }
}
Summing Values
Write a function called total that calculates the sum of
the values in a vector. (R has a built-in function called
sum that does this for you. Please don’t use it for this
exercise.)
R
ex_vec <- c(4, 8, 15, 16, 23, 42)
total(ex_vec)
OUTPUT
[1] 108
R
total <- function(vec) {
  # calculates the sum of the values in a vector
  vec_sum <- 0
  for (num in vec) {
    vec_sum <- vec_sum + num
  }
  return(vec_sum)
}
Exponentiation
Exponentiation is built into R:
R
2^4
OUTPUT
[1] 16
Write a function called expo that uses a loop to
calculate the same result.
R
expo(2, 4)
OUTPUT
[1] 16
R
expo <- function(base, power) {
  result <- 1
  for (i in seq(power)) {
    result <- result * base
  }
  return(result)
}
Processing Multiple Files
We now have almost everything we need to process all our data files.
The only thing that’s missing is a function that finds files whose names
match a pattern. We do not need to write it ourselves because R already
has a function to do this called list.files.
If we run the function without any arguments,
list.files(), it returns every file in the current working
directory. We can understand this result by reading the help file
(?list.files). The first argument, path, is
the path to the directory to be searched, and it has the default value
of "." (recall from the lesson
on the Unix Shell that "." is shorthand for the current
working directory). The second argument, pattern, is the
pattern being searched, and it has the default value of
NULL. Since no pattern is specified to filter the files,
all files are returned.
So to list all the csv files, we could run either of the following:
R
list.files(path = "data", pattern = "csv")
OUTPUT
 [1] "car-speeds-cleaned.csv" "car-speeds.csv"         "inflammation-01.csv"
 [4] "inflammation-02.csv"    "inflammation-03.csv"    "inflammation-04.csv"
 [7] "inflammation-05.csv"    "inflammation-06.csv"    "inflammation-07.csv"
[10] "inflammation-08.csv"    "inflammation-09.csv"    "inflammation-10.csv"
[13] "inflammation-11.csv"    "inflammation-12.csv"    "sample.csv"
[16] "small-01.csv"           "small-02.csv"           "small-03.csv"          
R
list.files(path = "data", pattern = "inflammation")
OUTPUT
 [1] "inflammation-01.csv"            "inflammation-02.csv"
 [3] "inflammation-03.csv"            "inflammation-04.csv"
 [5] "inflammation-05.csv"            "inflammation-06.csv"
 [7] "inflammation-07.csv"            "inflammation-08.csv"
 [9] "inflammation-09.csv"            "inflammation-10.csv"
[11] "inflammation-11.csv"            "inflammation-12.csv"
[13] "r-novice-inflammation-data.zip"
Organizing Larger Projects
For larger projects, it is recommended to organize separate parts of the analysis into multiple subdirectories, e.g. one subdirectory for the raw data, one for the code, and one for the results like figures. We have done that here to some extent, putting all of our data files into the subdirectory “data”. For more advice on this topic, you can read A quick guide to organizing computational biology projects by William Stafford Noble.
As these examples show, the result of list.files is a vector of
strings, which means we can loop over it to do something with each
filename in turn. In our case, the “something” we want is our
analyze function.
Because we have put our data in a separate subdirectory, if we want
to access these files using the output of list.files we
also need to include the “path” portion of the file name. We can do that
by using the argument full.names = TRUE.
R
list.files(path = "data", pattern = "csv", full.names = TRUE)
OUTPUT
 [1] "data/car-speeds-cleaned.csv" "data/car-speeds.csv"
 [3] "data/inflammation-01.csv"    "data/inflammation-02.csv"
 [5] "data/inflammation-03.csv"    "data/inflammation-04.csv"
 [7] "data/inflammation-05.csv"    "data/inflammation-06.csv"
 [9] "data/inflammation-07.csv"    "data/inflammation-08.csv"
[11] "data/inflammation-09.csv"    "data/inflammation-10.csv"
[13] "data/inflammation-11.csv"    "data/inflammation-12.csv"
[15] "data/sample.csv"             "data/small-01.csv"
[17] "data/small-02.csv"           "data/small-03.csv"          
R
list.files(path = "data", pattern = "inflammation", full.names = TRUE)
OUTPUT
 [1] "data/inflammation-01.csv"            "data/inflammation-02.csv"
 [3] "data/inflammation-03.csv"            "data/inflammation-04.csv"
 [5] "data/inflammation-05.csv"            "data/inflammation-06.csv"
 [7] "data/inflammation-07.csv"            "data/inflammation-08.csv"
 [9] "data/inflammation-09.csv"            "data/inflammation-10.csv"
[11] "data/inflammation-11.csv"            "data/inflammation-12.csv"
[13] "data/r-novice-inflammation-data.zip"
Let’s test out running our analyze function by using it
on the first three files in the vector returned by
list.files:
R
filenames <- list.files(path = "data",
                        # Now follows a regular expression that matches:
                        pattern = "inflammation-[0-9]{2}.csv",
                        #          |            |        the standard file extension of comma-separated values
                        #          |            the variable parts (two digits, each between 0 and 9)
                        #          the static part of the filenames
                        full.names = TRUE)
filenames <- filenames[1:3]
for (f in filenames) {
  print(f)
  analyze(f)
}
OUTPUT
[1] "data/inflammation-01.csv"
OUTPUT
[1] "data/inflammation-02.csv"
OUTPUT
[1] "data/inflammation-03.csv"
Sure enough, the maxima of these data sets show exactly the same ramp as the first, and their minima show the same staircase structure.
Other Ways to Do It
In this lesson we saw how to use a simple for loop to
repeat an operation. As you progress with R, you will learn that there
are multiple ways to accomplish this. Sometimes the choice of one method
over another is more a matter of personal style, but other times it can
have consequences for the speed of your code. For instruction on best
practices, see this supplementary lesson that demonstrates how to
properly repeat operations in R.
Using Loops to Analyze Multiple Files
Write a function called analyze_all that takes a folder
path and a filename pattern as its arguments and runs
analyze for each file whose name matches the pattern.
R
analyze_all <- function(folder = "data", pattern) {
  # Runs the function analyze for each file in the given folder
  # that contains the given pattern.
  filenames <- list.files(path = folder, pattern = pattern, full.names = TRUE)
  for (f in filenames) {
    analyze(f)
  }
}
- Use for (variable in collection) to process the elements of a collection one at a time.
- The body of a for loop is surrounded by curly braces ({}).
- Use length(thing) to determine the length of something that contains other values.
- Use list.files(path = "path", pattern = "pattern", full.names = TRUE) to create a list of files whose names match a pattern.
Content from Making Choices
Last updated on 2025-10-07 | Edit this page
Estimated time: 40 minutes
Overview
Questions
- How do I make choices using if and else statements?
- How do I compare values?
- How do I save my plots to a PDF file?
 
Objectives
- Save plot(s) in a PDF file.
- Write conditional statements with if and else.
- Correctly evaluate expressions containing && (‘and’) and || (‘or’).
Our previous lessons have shown us how to manipulate data, define our own functions, and repeat things. However, the programs we have written so far always do the same things, regardless of what data they’re given. We want programs to make choices based on the values they are manipulating.
Saving Plots to a File
So far, we have built a function analyze to plot summary
statistics of the inflammation data:
R
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
And also built the function analyze_all to automate the
processing of each data file:
R
analyze_all <- function(folder = "data", pattern) {
  # Runs the function analyze for each file in the given folder
  # that contains the given pattern.
  filenames <- list.files(path = folder, pattern = pattern, full.names = TRUE)
  for (f in filenames) {
    analyze(f)
  }
}
While these are useful in an interactive R session, what if we want
to send our results to our collaborators? Since we currently have 12
data sets, running analyze_all(pattern = "inflammation.*csv")
creates 36 plots. Saving each of these individually would be tedious and
error-prone. And in the likely situation that we want to change how the
data is processed or the look of the plots, we would have to once again
save all 36 before sharing the updated results with our
collaborators.
Here’s how we can save all three plots of the first inflammation data set in a pdf file:
R
pdf("inflammation-01.pdf")
analyze("data/inflammation-01.csv")
dev.off()
The function pdf redirects all the plots generated by R
into a pdf file, which in this case we have named “inflammation-01.pdf”.
After we are done generating the plots to be saved in the pdf file, we
stop R from redirecting plots with the function
dev.off.
Overwriting Plots
If you run pdf multiple times without running
dev.off, you will save plots to the most recently opened
file. However, you won’t be able to open the previous pdf files because
the connections were not closed. In order to get out of this situation,
you’ll need to run dev.off until all the pdf connections
are closed. You can check your current status using the function
dev.cur. If it says “pdf”, all your plots are being saved
in the last pdf specified. If it says “null device” or “RStudioGD”, the
plots will be visualized normally.
We can update the analyze function so that it always
saves the plots in a pdf. But that would make it more difficult to
interactively test out new changes. It would be ideal if
analyze would either save or not save the plots based on
its input.
Conditionals
In order to update our function to decide between saving or not, we need to write code that automatically decides between multiple options. The computer can make these decisions through logical comparisons.
R
num <- 37
num > 100
OUTPUT
[1] FALSE
As 37 is not greater than 100, this returns a FALSE
object. And as you likely guessed, the opposite of FALSE is
TRUE.
R
num < 100
OUTPUT
[1] TRUE
We pair these logical comparison tools with what R calls a conditional statement, and it looks like this:
R
num <- 37
if (num > 100) {
  print("greater")
} else {
  print("not greater")
}
print("done")
OUTPUT
[1] "not greater"
[1] "done"
The second line of this code uses an if statement to
tell R that we want to make a choice. If the following test is
TRUE, the body of the if (i.e., the lines in
the curly braces underneath it) are executed. If the test is
FALSE, the body of the else is executed
instead. Only one or the other is ever executed:
In the example above, the test num > 100 returns the
value FALSE, which is why the code inside the
if block was skipped and the code inside the
else statement was run instead.
Conditional statements don’t have to include an else. If
there isn’t one, R simply does nothing if the test is false:
R
num <- 53
if (num > 100) {
  print("num is greater than 100")
}
We can also chain several tests together when there are more than two options. This makes it simple to write a function that returns the sign of a number:
R
sign <- function(num) {
  if (num > 0) {
    return(1)
  } else if (num == 0) {
    return(0)
  } else {
    return(-1)
  }
}
sign(-3)
OUTPUT
[1] -1
R
sign(0)
OUTPUT
[1] 0
R
sign(2/3)
OUTPUT
[1] 1
Note that when combining else and if in an
else if statement, the if portion still
requires a direct input condition. This is never the case for the
else statement alone, which is only executed if all other
conditions go unsatisfied. Note that the test for equality uses two
equal signs, ==.
Other Comparisons
Other tests include greater than or equal to (>=),
less than or equal to (<=), and not equal to
(!=).
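For instance, with a few arbitrary values:
R
5 >= 5   # TRUE: 5 is greater than or equal to 5
3 <= 2   # FALSE
4 != 4   # FALSE: the values are equal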
We can also combine tests. Two ampersands, &&,
symbolize “and”. Two vertical bars, ||, symbolize “or”.
&& is only true if both parts are true:
R
if (1 > 0 && -1 > 0) {
  print("both parts are true")
} else {
  print("at least one part is not true")
}
OUTPUT
[1] "at least one part is not true"
while || is true if either part is true:
R
if (1 > 0 || -1 > 0) {
  print("at least one part is true")
} else {
  print("neither part is true")
}
OUTPUT
[1] "at least one part is true"
In this case, “either” means “either or both”, not “either one or the other but not both”.
Special case of the NA
variable
You may remember from the previous lesson that R has a special
variable, NA, for designating missing values. Because it
represents the absence of a value, we will not be able to test its
equality or inequality with another value. Such tests always return
NA:
R
a <- NA
a == 1
OUTPUT
[1] NA
R
a == NA
OUTPUT
[1] NA
We need to be particularly careful in tests, because using an NA value in the condition of an if statement results in an error:
R
if (a == NA) {
  print("Hi!")
}
ERROR
Error in if (a == NA) {: missing value where TRUE/FALSE needed
To solve this issue, we need to use the dedicated
is.na() function:
R
is.na(a)
OUTPUT
[1] TRUE
R
if (is.na(a)) {
  print("Hi!")
}
OUTPUT
[1] "Hi!"
Choosing Plots Based on Data
Write a function plot_dist that plots a boxplot if the
length of the vector is greater than a specified threshold and a
stripchart otherwise. To do this you’ll use the R functions
boxplot and stripchart.
R
dat <- read.csv("data/inflammation-01.csv", header = FALSE)
plot_dist(dat[, 10], threshold = 10)     # day (column) 10
R
plot_dist(dat[1:5, 10], threshold = 10)  # samples (rows) 1-5 on day (column) 10
R
plot_dist <- function(x, threshold) {
  if (length(x) > threshold) {
    boxplot(x)
  } else {
    stripchart(x)
  }
}
Histograms Instead
One of your collaborators prefers to see the distributions of the
larger vectors as a histogram instead of as a boxplot. In order to
choose between a histogram and a boxplot we will edit the function
plot_dist and add an additional argument
use_boxplot. By default we will set
use_boxplot to TRUE which will create a
boxplot when the vector is longer than threshold. When
use_boxplot is set to FALSE,
plot_dist will instead plot a histogram for the larger
vectors. As before, if the length of the vector is shorter than
threshold, plot_dist will create a stripchart.
A histogram is made with the hist command in R.
R
dat <- read.csv("data/inflammation-01.csv", header = FALSE)
plot_dist(dat[, 10], threshold = 10, use_boxplot = TRUE)   # day (column) 10 - create boxplot
R
plot_dist(dat[, 10], threshold = 10, use_boxplot = FALSE)  # day (column) 10 - create histogram
R
plot_dist(dat[1:5, 10], threshold = 10)                    # samples (rows) 1-5 on day (column) 10
R
plot_dist <- function(x, threshold, use_boxplot = TRUE) {
   if (length(x) > threshold && use_boxplot) {
       boxplot(x)
   } else if (length(x) > threshold && !use_boxplot) {
       hist(x)
   } else {
       stripchart(x)
   }
}
Find the Maximum Inflammation Score
Find the file containing the patient with the highest average inflammation score. Print the file name, the patient number (row number) and the value of the maximum average inflammation score.
Tips:
- Use variables to store the maximum average and update it as you go through files and patients.
 - You can use nested loops (one loop is inside the other) to go through the files as well as through the patients in each file (every row).
 
Complete the code below:
R
filenames <- list.files(path = "data", pattern = "inflammation-[0-9]{2}.csv", full.names = TRUE)
filename_max <- "" # filename where the maximum average inflammation patient is found
patient_max <- 0 # index (row number) for this patient in this file
average_inf_max <- 0 # value of the average inflammation score for this patient
for (f in filenames) {
  dat <- read.csv(file = f, header = FALSE)
  dat.means <- apply(dat, 1, mean)
  for (patient_index in 1:length(dat.means)){
    patient_average_inf <- dat.means[patient_index]
    # Add your code here ...
  }
}
print(filename_max)
print(patient_max)
print(average_inf_max)
R
# Add your code here ...
if (patient_average_inf > average_inf_max) {
  average_inf_max <- patient_average_inf
  filename_max <- f
  patient_max <- patient_index
}
Saving Automatically Generated Figures
Now that we know how to have R make decisions based on input values,
let’s update analyze:
R
analyze <- function(filename, output = NULL) {
  # Plots the average, min, and max inflammation over time.
  # Input:
  #    filename: character string of a csv file
  #    output: character string of pdf file for saving
  if (!is.null(output)) {
    pdf(output)
  }
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
  if (!is.null(output)) {
    dev.off()
  }
}
We added an argument, output, that by default is set to
NULL. An if statement at the beginning checks
the argument output to decide whether or not to save the
plots to a pdf. Let’s break it down. The function is.null
returns TRUE if a variable is NULL and
FALSE otherwise. The exclamation mark, !,
stands for “not”. Therefore the line in the if block is
only executed if output is “not null”.
R
output <- NULL
is.null(output)
OUTPUT
[1] TRUE
R
!is.null(output)
OUTPUT
[1] FALSE
Now we can use analyze interactively, as before,
R
analyze("data/inflammation-01.csv")
but also use it to save plots,
R
analyze("data/inflammation-01.csv", output = "inflammation-01.pdf")
Before going further, we will create a directory results
for saving our plots. It is good
practice in data analysis projects to save all output to a directory
separate from the data and analysis code. You can create this directory
using the shell command mkdir,
or the R function dir.create()
R
dir.create("results")
Now run analyze and save the plot in the
results directory,
R
analyze("data/inflammation-01.csv", output = "results/inflammation-01.pdf")
This now works well when we want to process one data file at a time,
but how can we specify the output file in analyze_all? We
need to do two things:
- Substitute the filename ending “csv” with “pdf”.
- Save the plot to the results directory.
To change the extension to “pdf”, we will use the function
sub,
R
f <- "inflammation-01.csv"
sub("csv", "pdf", f)
OUTPUT
[1] "inflammation-01.pdf"
To add the “results” directory to the filename use the function
file.path,
R
file.path("results", sub("csv", "pdf", f))
OUTPUT
[1] "results/inflammation-01.pdf"
Now let’s update analyze_all:
R
analyze_all <- function(pattern) {
  # Directory name containing the data
  data_dir <- "data"
  # Directory name for results
  results_dir <- "results"
  # Runs the function analyze for each file in the current working directory
  # that contains the given pattern.
  filenames <- list.files(path = data_dir, pattern = pattern)
  for (f in filenames) {
    pdf_name <- file.path(results_dir, sub("csv", "pdf", f))
    analyze(file.path(data_dir, f), output = pdf_name)
  }
}
Now we can save all of the results with just one line of code:
R
analyze_all("inflammation.*csv")
Now if we need to make any changes to our analysis, we can edit the
analyze function and quickly regenerate all the figures
with analyze_all.
Changing the Behavior of the Plot Command
One of your collaborators asks if you can recreate the figures with
lines instead of points. Find the relevant argument to plot
by reading the documentation (?plot), update
analyze, and then recreate all the figures with
analyze_all.
R
analyze <- function(filename, output = NULL) {
  # Plots the average, min, and max inflammation over time.
  # Input:
  #    filename: character string of a csv file
  #    output: character string of pdf file for saving
  if (!is.null(output)) {
    pdf(output)
  }
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation, type = "l")
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation, type = "l")
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation, type = "l")
  if (!is.null(output)) {
    dev.off()
  }
}
- Save a plot in a pdf file using pdf("name.pdf") and stop writing to the pdf file with dev.off().
- Use if (condition) to start a conditional statement, else if (condition) to provide additional tests, and else to provide a default.
- The bodies of conditional statements must be surrounded by curly braces { }.
- Use == to test for equality.
- X && Y is only true if both X and Y are true.
- X || Y is true if either X or Y, or both, are true.
Content from Command-Line Programs
Last updated on 2025-10-07 | Edit this page
Estimated time: 80 minutes
Overview
Questions
- How do I write a command-line script?
 - How do I read in arguments from the command-line?
 
Objectives
- Use the values of command-line arguments in a program.
 - Handle flags and files separately in a command-line program.
 - Read data from standard input in a program so that it can be used in a pipeline.
 
The R Console and other interactive tools like RStudio are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a data set and prints the average inflammation per patient:
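Something like the following sketch, using the readings.R script that we will build up over the rest of this lesson:
BASH
Rscript readings.R --mean data/inflammation-01.csv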
but we might also want to look at the minimum of the first four lines
or the maximum inflammations in several files one after another:
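Sketches of those invocations, again assuming the finished readings.R script, might look like:
BASH
head -4 data/inflammation-01.csv | Rscript readings.R --min
Rscript readings.R --max data/inflammation-01.csv data/inflammation-02.csv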
Our overall requirements are:
- If no filename is given on the command line, read data from standard input.
 - If one or more filenames are given, read data from them and report statistics for each file separately.
- Use the --min, --mean, or --max flag to determine what statistic to print.
To make this work, we need to know how to handle command-line arguments in a program, and how to get at standard input. We’ll tackle these questions in turn below.
Command-Line Arguments
Using the text editor of your choice, save the following line of code
in a text file called session-info.R:
OUTPUT
sessionInfo()
The function, sessionInfo, outputs the version of R you
are running as well as the type of computer you are using (as well as
the versions of the packages that have been loaded). This is very useful
information to include when asking others for help with your R code.
Now we can run the code in the file we created from the Unix Shell
using Rscript:
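For example, from the working directory that contains the script:
BASH
Rscript session-info.R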
OUTPUT
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
loaded via a namespace (and not attached):
[1] compiler_4.5.1
The Right Directory
If that did not work, you might have seen an error message indicating
that the file session-info.R does not exist. Remember that
you must be in the correct directory, the one in which you created your
script file. You can determine which directory you are currently in
using pwd and change to a different directory using
cd. For a review, see this lesson.
Now let’s create another script that does something more interesting.
Write the following lines in a file named print-args.R:
OUTPUT
args <- commandArgs()
cat(args, sep = "\n")
The function commandArgs extracts all the command line
arguments and returns them as a vector. The function cat,
similar to the cat of the Unix Shell, outputs the contents
of the variable. Since we did not specify a filename for writing,
cat sends the output to standard output, which
we can then pipe to other Unix functions. Because we set the argument
sep to "\n", which is the symbol to start a
new line, each element of the vector is printed on its own line. Let’s
see what happens when we run this program in the Unix Shell:
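A plain invocation, with no extra arguments, looks like this:
BASH
Rscript print-args.R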
OUTPUT
/usr/local/lib/R/bin/exec/R
--no-save
--no-restore
--no-echo
--no-restore
--file=print-args.R
From this output, we learn that Rscript is just a
convenience command for running R scripts. The first argument in the
vector is the path to the R executable. The following are
all command-line arguments that affect the behavior of R. From the R
help file:
- --no-echo: Make R run as quietly as possible
- --no-restore: Don’t restore anything that was created during the R session
- --file: Run this file
- --args: Pass these arguments to the file being run
Thus running a file with Rscript is an easier way to run the following:
OUTPUT
/usr/local/lib/R/bin/exec/R
--no-save
--no-restore
--no-echo
--no-restore
--file=print-args.R
--args
If we run it with a few arguments, however:
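For instance:
BASH
Rscript print-args.R first second third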
OUTPUT
/usr/local/lib/R/bin/exec/R
--no-save
--no-restore
--no-echo
--no-restore
--file=print-args.R
--args
first
second
third
then commandArgs adds each of those arguments to the
vector it returns. Since the first elements of the vector are always the
same, we can tell commandArgs to only return the arguments
that come after --args. Let’s update
print-args.R and save it as
print-args-trailing.R:
OUTPUT
args <- commandArgs(trailingOnly = TRUE)
cat(args, sep = "\n")
And then run print-args-trailing from the Unix
Shell:
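Using the same example arguments as before:
BASH
Rscript print-args-trailing.R first second third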
OUTPUT
first
second
third
Now commandArgs returns only the arguments that we
listed after print-args-trailing.R.
With this in hand, let’s build a version of readings.R
that always prints the per-patient (per-row) mean of a single data file.
The first step is to write a function that outlines our implementation,
and a placeholder for the function that does the actual work. By
convention this function is usually called main, though we
can call it whatever we want. Write the following code in a file called
readings-01.R:
OUTPUT
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  filename <- args[1]
  dat <- read.csv(file = filename, header = FALSE)
  mean_per_patient <- apply(dat, 1, mean)
  cat(mean_per_patient, sep = "\n")
}
This function gets the name of the file to process from the first
element returned by commandArgs. Here’s a simple test to
run from the Unix Shell:
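For instance, pointing it at the first data file:
BASH
Rscript readings-01.R data/inflammation-01.csv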
There is no output because we have defined a function, but haven’t
actually called it. Let’s add a call to main and save it as
readings-02.R:
OUTPUT
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  filename <- args[1]
  dat <- read.csv(file = filename, header = FALSE)
  mean_per_patient <- apply(dat, 1, mean)
  cat(mean_per_patient, sep = "\n")
}
main()
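Running it on the first data file from the Unix Shell:
BASH
Rscript readings-02.R data/inflammation-01.csv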
OUTPUT
5.45
5.425
6.1
5.9
5.55
6.225
5.975
6.65
6.625
6.525
6.775
5.8
6.225
5.75
5.225
6.3
6.55
5.7
5.85
6.55
5.775
5.825
6.175
6.1
5.8
6.425
6.05
6.025
6.175
6.55
6.175
6.35
6.725
6.125
7.075
5.725
5.925
6.15
6.075
5.75
5.975
5.725
6.3
5.9
6.75
5.925
7.225
6.15
5.95
6.275
5.7
6.1
6.825
5.975
6.725
5.7
6.25
6.4
7.05
5.9
A Simple Command-Line Program
- Write a command-line program that does addition and subtraction of two numbers.
 
Every argument read from the command line is interpreted as a
character string. You can convert
from a string to a number using the function
as.numeric.
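The expected behavior, using the file name arith.R from the usage comment in the solution below (the first command corresponds to the first output, the second to the second):
BASH
Rscript arith.R 1 + 2
Rscript arith.R 3 - 4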
OUTPUT
3
OUTPUT
-1
OUTPUT
main <- function() {
  # Performs addition or subtraction from the command line.
  #
  # Takes three arguments:
  # The first and third are the numbers.
  # The second is either + for addition or - for subtraction.
  #
  # Ex. usage:
  #   Rscript arith.R 1 + 2
  #   Rscript arith.R 3 - 4
  #
  args <- commandArgs(trailingOnly = TRUE)
  num1 <- as.numeric(args[1])
  operation <- args[2]
  num2 <- as.numeric(args[3])
  if (operation == "+") {
    answer <- num1 + num2
    cat(answer)
  } else if (operation == "-") {
    answer <- num1 - num2
    cat(answer)
  } else {
    stop("Invalid input. Use + for addition or - for subtraction.")
  }
}
main()
A Simple Command-Line Program (continued)
- What goes wrong if you try to add multiplication using * to the program?
An error message is returned due to “invalid input.” This is likely because ‘*’ has a special meaning in the shell, as a wildcard.
A Simple Command-Line Program (continued)
- Using the function list.files introduced in a previous lesson, write a command-line program called find-pattern.R that lists all the files in the current directory that contain a specific pattern:
BASH
# For example, searching for the pattern "print-args" returns the two scripts we wrote earlier
Rscript find-pattern.R print-args
OUTPUT
print-args-trailing.R
print-args.R
OUTPUT
main <- function() {
  # Finds all files in the current directory that contain a given pattern.
  #
  # Takes one argument: the pattern to be searched.
  #
  # Ex. usage:
  #   Rscript find-pattern.R csv
  #
  args <- commandArgs(trailingOnly = TRUE)
  pattern <- args[1]
  files <- list.files(pattern = pattern)
  cat(files, sep = "\n")
}
main()
Handling Multiple Files
The next step is to teach our program how to handle multiple files. Since 60 lines of output per file is a lot to page through, we’ll start by using three smaller files, each of which has three days of data for two patients. Let’s investigate them from the Unix Shell:
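The three outputs below correspond, in order, to shell commands along these lines (a sketch: list the small files, inspect the first one, and run our current script on it):
BASH
ls data/small-*.csv
cat data/small-01.csv
Rscript readings-02.R data/small-01.csv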
OUTPUT
data/small-01.csv
data/small-02.csv
data/small-03.csv
OUTPUT
0,0,1
0,1,2
OUTPUT
0.3333333
1
Using small data files as input also allows us to check our results more easily: here, for example, we can see that our program is calculating the mean correctly for each line, whereas we were really taking it on faith before. This is yet another rule of programming: test the simple things first.
We want our program to process each file separately, so we need a
loop that executes once for each filename. If we specify the files on
the command line, the filenames will be returned by
commandArgs(trailingOnly = TRUE). We’ll need to handle an
unknown number of filenames, since our program could be run for any
number of files.
The solution is to loop over the vector returned by
commandArgs(trailingOnly = TRUE). Here’s our changed
program, which we’ll save as readings-03.R:
OUTPUT
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  for (filename in args) {
    dat <- read.csv(file = filename, header = FALSE)
    mean_per_patient <- apply(dat, 1, mean)
    cat(mean_per_patient, sep = "\n")
  }
}
main()
and here it is in action:
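For example, on the first two small files:
BASH
Rscript readings-03.R data/small-01.csv data/small-02.csv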
OUTPUT
0.3333333
1
13.66667
11
Note: at this point, we have created three versions
of our script called readings-01.R,
readings-02.R, and readings-03.R. We wouldn’t
do this in real life: instead, we would have one file called
readings.R that we committed to version control every time
we got an enhancement working. For teaching, though, we need all the
successive versions side by side.
A Command Line Program with Arguments
Write a program called check.R that takes the names of
one or more inflammation data files as arguments and checks that all the
files have the same number of rows and columns. What is the best way to
test your program?
OUTPUT
main <- function() {
  # Checks that all csv files have the same number of rows and columns.
  #
  # Takes multiple arguments: the names of the files to be checked.
  #
  # Ex. usage:
  #   Rscript check.R inflammation-*
  #
  args <- commandArgs(trailingOnly = TRUE)
  first_file <- read.csv(args[1], header = FALSE)
  first_dim <- dim(first_file)
#   num_rows <- dim(first_file)[1]  # nrow(first_file)
#   num_cols <- dim(first_file)[2]  # ncol(first_file)
  for (filename in args[-1]) {
    new_file <- read.csv(filename, header = FALSE)
    new_dim <- dim(new_file)
    if (new_dim[1] != first_dim[1] | new_dim[2] != first_dim[2]) {
      cat("Not all the data files have the same dimensions.")
    }
  }
}
main()
Handling Command-Line Flags
The next step is to teach our program to pay attention to the
--min, --mean, and --max flags.
These always appear before the names of the files, so let’s save the
following in readings-04.R:
OUTPUT
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  action <- args[1]
  filenames <- args[-1]
  for (f in filenames) {
    dat <- read.csv(file = f, header = FALSE)
    if (action == "--min") {
      values <- apply(dat, 1, min)
    } else if (action == "--mean") {
      values <- apply(dat, 1, mean)
    } else if (action == "--max") {
      values <- apply(dat, 1, max)
    }
    cat(values, sep = "\n")
  }
}
main()
And we can confirm this works by running it from the Unix Shell:
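For example, asking for the maximum of the first small file:
BASH
Rscript readings-04.R --max data/small-01.csv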
OUTPUT
1
2
but there are several things wrong with it:
- main is too large to read comfortably.
- If action isn’t one of the three recognized flags, the program loads each file but does nothing with it (because none of the branches in the conditional match). Silent failures like this are always hard to debug.
This version pulls the processing of each file out of the loop into a
function of its own. It also uses stopifnot to check that
action is one of the allowed flags before doing any
processing, so that the program fails fast. We’ll save it as
readings-05.R:
OUTPUT
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  action <- args[1]
  filenames <- args[-1]
  stopifnot(action %in% c("--min", "--mean", "--max"))
  for (f in filenames) {
    process(f, action)
  }
}
process <- function(filename, action) {
  dat <- read.csv(file = filename, header = FALSE)
  if (action == "--min") {
    values <- apply(dat, 1, min)
  } else if (action == "--mean") {
    values <- apply(dat, 1, mean)
  } else if (action == "--max") {
    values <- apply(dat, 1, max)
  }
  cat(values, sep = "\n")
}
main()
This is four lines longer than its predecessor, but broken into more digestible chunks of 8 and 12 lines.
Parsing Command-Line Flags
R has a package named argparse that helps handle complex command-line flags (it utilizes a Python module of the same name). We will not cover this package in this lesson but when you start writing programs with multiple parameters you’ll want to read through the package’s vignette.
Shorter Command Line Arguments
Rewrite this program so that it uses -n,
-m, and -x instead of --min,
--mean, and --max respectively. Is the code
easier to read? Is the program easier to understand?
Separately, modify the program so that if no action is specified (or an incorrect action is given), it prints a message explaining how it should be used.
OUTPUT
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  action <- args[1]
  filenames <- args[-1]
  stopifnot(action %in% c("-n", "-m", "-x"))
  for (f in filenames) {
    process(f, action)
  }
}
process <- function(filename, action) {
  dat <- read.csv(file = filename, header = FALSE)
  if (action == "-n") {
    values <- apply(dat, 1, min)
  } else if (action == "-m") {
    values <- apply(dat, 1, mean)
  } else if (action == "-x") {
    values <- apply(dat, 1, max)
  }
  cat(values, sep = "\n")
}
main()
The program is neither easier to read nor easier to understand due to the ambiguity of the argument names.
OUTPUT
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  action <- args[1]
  filenames <- args[-1]
  if (!(action %in% c("--min", "--mean", "--max"))) {
    usage()
  } else {
    for (f in filenames) {
      process(f, action)
    }
  }
}
process <- function(filename, action) {
  dat <- read.csv(file = filename, header = FALSE)
  if (action == "--min") {
    values <- apply(dat, 1, min)
  } else if (action == "--mean") {
    values <- apply(dat, 1, mean)
  } else if (action == "--max") {
    values <- apply(dat, 1, max)
  }
  cat(values, sep = "\n")
}
usage <- function() {
  cat("usage: Rscript readings-usage.R [--min, --mean, --max] filenames", sep = "\n")
}
main()
Handling Standard Input
The next thing our program has to do is read data from standard input
if no filenames are given so that we can put it in a pipeline, redirect
input to it, and so on. Let’s experiment in another script, which we’ll
save as count-stdin.R:
OUTPUT
lines <- readLines(con = file("stdin"))
count <- length(lines)
cat("lines in standard input: ")
cat(count, sep = "\n")
This little program reads lines from the program’s standard input
using file("stdin"). This allows us to do almost anything
with it that we could do to a regular file. In this example, we passed
it as an argument to the function readLines, which stores
each line as an element in a vector. Let’s try running it from the Unix
Shell as if it were a regular command-line program:
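For example, redirecting one of the small files to its standard input:
BASH
Rscript count-stdin.R < data/small-01.csv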
OUTPUT
lines in standard input: 2
Note that because we did not specify sep = "\n" when
calling cat, the output is written on the same line.
A common mistake is to try to run something that reads from standard input like this:
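For example (a sketch of the mistaken invocation):
BASH
Rscript count-stdin.R data/small-01.csv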
i.e., to forget the < character that redirects the
file to standard input. In this case, there’s nothing in standard input,
so the program waits at the start of the loop for someone to type
something on the keyboard. We can type some input, but R keeps running
because it doesn’t know when the standard input has ended. If you ran
this, you can pause R by typing Ctrl+Z
(technically it is still paused in the background; if you want to fully
kill the process type kill %; see bash
manual for more information).
We now need to rewrite the program so that it loads data from
file("stdin") if no filenames are provided. Luckily,
read.csv can handle either a filename or an open file as
its first parameter, so we don’t actually need to change
process. That leaves main, which we’ll update
and save as readings-06.R:
OUTPUT
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  action <- args[1]
  filenames <- args[-1]
  stopifnot(action %in% c("--min", "--mean", "--max"))
  if (length(filenames) == 0) {
    process(file("stdin"), action)
  } else {
    for (f in filenames) {
      process(f, action)
    }
  }
}
process <- function(filename, action) {
  dat <- read.csv(file = filename, header = FALSE)
  if (action == "--min") {
    values <- apply(dat, 1, min)
  } else if (action == "--mean") {
    values <- apply(dat, 1, mean)
  } else if (action == "--max") {
    values <- apply(dat, 1, max)
  }
  cat(values, sep = "\n")
}
main()
Let’s try it out. Instead of calculating the mean inflammation of every patient, we’ll only calculate the mean for the first 10 patients (rows):
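Using head to pipe the first 10 rows into our script, for example:
BASH
head -10 data/inflammation-01.csv | Rscript readings-06.R --mean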
OUTPUT
5.45
5.425
6.1
5.9
5.55
6.225
5.975
6.65
6.625
6.525
And now we’re done: the program now does everything we set out to do.
Implementing wc in R
Write a program called line-count.R that works like the
Unix wc command:
- If no filenames are given, it reports the number of lines in standard input.
 - If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.
 
OUTPUT
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  if (length(args) > 0) {
    total_lines <- 0
    for (filename in args) {
      input <- readLines(filename)
      num_lines <- length(input)
      cat(filename)
      cat(" ")
      cat(num_lines, sep = "\n")
      total_lines <- total_lines + num_lines
    }
    if (length(args) > 1) {
      cat("Total ")
      cat(total_lines, sep = "\n")
    }
  } else {
    input <- readLines(file("stdin"))
    num_lines <- length(input)
    cat(num_lines, sep = "\n")
  }
}
main()
- Use commandArgs(trailingOnly = TRUE) to obtain a vector of the command-line arguments that a program was run with.
- Avoid silent failures.
- Use file("stdin") to connect to a program’s standard input.
- Use cat(vec, sep = "\n") to write the elements of vec to standard output, one per line.
Content from Best Practices for Writing R Code
Last updated on 2025-10-07 | Edit this page
Estimated time: 15 minutes
Overview
Questions
- How can I write R code that other people can understand and use?
 
Objectives
- Describe changes made to code for future reference.
- Understand the importance of being explicit about the requirements and dependencies of your code.
- Know when to use setwd().
- Identify and segregate distinct components in your code using # or #-.
 - Additional best practice recommendations.
 
Keep track of who wrote your code and its intended purpose
Starting your code with an annotated description of what the code does when it is run will help you when you have to look at or change it in the future. Just one or two lines at the beginning of the file can save you or someone else a lot of time and effort when trying to understand what a particular script does.
R
# This is code to replicate the analyses and figures from my 2014 Science
# paper. Code developed by Sarah Supp, Tracy Teal, and Jon Borelli
Also consider version-controlling your code, as taught for example in Software Carpentry’s git-novice lessons. Frequent commits with explanatory, synoptic commit messages are an elegant way to comment code. RStudio supports this with its Git integration.
Be explicit about the requirements and dependencies of your code
Loading all of the packages that your code needs (using library) at the start of your script is a nice way of indicating which packages are necessary to run it. It can be frustrating to make it two-thirds of the way through a long-running script only to find out that a dependency hasn’t been installed.
R
library(ggplot2)
library(reshape)
library(vegan)
Another way you can be explicit about the requirements of your code and improve its reproducibility is to limit the “hard-coding” of the input and output files for your script. If your code will read in data from a file, define a variable early in your code that stores the path to that file. For example
R
input_file <- "data/data.csv"
output_file <- "data/results.csv"
# read input
input_data <- read.csv(input_file)
# get number of samples in data
sample_number <- nrow(input_data)
# generate results
results <- some_other_function(input_file, sample_number)
# write results
write.table(results, output_file)
is preferable to
R
# read input
input_data <- read.csv("data/data.csv")
# get number of samples in data
sample_number <- nrow(input_data)
# generate results
results <- some_other_function("data/data.csv", sample_number)
# write results
write.table(results, "data/results.csv")
It is also worth considering what the working directory is. If the working directory must change, it is best to do that at the beginning of the script.
Be careful when using
setwd()
One should exercise caution when using setwd(). Changing
directories in a script file can limit reproducibility:
- setwd() will return an error if the directory to which you’re trying to change doesn’t exist or if the user doesn’t have the correct permissions to access that directory. This becomes a problem when sharing scripts between users who have organized their directories differently.
- If/when your script terminates with an error, you might leave the user in a different directory than the one they started in, and if they then call the script again, this will cause further problems. If you must use setwd(), it is best to put it at the top of the script to avoid these problems. The following error message indicates that R has failed to set the working directory you specified:
Error in setwd("~/path/to/working/directory") : cannot change working directory
It is best practice to have the user running the script begin in a consistent directory on their machine and then use relative file paths from that directory to access files (see below).
Identify and segregate distinct components in your code
It’s easy to annotate and mark your code using # or #- to set off sections of your code and to make finding specific parts of your code easier. For example, it’s often helpful when writing code to separate the function definitions. If you create only one or a few custom functions in your script, put them toward the top of your code. If you have written many functions, put them all in their own .R file and then source those files. source will define all of these functions so that your code can make use of them as needed.
R
source("my_genius_fxns.R")
Other ideas
- Use a consistent style within your code. For example, name all matrices something ending in _mat. Consistency makes code easier to read and problems easier to spot.
- Keep your code in bite-sized chunks. If a single function or loop gets too long, consider looking for ways to break it into smaller pieces.
- Don’t repeat yourself–automate! If you are repeating the same code over and over, use a loop or a function to repeat that code for you. Needless repetition doesn’t just waste time–it also increases the likelihood you’ll make a costly mistake!
- Keep all of your source files for a project in the same directory, then use relative paths as necessary to access them. For example, use
R
dat <- read.csv(file = "files/dataset-2013-01.csv", header = TRUE)
rather than:
R
dat <- read.csv(file = "/Users/Karthik/Documents/sannic-project/files/dataset-2013-01.csv", header = TRUE)
- R can run into memory issues. It is a common problem to run out of
memory after running R scripts for a long time. To inspect the objects
in your current R environment, you can list the objects, search current
packages, and remove objects that are currently not in use. A good
practice when running long lines of computationally intensive code is to
remove temporary objects after they have served their purpose. However,
sometimes, R will not clean up unused memory for a while after you
delete objects. You can force R to tidy up its memory by using
gc(). 
R
# Sample dataset of 1000 rows
interim_object <- data.frame(rep(1:100, 10),
                             rep(101:200, 10),
                             rep(201:300, 10))
object.size(interim_object) # Reports the memory size allocated to the object
rm("interim_object") # Removes only the object itself and not necessarily the memory allotted to it
gc() # Force R to release memory it is no longer using
ls() # Lists all the objects in your current workspace
rm(list = ls()) # If you want to delete all the objects in the workspace and start with a clean slate
- Don’t save a session history (the default option in R, when it asks if you want an RData file). Instead, start in a clean environment so that older objects don’t remain in your environment any longer than they need to. If that happens, it can lead to unexpected results.
- Wherever possible, keep track of sessionInfo() somewhere in your project folder. Session information is invaluable because it captures all of the packages used in the current project. If a newer version of a package changes the way a function behaves, you can always go back and reinstall the version that worked (Note: At least on CRAN, all older versions of packages are permanently archived).
- Collaborate. Grab a buddy and practice “code review”. Review is used for preparing experiments and manuscripts; why not use it for code as well? Our code is also a major scientific achievement and the product of lots of hard work! Reviews are built into GitHub’s Pull request feature.
Best Practice
- What other suggestions do you have for coding best practices?
 - What are some specific ways we could restructure the code we worked on today to make it easier for a new user to read? Discuss with your neighbor.
- Make two new R scripts called inflammation.R and inflammation_fxns.R. Copy and paste code into each script so that inflammation.R “does stuff” and inflammation_fxns.R holds all of your functions. Hint: you will need to add source to one of the files.
OUTPUT
# This code runs the inflammation data analysis.
source("inflammation_fxns.R")
analyze_all("inflammation.*csv")
OUTPUT
# This is code for functions used in our inflammation data analysis.
analyze <- function(filename, output = NULL) {
  # Plots the average, min, and max inflammation over time.
  # Input:
  #    filename: character string of a csv file
  #    output: character string of pdf file for saving
  if (!is.null(output)) {
    pdf(output)
  }
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
  if (!is.null(output)) {
    dev.off()
  }
}
analyze_all <- function(pattern) {
  # Directory name containing the data
  data_dir <- "data"
  # Directory name for results
  results_dir <- "results"
  # Runs the function analyze for each file in the current working directory
  # that contains the given pattern.
  filenames <- list.files(path = data_dir, pattern = pattern)
  for (f in filenames) {
    pdf_name <- file.path(results_dir, sub("csv", "pdf", f))
    analyze(file.path(data_dir, f), output = pdf_name)
  }
}
- Start each program with a description of what it does.
 - Then load all required packages.
 - Consider what working directory you are in when sourcing a script.
 - Use comments to mark off sections of code.
 - Put function definitions at the top of your file, or in a separate file if there are many.
 - Name and style code consistently.
 - Break code into small, discrete pieces.
 - Factor out common operations rather than repeating them.
 - Keep all of the source files for a project in one directory and use relative paths to access them.
 - Keep track of the memory used by your program.
 - Always start with a clean environment instead of saving the workspace.
 - Keep track of session information in your project folder.
 - Have someone else review your code.
 - Use version control.
 
Content from Dynamic Reports with knitr
Last updated on 2025-10-07 | Edit this page
Estimated time: 25 minutes
Overview
Questions
- How can I put my text, code, and results all in one document?
- How do I use knitr?
- How do I write in Markdown?
 
Objectives
- Understand the value of knitr for generating dynamic documents that include text, code, and results.
 - Be able to create, edit, and compile an .Rmd document including code chunks and inline code.
 
knitr is an R package that allows you to organize your
notes, code, and results in a single document. It’s a great tool for
“literate programming” – the idea that your code should be readable by
humans as well as computers! It also keeps your writing and results
together, so if you collect some new data or change how you clean the
data, you just have to re-compile the document and you’re up to
date!
You write knitr documents in a simple plain text-like
format called markdown, which allows you to format text using intuitive
notation, so that you can focus on the content you’re writing and generate a well-formatted document when needed. In fact, you can turn
your plain text (and R code and results) into an HTML file or, if you
have an installation of LaTeX and Pandoc on your machine, a PDF, or even a
Word document (if you must!).
To get started, install the knitr package.
R
install.packages("knitr")
To create a new knitr document in RStudio, click on File
-> New File, and then select “R Markdown…” from the dropdown list.
Accept the default options in the dialog box that follows (but note the
other document formats to choose from, including presentations). Save
the file and click on the dropdown arrow next to the “Knit” button at
the top of the script window. This will bring up another dropdown menu,
this time with a number of options for controlling knitr
output. Select the “Knit to HTML” option to create an HTML document
based on the markdown file. Compare the output to the source.
Formatting Text in Markdown
Visit https://rmarkdown.rstudio.com/authoring_basics.html and briefly check out some of the formatting options.
In the example document add
- Headers using #
- Emphasis using asterisks: *italics* and **bold**
- Lists using * and numbered lists using 1., 2., etc.
- Bonus: Create a table
 
Markdown also supports LaTeX equation editing. We can display pretty
equations by enclosing them in $, e.g.,
$\alpha = \dfrac{1}{(1 - \beta)^2}$ renders as: \(\alpha = \dfrac{1}{(1 - \beta)^2}\)
The top of the source (.Rmd) file has some header
material in YAML format (enclosed by triple dashes). Some of this gets
displayed in the output header, other of it provides formatting
information to the conversion engine.
To distinguish R code from text, RMarkdown uses three back-ticks followed by {r} to delimit a “code chunk”. In RStudio,
the keyboard shortcut to create a code chunk is
Command+Option+I or
Ctrl+Alt+I.
A code chunk will set off the code and its results in the output document, but you can also print the results of code within a text block by enclosing the code in single back-ticks with an r prefix, like so: `r code-here`.
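As a sketch (the data file and the calculations are only illustrative), the source of an .Rmd file combining a code chunk and an inline expression might look like this:

```{r}
dat <- read.csv("data/inflammation-01.csv", header = FALSE)
plot(apply(dat, 2, mean))
```

The median inflammation on day 1 was `r median(dat[, 1])`.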
Use knitr to Produce a Report
- Open a new .Rmd script and save it as inflammation_report.Rmd
 - Copy code from earlier into code chunks to read the inflammation data and plot average inflammation.
 - Add a few notes describing what the code does and what the main findings are. Include an in-line calculation of the median inflammation level.
- knit the document and view the html result.
- Use knitr to generate reports that combine text, code, and results.
 - Use Markdown to format text.
 - Put code in blocks delimited by triple back quotes followed by
{r}. 
Content from Making Packages in R
Last updated on 2025-10-07 | Edit this page
Estimated time: 35 minutes
Overview
Questions
- How do I collect my code together so I can reuse it and share it?
 - How do I make my own packages?
 
Objectives
- Describe the required structure of R packages.
 - Create the required structure of a simple R package.
 - Write documentation comments that can be automatically compiled to R’s native help and documentation format.
 
Why should you make your own R packages?
Reproducible research!
An R package is the basic unit of reusable code. If you want to reuse code later or want others to be able to use your code, you should put it in a package.
An R package requires four components, visualized in the folder tree below:
package_name/
├── DESCRIPTION              # metadata about the package
├── man/                     # function documentation (can be generated automatically)
├── NAMESPACE                # list of user-level functions in the package (can be generated automatically)
├── R/                       # R source code
└── <other_components>
There are other optional components. The rOpenSci community has written a science-focused guidebook for package development, while the “R packages” book contains all the fundamental information.
DESCRIPTION file
Package: package_name
Title: Brief package description
Description: Longer package description
Version: Version number (major.minor.patch)
Author: Name and email of package creator
Maintainer: Name and email of package maintainer (who to contact with issues)
License: Abbreviation for an open source license
The package name can only contain letters and numbers and has to start with a letter.
.R files
Functions don’t all have to be in one file or each in separate files. How you organize them is up to you. Suggestion: organize in a logical manner so that you know which file holds which functions.
Making your first R package
Let’s turn our temperature conversion functions into an R package.
R
fahrenheit_to_celsius <- function(temp_F) {
  # Converts Fahrenheit to Celsius
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}
R
celsius_to_kelvin <- function(temp_C) {
  # Converts Celsius to Kelvin
  temp_K <- temp_C + 273.15
  return(temp_K)
}
R
fahrenheit_to_kelvin <- function(temp_F) {
  # Converts Fahrenheit to Kelvin using fahrenheit_to_celsius() and celsius_to_kelvin()
  temp_C <- fahrenheit_to_celsius(temp_F)
  temp_K <- celsius_to_kelvin(temp_C)
  return(temp_K)
}
We will use the devtools and roxygen2 packages, which make creating packages in R relatively simple. Both can be installed from CRAN like this:
R
install.packages(c("devtools", "roxygen2"))  # installations can be `c`ombined
library("devtools")
library("roxygen2")
Set your working directory, and then use the create_package function to start making your package. Keep the name simple and unique.
- package_to_convert_temperatures_between_kelvin_fahrenheit_and_celsius (BAD)
 - tempConvert (GOOD)
 
R
setwd(parentDirectory)
create_package("tempConvert")
Add our functions to the R directory. Place each function into a separate R script and add documentation like this:
R
#' Converts Fahrenheit to Celsius
#'
#' This function converts input temperatures in Fahrenheit to Celsius.
#' @param temp_F The temperature in Fahrenheit.
#' @return The temperature in Celsius.
#' @export
#' @examples
#' fahrenheit_to_celsius(32)
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}
The roxygen2 package reads lines that begin with
#' as comments to create the documentation for your
package. Descriptive tags are preceded with the @ symbol.
For example, @param has information about the input
parameters for the function. Now, we will use roxygen2 to
convert our documentation to the standard R format.
R
setwd("./tempConvert")
document()
Take a look at the package directory now. The /man directory has a .Rd file for each .R file with properly formatted documentation.
Overall, your package directory should look something like this:
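A rough sketch of the expected layout (exact file names depend on how you split your scripts; here we assume one .R file per function, as suggested above):
tempConvert/
├── DESCRIPTION
├── NAMESPACE
├── R/
│   ├── celsius_to_kelvin.R
│   ├── fahrenheit_to_celsius.R
│   └── fahrenheit_to_kelvin.R
└── man/
    ├── celsius_to_kelvin.Rd
    ├── fahrenheit_to_celsius.Rd
    └── fahrenheit_to_kelvin.Rd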
In the diagram above, your package is a folder (the same as it would exist on your computer!). This folder contains several files and sub-folders. For example, the folder ‘R’ contains all of the R scripts you just wrote above. Now, let’s load the package and take a look at the documentation.
R
setwd("..")
install("tempConvert")
?fahrenheit_to_kelvin
Notice there is now a tempConvert environment that is the parent environment to the global environment.
R
search()
Now that our package is loaded, let’s try out some of the functions.
R
fahrenheit_to_celsius(32)
OUTPUT
[1] 0
R
celsius_to_kelvin(-273.15)
OUTPUT
[1] 0
R
fahrenheit_to_kelvin(-459.67)
OUTPUT
[1] 0
Creating a Package for Distribution
- Create some new functions for your tempConvert package to convert from Celsius to Fahrenheit or from Kelvin to Celsius or Fahrenheit.
- Create a package for our analyze function so that it will be easy to load when more data arrives.
R
#' Converts Kelvin to Celsius
#'
#' This function converts input temperatures in Kelvin to Celsius.
#' @param temp_K The temperature in Kelvin.
#' @return The temperature in Celsius.
#' @export
#' @examples
#' kelvin_to_celsius(273.15)
kelvin_to_celsius <- function(temp_K) {
  temp_C <- temp_K - 273.15
  temp_C
}
R
#' Converts Celsius to Fahrenheit
#'
#' This function converts input temperatures in Celsius to Fahrenheit.
#' @param temp_C The temperature in Celsius.
#' @return The temperature in Fahrenheit.
#' @export
#' @examples
#' celsius_to_fahrenheit(0)
celsius_to_fahrenheit <- function(temp_C) {
  temp_F <- (temp_C * 9/5) + 32
  temp_F
}
R
#' Converts Kelvin to Fahrenheit
#'
#' This function converts input temperatures in Kelvin to Fahrenheit.
#' @param temp_K The temperature in Kelvin.
#' @return The temperature in Fahrenheit.
#' @export
#' @examples
#' kelvin_to_fahrenheit(273.15)
kelvin_to_fahrenheit <- function(temp_K) {
  temp_C <- kelvin_to_celsius(temp_K)
  temp_F <- celsius_to_fahrenheit(temp_C)
  temp_F
}
- A package is the basic unit of reusability in R.
 - Every package must have a DESCRIPTION file and an R directory containing code. These are created by us.
 - A NAMESPACE file is needed as well, and a man directory containing documentation, but both can be autogenerated.
 
Content from Introduction to RStudio
Last updated on 2025-10-07 | Edit this page
Estimated time: 15 minutes
Overview
Questions
- How do I use the RStudio graphical user interface?
 
Objectives
- Get familiar with RStudio interface.
 - Understand the difference between script file and console.
 - Demonstrate useful shortcuts.
 
Let’s start by learning about our tool.
- Point to the different panels: Console, Scripts, Environments, Plots.
 - Code and workflow are more reproducible if we can document everything that we do.
 - Our end goal is not just to “do stuff” but to do it in a way that anyone can easily and exactly replicate our workflow and results.
 - The best way to achieve this is to write scripts. RStudio provides an environment that allows you to do that.
 
Interacting with R
There are two main ways of interacting with R: using the console or by using script files (plain text files that contain your code).
The console window (in RStudio, the bottom left panel) is the place where R is waiting for you to tell it what to do, and where it will show the results of a command. You can type commands directly into the console, but they will be forgotten when you close the session. It is better to enter the commands in the script editor and save the script. This way, you have a complete record of what you did, you can easily show others how you did it, and you can do it again later on if needed. You can copy-paste into the R console, but the RStudio script editor allows you to 'send' the current line or the currently selected text to the R console using the Ctrl+Return shortcut.
At some point in your analysis you may want to check the content of a variable or the structure of an object, without necessarily keeping a record of it in your script. You can type these commands directly in the console. RStudio provides the Ctrl+1 and Ctrl+2 shortcuts, which allow you to jump between the script and the console windows.
If R is ready to accept commands, the R console shows a
> prompt. If it receives a command (by typing,
copy-pasting or sent from the script editor using
Ctrl+Return), R will try to execute it, and when
ready, show the results and come back with a new
>-prompt to wait for new commands.
If R is still waiting for you to enter more text because the command isn't complete yet, the console will show a + prompt. It means that you haven't finished entering a complete command, usually because you have not 'closed' a parenthesis or quotation mark. If you're in RStudio and this happens, click inside the console window and press Esc; this should help you out of trouble.
Commenting
Use # signs to comment. Comment liberally in your R scripts. Anything to the right of a # is ignored by R when the code is run. Comments remind you of what the code you write does, and they help others understand its purpose without having to read the code itself. To comment or uncomment multiple lines at once in RStudio, select the lines and press Ctrl+Shift+C.
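A small sketch of both styles of comment:
R
# This whole line is a comment and is not run
mean(1:10)  # everything after the '#' on this line is ignored too
OUTPUT
[1] 5.5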
Also remember that expressive naming of functions, arguments and variables, as well as #'-prefixed roxygen documentation, document your code. Consider free-form # comments a last resort.
Assignment Operator
<- is the assignment operator. It assigns values on
the right to objects on the left. So, after executing
x <- 3, the value of x is 3.
The arrow can be read as 3 goes into x.
You can also use = for assignments, but not in all contexts, so it is good practice to use <- for assignments. = should only be used to specify the values of arguments in function calls; see the example below.
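A short sketch of the distinction:
R
x <- 3                   # assign the value 3 to the object x
log(x = 100, base = 10)  # here '=' just names the arguments of log()
OUTPUT
[1] 2
R
x                        # x itself is unchanged by the function call
OUTPUT
[1] 3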
In RStudio, typing Alt+- (push Alt, the key next to your space bar, at the same time as the - key) will write <- in a single keystroke.
- Using RStudio can make programming in R much more productive.
 
Content from Addressing Data
Last updated on 2025-10-07 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- What are the different methods for accessing parts of a data frame?
 
Objectives
- Understand the three different ways R can address data inside a data frame.
 - Combine different methods for addressing data with the assignment operator to update subsets of data.
 
R is a powerful language for data manipulation. There are three main ways for addressing data inside R objects.
- By index (subsetting)
 - By logical vector
 - By name
 
Let's start by loading some sample data:
R
dat <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE)
Interpreting Rows as Headers
The first row of this csv file is a list of column names. We used the
header = TRUE argument to read.csv so that R can
interpret the file correctly. We are using the stringsAsFactors = FALSE argument so that character columns are not converted to factors (this is the default behaviour from R version 4.0 onwards, but we state it explicitly for clarity). Using factors in R is covered in a separate lesson.
Let's take a look at this data.
R
class(dat)
OUTPUT
[1] "data.frame"
R has loaded the contents of the .csv file into a variable called
dat which is a data frame.
We can compactly display the internal structure of a data frame using
the structure function str.
R
str(dat)
OUTPUT
'data.frame':	100 obs. of  9 variables:
 $ ID           : chr  "Sub001" "Sub002" "Sub003" "Sub004" ...
 $ Gender       : chr  "m" "m" "m" "f" ...
 $ Group        : chr  "Control" "Treatment2" "Treatment2" "Treatment1" ...
 $ BloodPressure: int  132 139 130 105 125 112 173 108 131 129 ...
 $ Age          : num  16 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 ...
 $ Aneurisms_q1 : int  114 148 196 199 188 260 135 216 117 188 ...
 $ Aneurisms_q2 : int  140 209 251 140 120 266 98 238 215 144 ...
 $ Aneurisms_q3 : int  202 248 122 233 222 320 154 279 181 192 ...
 $ Aneurisms_q4 : int  237 248 177 220 228 294 245 251 272 185 ...
The str function tells us that the data has 100 rows and 9 columns. It also tells us that the data frame is made up of character (chr), integer (int) and numeric vectors.
R
head(dat)
OUTPUT
      ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
1 Sub001      m    Control           132 16.0          114          140
2 Sub002      m Treatment2           139 17.2          148          209
3 Sub003      m Treatment2           130 19.5          196          251
4 Sub004      f Treatment1           105 15.7          199          140
5 Sub005      m Treatment1           125 19.9          188          120
6 Sub006      M Treatment2           112 14.3          260          266
  Aneurisms_q3 Aneurisms_q4
1          202          237
2          248          248
3          122          177
4          233          220
5          222          228
6          320          294
The data are the results of a fictional experiment, looking at the number of aneurysms that formed in the eyes of patients who underwent 3 different treatments.
Addressing by Index
Data can be accessed by index. We have already seen how square
brackets [ can be used to subset data (sometimes also
called “slicing”). The generic format is
dat[row_numbers,column_numbers].
Selecting Values
What will be returned by dat[1, 1]? Think about the
number of rows and columns you would expect as the result.
R
dat[1, 1]
OUTPUT
[1] "Sub001"
If we leave out a dimension R will interpret this as a request for all values in that dimension.
Selecting More Values
What will be returned by dat[, 2]?
R
dat[, 2]
OUTPUT
  [1] "m" "m" "m" "f" "m" "M" "f" "m" "m" "f" "m" "f" "f" "m" "m" "m" "f" "m"
 [19] "m" "F" "f" "m" "f" "f" "m" "M" "M" "f" "m" "f" "f" "m" "m" "m" "m" "f"
 [37] "f" "m" "M" "m" "f" "m" "m" "m" "f" "f" "M" "M" "m" "m" "m" "f" "f" "f"
 [55] "m" "f" "m" "m" "m" "f" "f" "f" "f" "M" "f" "m" "f" "f" "M" "m" "m" "m"
 [73] "F" "m" "m" "f" "M" "M" "M" "f" "m" "M" "M" "m" "m" "f" "f" "f" "m" "m"
 [91] "f" "m" "F" "f" "m" "m" "F" "m" "M" "M"
The colon : can be used to create a sequence of
integers.
R
6:9
OUTPUT
[1] 6 7 8 9
This creates a vector of the numbers from 6 to 9, which can be very useful for addressing data.
Subsetting with Sequences
Use the colon operator to index just the aneurism count data (columns 6 to 9).
R
dat[, 6:9]
OUTPUT
    Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4
1            114          140          202          237
2            148          209          248          248
3            196          251          122          177
4            199          140          233          220
5            188          120          222          228
6            260          266          320          294
7            135           98          154          245
8            216          238          279          251
9            117          215          181          272
10           188          144          192          185
11           134          155          247          223
12           152          177          323          245
13           112          220          225          195
14           109          150          177          189
15           146          140          239          223
16            97          172          203          207
17           165          157          200          193
18           158          265          243          187
19           178          109          206          182
20           107          188          167          218
21           174          160          203          183
22            97          110          194          133
23           187          239          281          214
24           188          191          256          265
25           114          199          242          195
26           115          160          158          228
27           128          249          294          315
28           112          230          281          126
29           136          109          105          155
30           103          148          219          228
31           132          151          234          162
32           118          154          260          160
33           166          176          253          233
34           152          105          197          299
35           191          148          166          185
36           152          178          158          170
37           161          270          232          284
38           239          184          317          269
39           132          137          193          206
40           168          255          273          274
41           140          184          239          202
42           166           85          179          196
43           141          160          179          239
44           161          168          212          181
45           103          111          254          126
46           231          240          260          310
47           192          141          180          225
48           178          180          169          183
49           167          123          236          224
50           135          150          208          279
51           150          166          153          204
52           192           80          138          222
53           153          153          236          216
54           205          264          269          207
55           117          194          216          211
56           199          119          183          251
57           182          129          226          218
58           180          196          250          294
59           111          111          244          201
60           101           98          178          116
61           166          167          232          241
62           158          171          237          212
63           189          178          177          238
64           189          101          193          172
65           239          189          297          300
66           185          224          151          182
67           224          112          304          288
68           104          139          211          204
69           222          199          280          196
70           107           98          204          138
71           153          255          218          234
72           118          165          220          227
73           102          184          246          222
74           188          125          191          157
75           180          283          204          298
76           178          214          291          240
77           168          184          184          229
78           118          170          249          249
79           169          114          248          233
80           156          138          218          258
81           232          211          219          246
82           188          108          180          136
83           169          168          180          211
84           241          233          292          182
85            65          207          234          235
86           225          185          195          235
87           104          116          173          221
88           179          158          216          244
89           103          140          209          186
90           112          130          175          191
91           226          170          307          244
92           228          221          316          259
93           209          142          199          184
94           153          104          194          214
95           111          118          173          191
96           148          132          200          194
97           141          196          322          273
98           193          112          123          181
99           130          226          286          281
100          126          157          129          160
Finally we can use the c() (combine) function to address
non-sequential rows and columns.
R
dat[c(1, 5, 7, 9), 1:5]
OUTPUT
      ID Gender      Group BloodPressure  Age
1 Sub001      m    Control           132 16.0
5 Sub005      m Treatment1           125 19.9
7 Sub007      f    Control           173 17.7
9 Sub009      m Treatment2           131 19.4
This returns the first 5 columns for the patients in rows 1, 5, 7 and 9.
Subsetting Non-Sequential Data
Write code to return the age and gender values for the first 5 patients.
R
dat[1:5, c(5, 2)]
OUTPUT
   Age Gender
1 16.0      m
2 17.2      m
3 19.5      m
4 15.7      f
5 19.9      m
Addressing by Name
Columns in an R data frame are named.
R
colnames(dat)
OUTPUT
[1] "ID"            "Gender"        "Group"         "BloodPressure"
[5] "Age"           "Aneurisms_q1"  "Aneurisms_q2"  "Aneurisms_q3"
[9] "Aneurisms_q4" 
Default Names
If column names are not specified, e.g. by using header = FALSE in a read.csv() call, R assigns the default names V1, V2, ..., Vn.
We usually use the $ operator to address a column by
name
R
dat$Gender
OUTPUT
  [1] "m" "m" "m" "f" "m" "M" "f" "m" "m" "f" "m" "f" "f" "m" "m" "m" "f" "m"
 [19] "m" "F" "f" "m" "f" "f" "m" "M" "M" "f" "m" "f" "f" "m" "m" "m" "m" "f"
 [37] "f" "m" "M" "m" "f" "m" "m" "m" "f" "f" "M" "M" "m" "m" "m" "f" "f" "f"
 [55] "m" "f" "m" "m" "m" "f" "f" "f" "f" "M" "f" "m" "f" "f" "M" "m" "m" "m"
 [73] "F" "m" "m" "f" "M" "M" "M" "f" "m" "M" "M" "m" "m" "f" "f" "f" "m" "m"
 [91] "f" "m" "F" "f" "m" "m" "F" "m" "M" "M"
When we extract a single column from a data frame using the
$ operator, R will return a vector of that column class and
not a data frame.
R
class(dat$Gender)
OUTPUT
[1] "character"
R
class(dat$BloodPressure)
OUTPUT
[1] "integer"
Named addressing can also be used in square brackets.
R
head(dat[, c('Age', 'Gender')])
OUTPUT
   Age Gender
1 16.0      m
2 17.2      m
3 19.5      m
4 15.7      f
5 19.9      m
6 14.3      M
Best Practice
Best practice is to address columns by name. Often, you will create or delete columns and the column position will change.
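A quick sketch of why this matters (the object name dat_no_gender below is just for illustration):
R
dat_no_gender <- dat[, -2]   # drop the second column (Gender)
colnames(dat)[5]             # "Age"
colnames(dat_no_gender)[5]   # "Aneurisms_q1" -- the positions have shifted
dat_no_gender$Age            # addressing by name still works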
Rows in an R data frame can also be named, and rows can also be
addressed by their names. 
 By default, row names are indices
(i.e. position of each row in the data frame):
R
rownames(dat)
OUTPUT
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"
 [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"
 [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36"
 [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48"
 [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60"
 [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72"
 [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"
 [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"
 [97] "97"  "98"  "99"  "100"
We can add row names as we read in the file with the
row.names parameter in read.csv.
In the following example, we choose the first column ID to
become the vector of row names of the data frame, with
row.names = 1.
R
dat2 <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE, row.names=1)
rownames(dat2)
OUTPUT
  [1] "Sub001" "Sub002" "Sub003" "Sub004" "Sub005" "Sub006" "Sub007" "Sub008"
  [9] "Sub009" "Sub010" "Sub011" "Sub012" "Sub013" "Sub014" "Sub015" "Sub016"
 [17] "Sub017" "Sub018" "Sub019" "Sub020" "Sub021" "Sub022" "Sub023" "Sub024"
 [25] "Sub025" "Sub026" "Sub027" "Sub028" "Sub029" "Sub030" "Sub031" "Sub032"
 [33] "Sub033" "Sub034" "Sub035" "Sub036" "Sub037" "Sub038" "Sub039" "Sub040"
 [41] "Sub041" "Sub042" "Sub043" "Sub044" "Sub045" "Sub046" "Sub047" "Sub048"
 [49] "Sub049" "Sub050" "Sub051" "Sub052" "Sub053" "Sub054" "Sub055" "Sub056"
 [57] "Sub057" "Sub058" "Sub059" "Sub060" "Sub061" "Sub062" "Sub063" "Sub064"
 [65] "Sub065" "Sub066" "Sub067" "Sub068" "Sub069" "Sub070" "Sub071" "Sub072"
 [73] "Sub073" "Sub074" "Sub075" "Sub076" "Sub077" "Sub078" "Sub079" "Sub080"
 [81] "Sub081" "Sub082" "Sub083" "Sub084" "Sub085" "Sub086" "Sub087" "Sub088"
 [89] "Sub089" "Sub090" "Sub091" "Sub092" "Sub093" "Sub094" "Sub095" "Sub096"
 [97] "Sub097" "Sub098" "Sub099" "Sub100"
We can now extract one or more rows using those row names:
R
dat2["Sub072", ]
OUTPUT
       Gender   Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
Sub072      m Control           116 17.4          118          165          220
       Aneurisms_q4
Sub072          227
R
dat2[c("Sub009", "Sub072"), ]
OUTPUT
       Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
Sub009      m Treatment2           131 19.4          117          215
Sub072      m    Control           116 17.4          118          165
       Aneurisms_q3 Aneurisms_q4
Sub009          181          272
Sub072          220          227
Note that row names must be unique!
For example, if we try to read in the data setting the Group column as row names, R will throw an error because values in that column are duplicated:
R
dat2 <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE, row.names=3)
ERROR
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
Addressing by Logical Vector
A logical vector contains only the special values TRUE
and FALSE.
R
c(TRUE, TRUE, FALSE, FALSE, TRUE)
OUTPUT
[1]  TRUE  TRUE FALSE FALSE  TRUE
Truth and Its Opposite
Note the values TRUE and FALSE are all
capital letters and are not quoted.
Logical vectors can be created using
relational operators
e.g. <, >, ==, !=, %in%.
R
x <- c(1, 2, 3, 11, 12, 13)
x < 10
OUTPUT
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
R
x %in% 1:10
OUTPUT
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
We can use logical vectors to select data from a data frame. This is often referred to as logical indexing.
R
index <- dat$Group == 'Control'
dat[index,]$BloodPressure
OUTPUT
 [1] 132 173 129  77 158  81 137 111 135 108 133 139 126 125  99 122 155 133  94
[20]  98  74 116  97 104 117  90 150 116 108 102
Often this operation is written as one line of code:
R
plot(dat[dat$Group == 'Control', ]$BloodPressure)
Using Logical Indexes
- Create a scatterplot showing BloodPressure for subjects not in the control group.
 - How many ways are there to index this set of subjects?
 
1. The code for such a plot:
 
R
plot(dat[dat$Group != 'Control', ]$BloodPressure)
2. In addition to 
dat$Group != 'Control', one could use
dat$Group %in% c("Treatment1", "Treatment2").
Combining Addressing and Assignment
The assignment operator <- can be combined with
addressing.
R
x <- c(1, 2, 3, 11, 12, 13)
x[x < 10] <- 0
x
OUTPUT
[1]  0  0  0 11 12 13
Updating a Subset of Values
In this dataset, values for Gender have been recorded as both
uppercase M, F and lowercase m, f. Combine the
addressing and assignment operations to convert all values to
lowercase.
R
dat[dat$Gender == 'M', ]$Gender <- 'm'
dat[dat$Gender == 'F', ]$Gender <- 'f'
- Data in data frames can be addressed by index (subsetting), by logical vector, or by name (columns always; rows as well, if row names have been set).
 - Use the $ operator to address a column by name.
Content from Reading and Writing CSV Files
Last updated on 2025-10-07 | Edit this page
Estimated time: 35 minutes
Overview
Questions
- How do I read data from a CSV file into R?
 - How do I write data to a CSV file?
 
Objectives
- Read in a .csv, and explore the arguments of the csv reader.
 - Write the altered data set to a new .csv, and explore the arguments.
 
The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), users often find it easier to save their spreadsheets in comma-separated values files (CSV) and then use R's built-in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the arguments that allow you to read and write the data correctly for your needs.
Read a .csv and Explore the Arguments
Let’s start by opening a .csv file containing information on the
speeds at which cars of different colors were clocked in 45 mph zones in
the four-corners states (car-speeds.csv). We will use the
built in read.csv(...) function call, which reads the
data in as a data frame, and assign the data frame to a variable (using
<-) so that it is stored in R’s memory. Then we will
explore some of the basic arguments that can be supplied to the
function. First, open the RStudio project containing the scripts and
data you were working on in episode ‘Analyzing Patient Data’.
R
# Import the data and look at the first six rows
carSpeeds <- read.csv(file = 'data/car-speeds.csv')
head(carSpeeds)
OUTPUT
  Color Speed     State
1  Blue    32 NewMexico
2   Red    45   Arizona
3  Blue    35  Colorado
4 White    34   Arizona
5   Red    25   Arizona
6  Blue    41   Arizona
Changing Delimiters
The default delimiter of the read.csv() function is a
comma, but you can use other delimiters by supplying the ‘sep’ argument
to the function (e.g., typing sep = ';' allows a semi-colon
separated file to be correctly imported - see ?read.csv()
for more information on this and other options for working with
different file types).
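For instance, a small sketch using the text argument of read.csv() (so no extra data file is needed) shows sep in action:
R
read.csv(text = "Color;Speed\nBlue;32\nRed;45", sep = ";")
OUTPUT
  Color Speed
1  Blue    32
2   Red    45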
The read.csv() call that created carSpeeds above will import the data, but we have not yet taken advantage of several handy arguments that can be helpful in loading the data in the format we want. Let's explore some of these arguments.
The header Argument
The default for read.csv(...) is to set the
header argument to TRUE. This means that the
first row of values in the .csv is set as header information (column
names). If your data set does not have a header, set the
header argument to FALSE:
R
# The first row of the data without setting the header argument:
carSpeeds[1, ]
OUTPUT
  Color Speed     State
1  Blue    32 NewMexico
R
# The first row of the data if the header argument is set to FALSE:
carSpeeds <- read.csv(file = 'data/car-speeds.csv', header = FALSE)
carSpeeds[1, ]
OUTPUT
     V1    V2    V3
1 Color Speed State
Clearly this is not the desired behavior for this data set, but it may be useful if you have a dataset without headers.
The stringsAsFactors Argument
In older versions of R (prior to 4.0) this was perhaps the most
important argument in read.csv(), particularly if you were
working with categorical data. This is because the default behavior of R
was to convert character strings
into factors, which may make it
difficult to do such things as replace values. It is important to be
aware of this behaviour, which we will demonstrate. For example, let’s
say we find out that the data collector was color blind, and
accidentally recorded green cars as being blue. In order to correct the
data set, let’s replace ‘Blue’ with ‘Green’ in the $Color
column:
R
# Here we will use R's `ifelse` function, in which we provide the test phrase,
# the outcome if the result of the test is 'TRUE', and the outcome if the
# result is 'FALSE'. We will also assign the results to the Color column,
# using '<-'
# First - reload the data, this time reading strings in as factors
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
OUTPUT
  [1] "Green" "1"     "Green" "5"     "4"     "Green" "Green" "2"     "5"
 [10] "4"     "4"     "5"     "Green" "Green" "2"     "4"     "Green" "Green"
 [19] "5"     "Green" "Green" "Green" "4"     "Green" "4"     "4"     "4"
 [28] "4"     "5"     "Green" "4"     "5"     "2"     "4"     "2"     "2"
 [37] "Green" "4"     "2"     "4"     "2"     "2"     "4"     "4"     "5"
 [46] "2"     "Green" "4"     "4"     "2"     "2"     "4"     "5"     "4"
 [55] "Green" "Green" "2"     "Green" "5"     "2"     "4"     "Green" "Green"
 [64] "5"     "2"     "4"     "4"     "2"     "Green" "5"     "Green" "4"
 [73] "5"     "5"     "Green" "Green" "Green" "Green" "Green" "5"     "2"
 [82] "Green" "5"     "2"     "2"     "4"     "4"     "5"     "5"     "5"
 [91] "5"     "4"     "4"     "4"     "5"     "2"     "5"     "2"     "2"
[100] "5"    
What happened?!? It looks like ‘Blue’ was replaced with ‘Green’, but every other color was turned into a number (as a character string, given the quote marks before and after). This is because the colors of the cars were loaded as factors, and the factor level was reported following replacement.
To see the internal structure, we can use another function,
str(). In this case, the dataframe’s internal structure
includes the format of each column, which is what we are interested in.
str() will be reviewed a little more in the lesson Data Types and Structures.
R
# Reload the data so Color is a factor again (the previous ifelse call converted it to character)
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)
str(carSpeeds)
OUTPUT
'data.frame':	100 obs. of  3 variables:
 $ Color: Factor w/ 5 levels " Red","Black",..: 3 1 3 5 4 3 3 2 5 4 ...
 $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
 $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...
We can see that the $Color and $State
columns are factors and $Speed is a numeric column.
Now, let’s load the dataset using
stringsAsFactors=FALSE, and see what happens when we try to
replace ‘Blue’ with ‘Green’ in the $Color column:
R
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = FALSE)
str(carSpeeds)
OUTPUT
'data.frame':	100 obs. of  3 variables:
 $ Color: chr  "Blue" " Red" "Blue" "White" ...
 $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
 $ State: chr  "NewMexico" "Arizona" "Colorado" "Arizona" ...
R
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
OUTPUT
  [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
 [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
 [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
 [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
 [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
 [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
 [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
 [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
 [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
 [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
 [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
[100] "White"
That's better! And we can see how the data are now read as character instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors = FALSE, as this is the default behavior.
The as.is Argument
This is an extension of the stringsAsFactors argument,
but gives you control over individual columns. For example, if we want
the colors of cars imported as strings, but we want the names of the
states imported as factors, we would load the data set as:
R
carSpeeds <- read.csv(file = 'data/car-speeds.csv', as.is = 1)
# Note, the 1 applies as.is to the first column only
Now we can see that if we try to replace ‘Blue’ with ‘Green’ in the
$Color column everything looks fine, while trying to
replace ‘Arizona’ with ‘Ohio’ in the $State column returns
the factor numbers for the names of states that we haven’t replaced:
R
str(carSpeeds)
OUTPUT
'data.frame':	100 obs. of  3 variables:
 $ Color: chr  "Blue" " Red" "Blue" "White" ...
 $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
 $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...
R
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
OUTPUT
  [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
 [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
 [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
 [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
 [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
 [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
 [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
 [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
 [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
 [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
 [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
[100] "White"
R
carSpeeds$State <- ifelse(carSpeeds$State == 'Arizona', 'Ohio', carSpeeds$State)
carSpeeds$State
OUTPUT
  [1] "3"    "Ohio" "2"    "Ohio" "Ohio" "Ohio" "3"    "2"    "Ohio" "2"
 [11] "4"    "4"    "4"    "4"    "4"    "3"    "Ohio" "3"    "Ohio" "4"
 [21] "4"    "4"    "3"    "2"    "2"    "3"    "2"    "4"    "2"    "4"
 [31] "3"    "2"    "2"    "4"    "2"    "2"    "3"    "Ohio" "4"    "2"
 [41] "2"    "3"    "Ohio" "4"    "Ohio" "2"    "3"    "3"    "3"    "2"
 [51] "Ohio" "4"    "4"    "Ohio" "3"    "2"    "4"    "2"    "4"    "4"
 [61] "4"    "2"    "3"    "2"    "3"    "2"    "3"    "Ohio" "3"    "4"
 [71] "4"    "2"    "Ohio" "4"    "2"    "2"    "2"    "Ohio" "3"    "Ohio"
 [81] "4"    "2"    "2"    "Ohio" "Ohio" "Ohio" "4"    "Ohio" "4"    "4"
 [91] "4"    "Ohio" "Ohio" "3"    "2"    "2"    "4"    "3"    "Ohio" "4"   
We can see that the $Color column is a character while $State is a factor.
Updating Values in a Factor
Suppose we want to keep the colors of cars as factors for some other
operations we want to perform. Write code for replacing ‘Blue’ with
‘Green’ in the $Color column of the cars dataset without
importing the data with stringsAsFactors=FALSE.
R
carSpeeds <- read.csv(file = 'data/car-speeds.csv')
# Replace 'Blue' with 'Green' in cars$Color without using the stringsAsFactors
# or as.is arguments
carSpeeds$Color <- ifelse(as.character(carSpeeds$Color) == 'Blue',
                          'Green',
                          as.character(carSpeeds$Color))
# Convert colors back to factors
carSpeeds$Color <- as.factor(carSpeeds$Color)
The strip.white Argument
It is not uncommon for mistakes to have been made when the data were
recorded, for example a space (whitespace) may have been inserted before
a data value. By default this whitespace will be kept in the R
environment, such that ‘ Red’ (with a leading space) will be recognized as a different value
than ‘Red’. In order to avoid this type of error, use the
strip.white argument. Let’s see how this works by checking
for the unique values in the $Color column of our
dataset:
Here, the data recorder added a space before the color of the car in one of the cells:
R
# We use the built-in unique() function to extract the unique colors in our dataset
unique(carSpeeds$Color)
OUTPUT
[1] Green  Red  White Red   Black
Levels:  Red Black Green Red White
Oops, we see two values for red cars.
Let’s try again, this time importing the data using the
strip.white argument. NOTE - this argument must be
accompanied by the sep argument, by which we indicate the
type of delimiter in the file (the comma for most .csv files):
R
carSpeeds <- read.csv(
  file = 'data/car-speeds.csv',
  stringsAsFactors = FALSE,
  strip.white = TRUE,
  sep = ','
)
unique(carSpeeds$Color)
OUTPUT
[1] "Blue"  "Red"   "White" "Black"
That’s better!
Specify Missing Data When Loading
It is common for data sets to have missing values, or mistakes. The convention for recording missing values often depends on the individual who collected the data; they can be recorded as n.a., --, or empty cells " ". R recognises the reserved character string NA as a missing value, but not some of the examples above. Let's say the inflammation scale in the data set we used earlier, inflammation-01.csv, actually starts at 1 for no inflammation, and the zero values (0) were a missed observation. Looking at the ?read.csv help page, is there an argument we could use to ensure all zeros (0) are read in as NA? Perhaps the car-speeds.csv data contains mistakes, and the person measuring the car speeds could not accurately distinguish between "Black" and "Blue" cars. Is there a way to specify more than one 'string', such as "Black" and "Blue", to be replaced by NA?
R
read.csv(file = "data/inflammation-01.csv", na.strings = "0")
or, in car-speeds.csv, use a character vector for multiple values:
R
read.csv(
  file = 'data/car-speeds.csv',
  na.strings = c("Black", "Blue")
)
Write a New .csv and Explore the Arguments
After altering our cars dataset by replacing ‘Blue’ with ‘Green’ in
the $Color column, we now want to save the output. There
are several arguments for the write.csv(...) function call, a few of which
are particularly important for how the data are exported. Let’s explore
these now.
R
# Export the first rows of data. The write.csv() function requires a minimum of
# two arguments, the data to be saved and the name of the output file.
write.csv(head(carSpeeds), file = 'data/car-speeds-cleaned.csv')
If you open the file, you’ll see that it has header names, because the data had headers within R, but that there are numbers in the first column.
"","Color","Speed","State"
"1","Blue",32,"NewMexico"
"2","Red",45,"Arizona"
"3","Blue",35,"Colorado"
"4","White",34,"Arizona"
"5","Red",25,"Arizona"
"6","Blue",41,"Arizona"
The row.names Argument
This argument allows us to set the names of the rows in the output
data file. R’s default for this argument is TRUE, and since
it does not know what else to name the rows for the cars data set, it
resorts to using row numbers. To correct this, we can set
row.names to FALSE:
R
write.csv(head(carSpeeds), file = 'data/car-speeds-cleaned.csv', row.names = FALSE)
Now we see:
"Color","Speed","State"
"Blue",32,"NewMexico"
"Red",45,"Arizona"
"Blue",35,"Colorado"
"White",34,"Arizona"
"Red",25,"Arizona"
"Blue",41,"Arizona"
Setting Column Names
There is also a col.names argument, which can be used to
set the column names for a data set without headers. If the data set
already has headers (e.g., we used the header = TRUE
argument when importing the data) then a col.names argument
will be ignored.
The na Argument
There are times when we want to specify certain values for
NAs in the data set (e.g., we are going to pass the data to
a program that only accepts -9999 as a nodata value). In this case, we
want to set the NA value of our output file to the desired
value, using the na argument. Let’s see how this works:
R
# First, replace the speed in the 3rd row with NA, by using an index (square
# brackets to indicate the position of the value we want to replace)
carSpeeds$Speed[3] <- NA
head(carSpeeds)
OUTPUT
  Color Speed     State
1  Blue    32 NewMexico
2   Red    45   Arizona
3  Blue    NA  Colorado
4 White    34   Arizona
5   Red    25   Arizona
6  Blue    41   Arizona
R
write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)
Now we’ll set NA to -9999 when we write the new .csv
file:
R
# Note - the na argument requires a string input
write.csv(head(carSpeeds),
          file = 'data/car-speeds-cleaned.csv',
          row.names = FALSE,
          na = '-9999')
And we see:
"Color","Speed","State"
"Blue",32,"NewMexico"
"Red",45,"Arizona"
"Blue",-9999,"Colorado"
"White",34,"Arizona"
"Red",25,"Arizona"
"Blue",41,"Arizona"
- Import data from a .csv file using the read.csv(...) function.
 - Understand some of the key arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.
 - Write data to a new .csv file using the write.csv(...) function.
 - Understand some of the key arguments available for exporting the data properly, such as row.names, col.names, and na.
Content from Understanding Factors
Last updated on 2025-10-07 | Edit this page
Estimated time: 25 minutes
Overview
Questions
- How is categorical data represented in R?
 - How do I work with factors?
 
Objectives
- Understand how to represent categorical data in R.
 - Know the difference between ordered and unordered factors.
 - Be aware of some of the problems encountered when using factors.
 
Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.
Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
The factor() Command
The factor() command is used to create and modify
factors in R:
R
sex <- factor(c("male", "female", "female", "male"))
R will assign 1 to the level "female" and
2 to the level "male" (because f
comes before m, even though the first element in this
vector is "male"). You can check this by using the function
levels(), and check the number of levels using
nlevels():
R
levels(sex)
OUTPUT
[1] "female" "male"  
R
nlevels(sex)
OUTPUT
[1] 2
Sometimes the order of the factors does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Additionally, specifying the order of the levels allows us to compare levels:
R
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
OUTPUT
[1] "high"   "low"    "medium"
R
food <- factor(food, levels = c("low", "medium", "high"))
levels(food)
OUTPUT
[1] "low"    "medium" "high"  
R
min(food) # doesn't work
ERROR
Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), levels = c("low", : 'min' not meaningful for factors
R
food <- factor(food, levels = c("low", "medium", "high"), ordered = TRUE)
levels(food)
OUTPUT
[1] "low"    "medium" "high"  
R
min(food) # works!
OUTPUT
[1] low
Levels: low < medium < high
In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than using simple integer labels because factors are self-describing: “low”, “medium”, and “high” is more descriptive than 1, 2, 3. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the subjects in our example data set).
Representing Data in R
You have a vector representing levels of exercise undertaken by 5 subjects
“L”, “N”, “N”, “I”, “L” ; N=none, L=light, I=intense
What is the best way to represent this in R?
a) exercise <- c("L", "N", "N", "I", "L")
b) exercise <- factor(c("L", "N", "N", "I", "L"), ordered = TRUE)
c) exercise <- factor(c("L", "N", "N", "I", "L"), levels = c("N", "L", "I"), ordered = FALSE)
d) exercise <- factor(c("L", "N", "N", "I", "L"), levels = c("N", "L", "I"), ordered = TRUE)
Correct solution is d.
R
exercise <- factor(c("L", "N", "N", "I", "L"), levels = c("N", "L", "I"), ordered = TRUE)
We only expect three categories (“N”, “L”, “I”). We can order these
from least intense to most intense, so let’s use
ordered.
Converting Factors
Converting from a factor to a number can cause problems:
R
f <- factor(c(3.4, 1.2, 5))
as.numeric(f)
OUTPUT
[1] 2 1 3
This does not behave as expected (and there is no warning): as.numeric() returns the underlying integer codes of the factor, not the original values.
The recommended way is to use the integer vector to index the factor levels:
R
levels(f)[f]
OUTPUT
[1] "3.4" "1.2" "5"  
This returns a character vector, the as.numeric()
function is still required to convert the values to the proper type
(numeric).
R
f <- levels(f)[f]
f <- as.numeric(f)
Using Factors
Let's load our example data to see the use of factors:
R
dat <- read.csv(file = 'data/sample.csv', stringsAsFactors = TRUE)
Default Behavior
stringsAsFactors = TRUE was the default behavior for R prior to version 4.0. We are using it here to override the default behaviour from R version 4.0 onwards, which is stringsAsFactors = FALSE. It is included here for clarity.
R
str(dat)
OUTPUT
'data.frame':	100 obs. of  9 variables:
 $ ID           : Factor w/ 100 levels "Sub001","Sub002",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Gender       : Factor w/ 4 levels "f","F","m","M": 3 3 3 1 3 4 1 3 3 1 ...
 $ Group        : Factor w/ 3 levels "Control","Treatment1",..: 1 3 3 2 2 3 1 3 3 1 ...
 $ BloodPressure: int  132 139 130 105 125 112 173 108 131 129 ...
 $ Age          : num  16 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 ...
 $ Aneurisms_q1 : int  114 148 196 199 188 260 135 216 117 188 ...
 $ Aneurisms_q2 : int  140 209 251 140 120 266 98 238 215 144 ...
 $ Aneurisms_q3 : int  202 248 122 233 222 320 154 279 181 192 ...
 $ Aneurisms_q4 : int  237 248 177 220 228 294 245 251 272 185 ...
Notice the first 3 columns have been converted to factors. These values were text in the data file so R automatically interpreted them as categorical variables.
R
summary(dat)
OUTPUT
       ID     Gender        Group    BloodPressure        Age
 Sub001 : 1   f:35   Control   :30   Min.   : 62.0   Min.   :12.10
 Sub002 : 1   F: 4   Treatment1:35   1st Qu.:107.5   1st Qu.:14.78
 Sub003 : 1   m:46   Treatment2:35   Median :117.5   Median :16.65
 Sub004 : 1   M:15                   Mean   :118.6   Mean   :16.42
 Sub005 : 1                          3rd Qu.:133.0   3rd Qu.:18.30
 Sub006 : 1                          Max.   :173.0   Max.   :20.00
 (Other):94
  Aneurisms_q1    Aneurisms_q2    Aneurisms_q3    Aneurisms_q4
 Min.   : 65.0   Min.   : 80.0   Min.   :105.0   Min.   :116.0
 1st Qu.:118.0   1st Qu.:131.5   1st Qu.:182.5   1st Qu.:186.8
 Median :158.0   Median :162.5   Median :217.0   Median :219.0
 Mean   :158.8   Mean   :168.0   Mean   :219.8   Mean   :217.9
 3rd Qu.:188.0   3rd Qu.:196.8   3rd Qu.:248.2   3rd Qu.:244.2
 Max.   :260.0   Max.   :283.0   Max.   :323.0   Max.   :315.0
                                                                
Notice that the summary() function handles factors differently from numbers (and strings): the occurrence count for each value is often more useful information.
The summary() Function
The summary() function is a great way of spotting errors in your data (look at the dat$Gender column). It's also a great way of spotting missing data.
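For example, summarising just the Gender column makes the miscoded entries easy to spot (the counts below match the full summary shown above):
R
summary(dat$Gender)
OUTPUT
 f  F  m  M 
35  4 46 15 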
Reordering Factors
The function table() tabulates observations and can be
used to create bar plots quickly. For instance:
R
table(dat$Group)
OUTPUT
   Control Treatment1 Treatment2
        30         35         35 
R
barplot(table(dat$Group))
Use the factor() command to modify the column
dat$Group so that the control group is plotted
last.
R
dat$Group <- factor(dat$Group, levels = c("Treatment1", "Treatment2", "Control"))
barplot(table(dat$Group))
Removing Levels from a Factor
Some of the Gender values in our dataset have been coded incorrectly. Let’s remove levels from this factor.
R
barplot(table(dat$Gender))
Values should have been recorded as lowercase ‘m’ and ‘f’. We should correct this.
R
dat$Gender[dat$Gender == 'M'] <- 'm'
Updating Factors
R
plot(x = dat$Gender, y = dat$BloodPressure)
Why does this plot show 4 levels?
How many levels does dat$Gender have?
dat$Gender still has 4 levels (f, F, m, M): replacing the 'M' values with 'm' did not remove the now-unused level, so the plot shows all 4 levels.
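You can confirm this by inspecting the levels directly:
R
levels(dat$Gender)
OUTPUT
[1] "f" "F" "m" "M"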
We need to tell R that “M” is no longer a valid value for this
column. We use the droplevels() function to remove extra
levels.
R
dat$Gender <- droplevels(dat$Gender)
plot(x = dat$Gender, y = dat$BloodPressure)
Adjusting Factor Levels
Adjusting the levels() of a factor provides a useful
shortcut for reassigning values in this case.
R
levels(dat$Gender)[2] <- 'f'
plot(x = dat$Gender, y = dat$BloodPressure)
- Factors are used to represent categorical data.
 - Factors can be ordered or unordered.
 - Some R functions have special methods for handling factors.
 
Content from Data Types and Structures
Last updated on 2025-10-07 | Edit this page
Estimated time: 55 minutes
Overview
Questions
- What are the different data types in R?
 - What are the different data structures in R?
 - How do I access data within the various data structures?
 
Objectives
- Expose learners to the different data types in R and show how these data types are used in data structures.
 - Learn how to create vectors of different types.
 - Be able to check the type of vector.
 - Learn about missing data and other special values.
 - Get familiar with the different data structures (lists, matrices, data frames).
 
Understanding Basic Data Types and Data Structures in R
To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on them.
Data structures are very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.
Everything in R is an object.
R has 6 basic data types. (In addition to the five listed below, there is also raw which will not be discussed in this workshop.)
- character
 - numeric (real or decimal)
 - integer
 - logical
 - complex
 
Elements of these data types may be combined to form data structures, such as atomic vectors. When we call a vector atomic, we mean that the vector only holds data of a single data type. Below are examples of atomic character vectors, numeric vectors, integer vectors, etc.
- character: "a", "swc"
 - numeric: 2, 15.5
 - integer: 2L (the L tells R to store this as an integer)
 - logical: TRUE, FALSE
 - complex: 1+4i (complex numbers with real and imaginary parts)
R provides many functions to examine features of vectors and other objects, for example
- class() - what kind of object is it (high-level)?
 - typeof() - what is the object’s data type (low-level)?
 - length() - how long is it? What about two dimensional objects?
 - attributes() - does it have any metadata?
R
# Example
x <- "dataset"
typeof(x)
OUTPUT
[1] "character"
R
attributes(x)
OUTPUT
NULL
R
y <- 1:10
y
OUTPUT
 [1]  1  2  3  4  5  6  7  8  9 10
R
typeof(y)
OUTPUT
[1] "integer"
R
length(y)
OUTPUT
[1] 10
R
z <- as.numeric(y)
z
OUTPUT
 [1]  1  2  3  4  5  6  7  8  9 10
R
typeof(z)
OUTPUT
[1] "double"
R has many data structures. These include
- atomic vector
 - list
 - matrix
 - data frame
 - factors
 
Vectors
A vector is the most common and basic data structure in R and is pretty much the workhorse of R. Technically, vectors can be one of two types:
- atomic vectors
 - lists
 
although the term “vector” most commonly refers to the atomic types not to lists.
The Different Vector Modes
A vector is a collection of elements that are most commonly of mode
character, logical, integer or
numeric.
You can create an empty vector with vector(). (By
default the mode is logical. You can be more explicit as
shown in the examples below.) It is more common to use direct
constructors such as character(), numeric(),
etc.
R
vector() # an empty 'logical' (the default) vector
OUTPUT
logical(0)
R
vector("character", length = 5) # a vector of mode 'character' with 5 elements
OUTPUT
[1] "" "" "" "" ""
R
character(5) # the same thing, but using the constructor directly
OUTPUT
[1] "" "" "" "" ""
R
numeric(5)   # a numeric vector with 5 elements
OUTPUT
[1] 0 0 0 0 0
R
logical(5)   # a logical vector with 5 elements
OUTPUT
[1] FALSE FALSE FALSE FALSE FALSE
You can also create vectors by directly specifying their content. R will then guess the appropriate mode of storage for the vector. For instance:
R
x <- c(1, 2, 3)
will create a vector x of mode numeric.
These are the most common kind, and are treated as double precision real
numbers. If you wanted to explicitly create integers, you need to add an
L to each element (or coerce to the integer type
using as.integer()).
R
x1 <- c(1L, 2L, 3L)
Using TRUE and FALSE will create a vector
of mode logical:
R
y <- c(TRUE, TRUE, FALSE, FALSE)
While using quoted text will create a vector of mode
character:
R
z <- c("Sarah", "Tracy", "Jon")
Examining Vectors
The functions typeof(), length(),
class() and str() provide useful information
about your vectors and R objects in general.
R
typeof(z)
OUTPUT
[1] "character"
R
length(z)
OUTPUT
[1] 3
R
class(z)
OUTPUT
[1] "character"
R
str(z)
OUTPUT
 chr [1:3] "Sarah" "Tracy" "Jon"
Adding Elements
The function c() (for combine) can also be used to add
elements to a vector.
R
z <- c(z, "Annette")
z
OUTPUT
[1] "Sarah"   "Tracy"   "Jon"     "Annette"
R
z <- c("Greg", z)
z
OUTPUT
[1] "Greg"    "Sarah"   "Tracy"   "Jon"     "Annette"
Vectors from a Sequence of Numbers
You can create vectors as a sequence of numbers.
R
series <- 1:10
seq(10)
OUTPUT
 [1]  1  2  3  4  5  6  7  8  9 10
R
seq(from = 1, to = 10, by = 0.1)
OUTPUT
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0
Missing Data
R supports missing data in vectors. They are represented as
NA (Not Available) and can be used for all the vector types
covered in this lesson:
R
x <- c(0.5, NA, 0.7)
x <- c(TRUE, FALSE, NA)
x <- c("a", NA, "c", "d", "e")
x <- c(1+5i, 2-3i, NA)
The function is.na() indicates the elements of the
vectors that represent missing data, and the function
anyNA() returns TRUE if the vector contains
any missing values:
R
x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)
OUTPUT
[1] FALSE  TRUE FALSE FALSE  TRUE
R
is.na(y)
OUTPUT
[1] FALSE FALSE FALSE FALSE FALSE
R
anyNA(x)
OUTPUT
[1] TRUE
R
anyNA(y)
OUTPUT
[1] FALSE
Other Special Values
Inf is infinity. You can have either positive or
negative infinity.
R
1/0
OUTPUT
[1] Inf
NaN means Not a Number. It’s an undefined value.
R
0/0
OUTPUT
[1] NaN
What Happens When You Mix Types Inside a Vector?
R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”. For instance, can you guess what the following do (without running them first)?
R
xx <- c(1.7, "a")
xx <- c(TRUE, 2)
xx <- c("a", TRUE)
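To check your guesses, you can inspect the resulting mode of each combination with typeof(); the comments below show what base R's coercion rules produce:
R
typeof(c(1.7, "a"))   # "character" -- the number is converted to a string
typeof(c(TRUE, 2))    # "double"    -- TRUE is converted to 1
typeof(c("a", TRUE))  # "character" -- TRUE is converted to the string "TRUE"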
You can also control how vectors are coerced explicitly using the
as.<class_name>() functions:
R
as.numeric("1")
OUTPUT
[1] 1
R
as.character(1:2)
OUTPUT
[1] "1" "2"
Finding Commonalities
Do you see a property that’s common to all these vectors above?
All vectors are one-dimensional and each element is of the same type.
Objects Attributes
Objects can have attributes. Attributes are part of the object. These include:
- names
 - dimnames
 - dim
 - class
 - attributes (contain metadata)
 
You can also glean other attribute-like information such as length (works on vectors and lists) or number of characters (for character strings).
R
length(1:10)
OUTPUT
[1] 10
R
nchar("Software Carpentry")
OUTPUT
[1] 18
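Attributes can also be set and inspected directly. A minimal sketch (the vector and its names here are made up for illustration):
R
x <- 1:3
names(x) <- c("a", "b", "c")  # attach a names attribute
attributes(x)
OUTPUT
$names
[1] "a" "b" "c"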
Matrix
In R matrices are an extension of the numeric or character vectors. They are not a separate type of object but simply an atomic vector with dimensions; the number of rows and columns. As with atomic vectors, the elements of a matrix must be of the same data type.
R
m <- matrix(nrow = 2, ncol = 2)
m
OUTPUT
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
R
dim(m)
OUTPUT
[1] 2 2
You can check that matrices are vectors with a class attribute of
matrix by using class() and
typeof().
R
m <- matrix(c(1:3))
class(m)
OUTPUT
[1] "matrix" "array" 
R
typeof(m)
OUTPUT
[1] "integer"
While class() shows that m is a matrix,
typeof() shows that fundamentally the matrix is an integer
vector.
Data types of matrix elements
Consider the following matrix:
R
FOURS <- matrix(
  c(4, 4, 4, 4),
  nrow = 2,
  ncol = 2
)
Given that typeof(FOURS[1]) returns
"double", what would you expect
typeof(FOURS) to return? How do you know this is the case
even without running this code?
Hint: Can matrices be composed of elements of different data types?
We know that typeof(FOURS) will also return
"double" since matrices are made of elements of the same
data type. Note that you could do something like
as.character(FOURS) if you needed the elements of
FOURS as characters.
Matrices in R are filled column-wise.
R
m <- matrix(1:6, nrow = 2, ncol = 3)
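Printing m shows the column-wise filling: the first column receives 1 and 2, the second 3 and 4, and so on.
R
m
OUTPUT
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6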
Other ways to construct a matrix
R
m      <- 1:10
dim(m) <- c(2, 5)
This takes a vector and transforms it into a matrix with 2 rows and 5 columns.
Another way is to bind columns or rows using rbind() and
cbind() (“row bind” and “column bind”, respectively).
R
x <- 1:3
y <- 10:12
cbind(x, y)
OUTPUT
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
R
rbind(x, y)
OUTPUT
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
You can also use the byrow argument to specify how the
matrix is filled. From R’s own documentation:
R
mdat <- matrix(c(1, 2, 3, 11, 12, 13),
               nrow = 2,
               ncol = 3,
               byrow = TRUE)
mdat
OUTPUT
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13
Elements of a matrix can be referenced by specifying the index along each dimension (e.g. “row” and “column”) in single square brackets.
R
mdat[2, 3]
OUTPUT
[1] 13
List
In R lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a single mode and can encompass any mixture of data types. Lists are sometimes called generic vectors, because the elements of a list can be of any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.
A list is a special type of vector. Each element can be a different type.
Create lists using list() or coerce other objects using
as.list(). An empty list of the required length can be
created using vector()
R
x <- list(1, "a", TRUE, 1+4i)
x
OUTPUT
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
R
x <- vector("list", length = 5) # empty list
length(x)
OUTPUT
[1] 5
The content of elements of a list can be retrieved by using double square brackets.
R
x[[1]]
OUTPUT
NULL
Vectors can be coerced to lists as follows:
R
x <- 1:10
x <- as.list(x)
length(x)
OUTPUT
[1] 10
Examining Lists
- What is the class of x[1]?
 - What is the class of x[[1]]?
- 
R
class(x[1])
OUTPUT
[1] "list"
 - 
R
class(x[[1]])
OUTPUT
[1] "integer"
Elements of a list can be named (i.e. lists can have the
names attribute)
R
xlist <- list(a = "Karthik Ram", b = 1:10, data = head(mtcars))
xlist
OUTPUT
$a
[1] "Karthik Ram"
$b
 [1]  1  2  3  4  5  6  7  8  9 10
$data
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
R
names(xlist)
OUTPUT
[1] "a"    "b"    "data"
Examining Named Lists
- What is the length of this object?
- What is its structure?
R
length(xlist)
OUTPUT
[1] 3
R
str(xlist)
OUTPUT
List of 3
 $ a   : chr "Karthik Ram"
 $ b   : int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ data:'data.frame':	6 obs. of  11 variables:
  ..$ mpg : num [1:6] 21 21 22.8 21.4 18.7 18.1
  ..$ cyl : num [1:6] 6 6 4 6 8 6
  ..$ disp: num [1:6] 160 160 108 258 360 225
  ..$ hp  : num [1:6] 110 110 93 110 175 105
  ..$ drat: num [1:6] 3.9 3.9 3.85 3.08 3.15 2.76
  ..$ wt  : num [1:6] 2.62 2.88 2.32 3.21 3.44 ...
  ..$ qsec: num [1:6] 16.5 17 18.6 19.4 17 ...
  ..$ vs  : num [1:6] 0 0 1 1 0 1
  ..$ am  : num [1:6] 1 1 1 0 0 0
  ..$ gear: num [1:6] 4 4 4 3 3 3
  ..$ carb: num [1:6] 4 4 1 1 2 1
Lists can be extremely useful inside functions. Because functions in R can return only a single object, you can "staple" together lots of different kinds of results into a single list that a function can return.
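A small sketch of this idea (the function name min_max is made up for illustration): a single function returning several related values bundled in a named list.
R
min_max <- function(v) {
  # return the smallest value, the largest value, and their difference
  # bundled together as one named list
  list(minimum = min(v), maximum = max(v), spread = max(v) - min(v))
}
min_max(c(4, 8, 15))
OUTPUT
$minimum
[1] 4
$maximum
[1] 15
$spread
[1] 11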
A list does not print to the console like a vector. Instead, each element of the list starts on a new line.
Elements are indexed by double brackets. Single brackets will still
return a(nother) list. If the elements of a list are named, they can be
referenced by the $ notation
(i.e. xlist$data).
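For example, with the xlist object created above, the three ways of accessing an element compare like this:
R
xlist$a        # the element itself: a character vector of length 1
xlist[["a"]]   # the same element, selected by name with double brackets
xlist["a"]     # single brackets return a list of length 1 containing that element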
Data Frame
A data frame is a very important data type in R. It’s pretty much the de facto data structure for most tabular data and what we use for statistics.
A data frame is a special type of list where every element of the list has the same length (i.e. a data frame is a "rectangular" list).
Data frames can have additional attributes such as
rownames(), which can be useful for annotating data, like
subject_id or sample_id. But most of the time
they are not used.
Some additional information on data frames:
- Usually created by read.csv() and read.table(), i.e. when importing the data into R.
- Assuming all columns in a data frame are of the same type, the data frame can be converted to a matrix with data.matrix() (preferred) or as.matrix(); a short sketch of this follows the hand-made data frame below. Otherwise type coercion will be enforced and the results may not always be what you expect.
- A new data frame can also be created with the data.frame() function.
- Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.
- Rownames are often automatically generated and look like 1, 2, …, n. Consistency in numbering of rownames may not be honored when rows are reshuffled or subset.
Creating Data Frames by Hand
To create data frames by hand:
R
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
OUTPUT
   id  x  y
1   a  1 11
2   b  2 12
3   c  3 13
4   d  4 14
5   e  5 15
6   f  6 16
7   g  7 17
8   h  8 18
9   i  9 19
10  j 10 20
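Relating back to the data.matrix() point in the list above: dat mixes a non-numeric id column with numeric columns, so we would drop id before converting. A minimal sketch:
R
dat_num <- dat[, c("x", "y")]  # keep only the numeric columns
data.matrix(dat_num)           # conversion is safe once every column is numeric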
Useful Data Frame Functions
- head() - shows first 6 rows
- tail() - shows last 6 rows
- dim() - returns the dimensions of the data frame (i.e. number of rows and number of columns)
- nrow() - number of rows
- ncol() - number of columns
- str() - structure of the data frame: name, type and preview of data in each column
- names() or colnames() - both show the names attribute for a data frame
- sapply(dataframe, class) - shows the class of each column in the data frame
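For example, a few of these applied to the dat data frame created above (the class reported for the id column depends on your R version: character on R >= 4.0 and factor before that):
R
head(dat)            # first 6 rows of dat
str(dat)             # column names, types, and a preview of the values
sapply(dat, class)   # class of each column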
See that it is actually a special list:
R
is.list(dat)
OUTPUT
[1] TRUE
R
class(dat)
OUTPUT
[1] "data.frame"
Because data frames are rectangular, elements of a data frame can be referenced by specifying the row and the column index in single square brackets (similar to a matrix).
R
dat[1, 3]
OUTPUT
[1] 11
As data frames are also lists, it is possible to refer to columns (which are elements of such a list) using the list notation, i.e. either double square brackets or a $.
R
dat[["y"]]
OUTPUT
 [1] 11 12 13 14 15 16 17 18 19 20
R
dat$y
OUTPUT
 [1] 11 12 13 14 15 16 17 18 19 20
The following table summarizes the one-dimensional and two-dimensional data structures in R in relation to the diversity of data types they can contain.
| Dimensions | Homogenous | Heterogeneous | 
|---|---|---|
| 1-D | atomic vector | list | 
| 2-D | matrix | data frame | 
Lists can contain elements that are themselves multi-dimensional (e.g. a list can contain data frames or other types of objects). Lists can also contain elements of any length, so lists do not necessarily have to be "rectangular". However, in order for a list to qualify as a data frame, the length of each element has to be the same.
Column Types in Data Frames
Knowing that data frames are lists, can columns be of different types?
What type of structure do you expect to see when you explore the
structure of the PlantGrowth data frame? Hint: Use
str().
The weight column is numeric while group is a factor. Lists can have elements of different types. Since a data frame is just a special type of list, it can have columns of differing types (although, remember that the type must be consistent within each column!).
R
str(PlantGrowth)
OUTPUT
'data.frame':	30 obs. of  2 variables:
 $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
- R’s basic data types are character, numeric, integer, complex, and logical.
 - R’s basic data structures include the vector, list, matrix, data frame, and factors. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. lists, data frames).
 - Objects may have attributes, such as name, dimension, and class.
 
Content from The Call Stack
Last updated on 2025-10-07 | Edit this page
Estimated time: 20 minutes
Overview
Questions
- What is the call stack, and how does R know what order to do things in?
 - How does scope work in R?
 
Objectives
- Explain how stack frames are created and destroyed as functions are called.
 - Correctly identify the scope of a function’s local variables.
 - Explain variable scope in terms of the call stack.
 
The Call Stack
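This episode traces what happens when we call the temperature-conversion functions written in the previous episode. In case they are no longer defined in your session, here is a sketch of how they might look, consistent with the description that follows (your own versions may differ slightly):
R
fahrenheit_to_celsius <- function(temp_F) {
  # convert a temperature from Fahrenheit to Celsius
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}
celsius_to_kelvin <- function(temp_C) {
  # convert a temperature from Celsius to Kelvin
  temp_K <- temp_C + 273.15
  return(temp_K)
}
fahrenheit_to_kelvin <- function(temp_F) {
  # convert Fahrenheit to Kelvin by composing the two functions above
  temp_C <- fahrenheit_to_celsius(temp_F)
  temp_K <- celsius_to_kelvin(temp_C)
  return(temp_K)
}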
Let’s take a closer look at what happens when we call
fahrenheit_to_kelvin(32). To make things clearer, we’ll
start by putting the initial value 32 in a variable and store the final
result in one as well:
R
original <- 32
final <- fahrenheit_to_kelvin(original)
The diagram below shows what memory looks like after the first line has been executed:
When we call fahrenheit_to_kelvin, R doesn’t
create the variable temp_F right away. Instead, it creates
something called a stack frame
to keep track of the variables defined by
fahrenheit_to_kelvin. Initially, this stack frame only
holds the value of temp_F:
When we call fahrenheit_to_celsius inside
fahrenheit_to_kelvin, R creates another stack frame to hold
fahrenheit_to_celsius’s variables:
It does this because there are now two variables in play called
temp_F: the argument to fahrenheit_to_celsius,
and the argument to fahrenheit_to_kelvin. Having two
variables with the same name in the same part of the program would be
ambiguous, so R (and every other modern programming language) creates a
new stack frame for each function call to keep that function’s variables
separate from those defined by other functions.
When the call to fahrenheit_to_celsius returns a value,
R throws away fahrenheit_to_celsius’s stack frame and
creates a new variable in the stack frame for
fahrenheit_to_kelvin to hold the temperature in
Celsius:
It then calls celsius_to_kelvin, which means it creates
a stack frame to hold that function’s variables:
Once again, R throws away that stack frame when
celsius_to_kelvin is done and creates the variable
temp_K in the stack frame for
fahrenheit_to_kelvin:
Finally, when fahrenheit_to_kelvin is done, R throws
away its stack frame and puts its result in a new variable
called final that lives in the stack frame we started
with:
This final stack frame is always there; it holds the variables we
defined outside the functions in our code. What it doesn’t hold
is the variables that were in the various stack frames. If we try to get
the value of temp_F after our functions have finished
running, R tells us that there’s no such thing:
R
temp_F
ERROR
Error: object 'temp_F' not found
Where to Learn More
The explanation of the stack frame above was very general and the basic concept will help you understand most languages you try to program with. However, R has some unique aspects that can be exploited when performing more complicated operations. We will not be writing anything that requires knowledge of these more advanced concepts. In the future, when you are comfortable writing functions in R, you can learn more by reading the R Language Manual or this chapter from Advanced R Programming by Hadley Wickham. For context, R uses the terminology "environments" instead of frames.
Why go to all this trouble? Well, here’s a function called
span that calculates the difference between the minimum and
maximum values in an array:
R
span <- function(a) {
  diff <- max(a) - min(a)
  return(diff)
}
dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
# span of inflammation data
span(dat)
OUTPUT
[1] 20
Notice that span assigns a value to a variable called diff. We might very well use a variable with the same name (diff) to hold the inflammation data:
R
diff <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
# span of inflammation data
span(diff)
OUTPUT
[1] 20
We don’t expect the variable diff to have the value 20
after this function call, so the name diff cannot refer to
the same variable defined inside span as it does in the
main body of our program (which R refers to as the global environment).
And yes, we could probably choose a different name than
diff for our variable in this case, but we don’t want to
have to read every line of code of the R functions we call to see what
variable names they use, just in case they change the values of our
variables.
The big idea here is encapsulation, and it’s the key to writing correct, comprehensible programs. A function’s job is to turn several operations into one so that we can think about a single function call instead of a dozen or a hundred statements each time we want to do something. That only works if functions don’t interfere with each other; if they do, we have to pay attention to the details once again, which quickly overloads our short-term memory.
Following the Call Stack
We previously wrote functions called highlight and
edges. Draw a diagram showing how the call stack changes
when we run the following:
R
inner_vec <- "carbon"
outer_vec <- "+"
result <- edges(highlight(inner_vec, outer_vec))
- R keeps track of active function calls using a call stack comprised of stack frames.
 - Only global variables and variables in the current stack frame can be accessed directly.
 
Content from Loops in R
Last updated on 2025-10-07 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- How can I do the same thing multiple times more efficiently in R?
 - What is vectorization?
- Should I use a loop or an apply statement?
Objectives
- Compare loops and vectorized operations.
 - Use the apply family of functions.
 
In R you have multiple options when repeating calculations:
vectorized operations, for loops, and apply
functions.
This lesson is an extension of Analyzing
Multiple Data Sets. In that lesson, we introduced how to run a
custom function, analyze, over multiple data files:
R
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
R
filenames <- list.files(path = "data", pattern = "inflammation-[0-9]{2}.csv", full.names = TRUE)
Vectorized Operations
A key difference between R and many other languages is a topic known
as vectorization. When you wrote the total function, we
mentioned that R already has sum to do this;
sum is much faster than the interpreted
for loop because sum is coded in C to work
with a vector of numbers. Many of R’s functions work this way; the loop
is hidden from you in C. Learning to use vectorized operations is a key
skill in R.
For example, to add pairs of numbers contained in two vectors
R
a <- 1:10
b <- 1:10
You could loop over the pairs adding each in turn, but that would be very inefficient in R.
Instead of using i in a to make our loop variable, we
use the function seq_along to generate indices for each
element a contains.
R
res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res
OUTPUT
 [1]  2  4  6  8 10 12 14 16 18 20
Instead, + is a vectorized function which can
operate on entire vectors at once
R
res2 <- a + b
all.equal(res, res2)
OUTPUT
[1] TRUE
Vector Recycling
When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:
R
a <- 1:10
b <- 1:5
a + b
OUTPUT
 [1]  2  4  6  8 10  7  9 11 13 15
The elements of a and b are added together starting from the first element of both vectors. When R reaches the end of the shorter vector b, it starts again at the first element of b and continues until it reaches the last element of the longer vector a. This behaviour may seem
crazy at first glance, but it is very useful when you want to perform
the same operation on every element of a vector. For example, say we
want to multiply every element of our vector a by 5:
R
a <- 1:10
b <- 5
a * b
OUTPUT
 [1]  5 10 15 20 25 30 35 40 45 50
Remember there are no scalars in R, so b is actually a vector of length 1; in order to multiply every element of a by its value, b is recycled to match the length of a.
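One way to see what recycling is doing here is to repeat b explicitly and check that the result is the same:
R
all.equal(a * b, a * rep(b, length(a)))  # recycling b is equivalent to repeating it 10 times
OUTPUT
[1] TRUE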
When the length of the longer object is a multiple of the shorter object length (as in our example above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:
R
a <- 1:10
b <- 1:7
a + b
WARNING
Warning in a + b: longer object length is not a multiple of shorter object
length
OUTPUT
 [1]  2  4  6  8 10 12 14  9 11 13
for or apply?
A for loop is used to apply the same function calls to a
collection of objects. R has a family of functions, the
apply family, which can be used in much the same way.
You’ve already used one of the family, apply in the first
lesson. The apply
family members include
- apply - apply over the margins of an array (e.g. the rows or columns of a matrix)
- lapply - apply over an object and return a list
- sapply - apply over an object and return a simplified object (an array) if possible
- vapply - similar to sapply but you specify the type of object returned by the iterations
Each of these has an argument FUN which takes a function
to apply to each element of the object. Instead of looping over
filenames and calling analyze, as you did
earlier, you could sapply over filenames with
FUN = analyze:
R
sapply(filenames, FUN = analyze)
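To see how these family members differ in what they return, here is a small made-up example (the data frame df is purely illustrative):
R
df <- data.frame(x = 1:3, y = 4:6)
lapply(df, mean)                          # a list of means, one element per column
sapply(df, mean)                          # simplified to a named numeric vector
vapply(df, mean, FUN.VALUE = numeric(1))  # like sapply, but each result must be a single number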
Deciding whether to use for or one of the apply family is really personal preference. Using an apply family function forces you to encapsulate your operations as a function rather than as separate calls with for. for loops are often more natural in some circumstances; for several related operations, a for loop saves you from having to pass a lot of extra arguments to your function.
Loops in R Are Slow
No, they are not! If you follow some golden rules:
- Don’t use a loop when a vectorized alternative exists
 - Don’t grow objects (via 
c,cbind, etc) during the loop - R has to create a new object and copy across the information just to add a new element or row/column - Allocate an object to hold the results and fill it in during the loop
 
As an example, we’ll create a new version of analyze
that will return the mean inflammation per day (column) of each
file.
R
analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}
system.time(avg2 <- analyze2(filenames))
OUTPUT
   user  system elapsed
  0.022   0.001   0.023 
Note how we add a new column to out at each iteration. This is a cardinal sin when writing a for loop in R.
Instead, we can create an empty matrix with the right dimensions
(rows/columns) to hold the results. Then we loop over the files but this
time we fill in the fth column of our results matrix
out. This time there is no copying/growing for R to deal
with.
R
analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) # assuming 40 here from files
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}
system.time(avg3 <- analyze3(filenames))
OUTPUT
   user  system elapsed
  0.022   0.001   0.023 
In this simple example there is little difference in the compute time
of analyze2 and analyze3. This is because we
are only iterating over 12 files and hence we only incur 12 copy/grow
operations. If we were doing this over more files or the data objects we
were growing were larger, the penalty for copying/growing would be much
larger.
Note that apply handles these memory allocation issues
for you, but then you have to write the loop part as a function to pass
to apply. At its heart, apply is just a
for loop with extra convenience.
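For instance, the per-file column means produced by analyze3 could also be computed with sapply and an anonymous function. This is only a sketch; the name analyze4 is made up here:
R
analyze4 <- function(filenames) {
  # sapply collects one vector of daily means per file and
  # simplifies the results into a matrix with one column per file
  sapply(filenames, function(f) {
    fdata <- read.csv(f, header = FALSE)
    apply(fdata, 2, mean)
  })
}
avg4 <- analyze4(filenames)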
- Where possible, use vectorized operations instead of
forloops to make code faster and more concise. - Use functions such as 
applyinstead offorloops to operate on the values in a data structure.