Loops in R
Last updated on 2024-12-03 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- How can I do the same thing multiple times more efficiently in R?
- What is vectorization?
- Should I use a loop or an
apply
statement?
Objectives
- Compare loops and vectorized operations.
- Use the apply family of functions.
In R you have multiple options when repeating calculations:
vectorized operations, for
loops, and apply
functions.
This lesson is an extension of Analyzing
Multiple Data Sets. In that lesson, we introduced how to run a
custom function, analyze
, over multiple data files:
R
analyze <- function(filename) {
# Plots the average, min, and max inflammation over time.
# Input is character string of a csv file.
dat <- read.csv(file = filename, header = FALSE)
avg_day_inflammation <- apply(dat, 2, mean)
plot(avg_day_inflammation)
max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation)
min_day_inflammation <- apply(dat, 2, min)
plot(min_day_inflammation)
}
R
filenames <- list.files(path = "data", pattern = "inflammation-[0-9]{2}.csv", full.names = TRUE)
Vectorized Operations
A key difference between R and many other languages is a topic known
as vectorization. When you wrote the total
function, we
mentioned that R already has sum
to do this;
sum
is much faster than the interpreted
for
loop because sum
is coded in C to work
with a vector of numbers. Many of R’s functions work this way; the loop
is hidden from you in C. Learning to use vectorized operations is a key
skill in R.
For example, to add pairs of numbers contained in two vectors
R
a <- 1:10
b <- 1:10
You could loop over the pairs adding each in turn, but that would be very inefficient in R.
Instead of using i in a
to make our loop variable, we
use the function seq_along
to generate indices for each
element a
contains.
R
res <- numeric(length = length(a))
for (i in seq_along(a)) {
res[i] <- a[i] + b[i]
}
res
OUTPUT
[1] 2 4 6 8 10 12 14 16 18 20
Instead, +
is a vectorized function which can
operate on entire vectors at once
R
res2 <- a + b
all.equal(res, res2)
OUTPUT
[1] TRUE
Vector Recycling
When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:
R
a <- 1:10
b <- 1:5
a + b
OUTPUT
[1] 2 4 6 8 10 7 9 11 13 15
The elements of a
and b
are added together
starting from the first element of both vectors. When R reaches the end
of the shorter vector b
, it starts again at the first
element of b
and continues until it reaches the last
element of the longest vector a
. This behaviour may seem
crazy at first glance, but it is very useful when you want to perform
the same operation on every element of a vector. For example, say we
want to multiply every element of our vector a
by 5:
R
a <- 1:10
b <- 5
a * b
OUTPUT
[1] 5 10 15 20 25 30 35 40 45 50
Remember there are no scalars in R, so b
is actually a
vector of length 1; in order to add its value to every element of
a
, it is recycled to match the length of
a
.
When the length of the longer object is a multiple of the shorter object length (as in our example above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:
R
a <- 1:10
b <- 1:7
a + b
WARNING
Warning in a + b: longer object length is not a multiple of shorter object
length
OUTPUT
[1] 2 4 6 8 10 12 14 9 11 13
for
or apply
?
A for
loop is used to apply the same function calls to a
collection of objects. R has a family of functions, the
apply
family, which can be used in much the same way.
You’ve already used one of the family, apply
in the first
lesson. The apply
family members include
-
apply
- apply over the margins of an array (e.g. the rows or columns of a matrix) -
lapply
- apply over an object and return list -
sapply
- apply over an object and return a simplified object (an array) if possible -
vapply
- similar tosapply
but you specify the type of object returned by the iterations
Each of these has an argument FUN
which takes a function
to apply to each element of the object. Instead of looping over
filenames
and calling analyze
, as you did
earlier, you could sapply
over filenames
with
FUN = analyze
:
R
sapply(filenames, FUN = analyze)
Deciding whether to use for
or one of the
apply
family is really personal preference. Using an
apply
family function forces to you encapsulate your
operations as a function rather than separate calls with
for
. for
loops are often more natural in some
circumstances; for several related operations, a for
loop
will avoid you having to pass in a lot of extra arguments to your
function.
Loops in R Are Slow
No, they are not! If you follow some golden rules:
- Don’t use a loop when a vectorized alternative exists
- Don’t grow objects (via
c
,cbind
, etc) during the loop - R has to create a new object and copy across the information just to add a new element or row/column - Allocate an object to hold the results and fill it in during the loop
As an example, we’ll create a new version of analyze
that will return the mean inflammation per day (column) of each
file.
R
analyze2 <- function(filenames) {
for (f in seq_along(filenames)) {
fdata <- read.csv(filenames[f], header = FALSE)
res <- apply(fdata, 2, mean)
if (f == 1) {
out <- res
} else {
# The loop is slowed by this call to cbind that grows the object
out <- cbind(out, res)
}
}
return(out)
}
system.time(avg2 <- analyze2(filenames))
OUTPUT
user system elapsed
0.022 0.000 0.022
Note how we add a new column to out
at each iteration?
This is a cardinal sin of writing a for
loop in R.
Instead, we can create an empty matrix with the right dimensions
(rows/columns) to hold the results. Then we loop over the files but this
time we fill in the f
th column of our results matrix
out
. This time there is no copying/growing for R to deal
with.
R
analyze3 <- function(filenames) {
out <- matrix(ncol = length(filenames), nrow = 40) # assuming 40 here from files
for (f in seq_along(filenames)) {
fdata <- read.csv(filenames[f], header = FALSE)
out[, f] <- apply(fdata, 2, mean)
}
return(out)
}
system.time(avg3 <- analyze3(filenames))
OUTPUT
user system elapsed
0.021 0.000 0.022
In this simple example there is little difference in the compute time
of analyze2
and analyze3
. This is because we
are only iterating over 12 files and hence we only incur 12 copy/grow
operations. If we were doing this over more files or the data objects we
were growing were larger, the penalty for copying/growing would be much
larger.
Note that apply
handles these memory allocation issues
for you, but then you have to write the loop part as a function to pass
to apply
. At its heart, apply
is just a
for
loop with extra convenience.
Key Points
- Where possible, use vectorized operations instead of
for
loops to make code faster and more concise. - Use functions such as
apply
instead offor
loops to operate on the values in a data structure.