# Loops in R

## Overview

Teaching: 30 min
Exercises: 0 min
Questions
• How can I do the same thing multiple times more efficiently in R?

• What is vectorization?

• Should I use a loop or an `apply` statement?

Objectives
• Compare loops and vectorized operations.

• Use the apply family of functions.

In R you have multiple options when repeating calculations: vectorized operations, `for` loops, and `apply` functions.

This lesson is an extension of Analyzing Multiple Data Sets. In that lesson, we introduced how to run a custom function, `analyze`, over multiple data files:

``````analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  # Read the file into a data frame; the files have no header row.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
``````
``````filenames <- list.files(path = "data", pattern = "inflammation-[0-9]{2}.csv", full.names = TRUE)
``````

### Vectorized Operations

A key difference between R and many other languages is vectorization. When you wrote the `total` function, we mentioned that R already has `sum` to do this; `sum` is much faster than the interpreted `for` loop because `sum` is coded in C to work with a vector of numbers. Many of R’s functions work this way; the loop is hidden from you in C. Learning to use vectorized operations is a key skill in R.

For example, suppose we want to add pairs of numbers contained in two vectors:

``````a <- 1:10
b <- 1:10
``````

You could loop over the pairs, adding each in turn, but that would be very inefficient in R.

The loop version is shown here for comparison. Rather than using `i in a` to make our loop variable, we use the function `seq_along` to generate an index for each element of `a`.

``````res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res
``````
`````` [1]  2  4  6  8 10 12 14 16 18 20
``````

A better approach is to use `+` directly, since it is a vectorized function that operates on entire vectors at once:

``````res2 <- a + b
all.equal(res, res2)
``````
``````[1] TRUE
``````

### Vector Recycling

When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:

``````a <- 1:10
b <- 1:5
a + b
``````
`````` [1]  2  4  6  8 10  7  9 11 13 15
``````

The elements of `a` and `b` are added together starting from the first element of both vectors. When R reaches the end of the shorter vector `b`, it starts again at the first element of `b` and continues until it reaches the last element of the longest vector `a`. This behaviour may seem crazy at first glance, but it is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector `a` by 5:

``````a <- 1:10
b <- 5
a * b
``````
`````` [1]  5 10 15 20 25 30 35 40 45 50
``````

Remember there are no scalars in R, so `b` is actually a vector of length 1; in order to multiply every element of `a` by its value, it is recycled to match the length of `a`.

When the length of the longer object is a multiple of the shorter object length (as in our example above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:

``````a <- 1:10
b <- 1:7
a + b
``````
``````Warning in a + b: longer object length is not a multiple of shorter object
length
``````
`````` [1]  2  4  6  8 10 12 14  9 11 13
``````

### `for` or `apply`?

A `for` loop is used to apply the same function calls to a collection of objects. R has a family of functions, the `apply` family, which can be used in much the same way. You’ve already used one member of the family, `apply`, in the first lesson. The `apply` family members include

• `apply` - apply over the margins of an array (e.g. the rows or columns of a matrix)
• `lapply` - apply over an object and return a list
• `sapply` - apply over an object and return a simplified object (an array) if possible
• `vapply` - similar to `sapply` but you specify the type of object returned by the iterations

Each of these has an argument `FUN` which takes a function to apply to each element of the object. Instead of looping over `filenames` and calling `analyze`, as you did earlier, you could `sapply` over `filenames` with `FUN = analyze`:

``````sapply(filenames, FUN = analyze)
``````
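The difference between the family members is easiest to see on a small input. The sketch below uses a hypothetical list `scores` and a small matrix `m`, neither of which is part of the lesson's data, to show what each function returns.

```r
# A small hypothetical list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# lapply always returns a list, one element per input element
means_list <- lapply(scores, FUN = mean)
class(means_list)

# sapply simplifies the result to a named numeric vector when it can
means_vec <- sapply(scores, FUN = mean)
means_vec

# vapply does the same, but checks each result against a template:
# here, exactly one numeric value per element
means_checked <- vapply(scores, FUN = mean, FUN.VALUE = numeric(1))
means_checked

# apply works over the margins of a matrix: 1 for rows, 2 for columns
m <- matrix(1:6, nrow = 2)
col_sums <- apply(m, 2, sum)
col_sums
```

Because `vapply` raises an error when a result does not match `FUN.VALUE`, it can be a safer choice than `sapply` inside functions, where a silently changed return type is hard to spot.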

Deciding whether to use `for` or one of the `apply` family is largely a matter of personal preference. Using an `apply` family function forces you to encapsulate your operations as a function rather than as separate calls inside a `for` loop. In some circumstances a `for` loop is more natural; for several related operations, a `for` loop saves you from having to pass a lot of extra arguments to your function.

### Loops in R Are Slow

No, they are not, as long as you follow some golden rules:

1. Don’t use a loop when a vectorized alternative exists
2. Don’t grow objects (via `c`, `cbind`, etc) during the loop - R has to create a new object and copy across the information just to add a new element or row/column
3. Allocate an object to hold the results and fill it in during the loop

As an example, we’ll create a new version of `analyze` that will return the mean inflammation per day (column) of each file.

``````analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    # Read the f-th file; the files have no header row.
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}

system.time(avg2 <- analyze2(filenames))
``````
``````   user  system elapsed
0.027   0.000   0.026
``````

Note how we add a new column to `out` at each iteration? This is a cardinal sin of writing a `for` loop in R.

Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results. Then we loop over the files but this time we fill in the `f`th column of our results matrix `out`. This time there is no copying/growing for R to deal with.

``````analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) # assuming 40 here from files
  for (f in seq_along(filenames)) {
    # Read the f-th file; the files have no header row.
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}

system.time(avg3 <- analyze3(filenames))
``````
``````   user  system elapsed
0.024   0.000   0.024
``````

In this simple example there is little difference in the compute time of `analyze2` and `analyze3`. This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.
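To see the penalty grow, the following sketch repeats the comparison without any files. It substitutes simulated columns for the inflammation data. The sizes `n_cols` and `n_rows` and the function names `grow_matrix` and `fill_matrix` are assumptions for illustration, and timings vary by machine.

```r
n_cols <- 2000  # number of simulated "files" (arbitrary)
n_rows <- 40    # values per column, matching the 40 assumed above

# Grow the result with cbind at every iteration
grow_matrix <- function(n_cols, n_rows) {
  out <- NULL
  for (f in seq_len(n_cols)) {
    res <- rep(f, n_rows)  # stand-in for the per-file column means
    out <- cbind(out, res, deparse.level = 0)
  }
  out
}

# Preallocate the result and fill in one column per iteration
fill_matrix <- function(n_cols, n_rows) {
  out <- matrix(NA_integer_, nrow = n_rows, ncol = n_cols)
  for (f in seq_len(n_cols)) {
    out[, f] <- rep(f, n_rows)
  }
  out
}

system.time(grown <- grow_matrix(n_cols, n_rows))
system.time(filled <- fill_matrix(n_cols, n_rows))

# Both strategies should produce the same values
all(grown == filled)
```

Increasing `n_cols` should widen the gap, since every `cbind` call copies all the columns accumulated so far.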

Note that `apply` handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to `apply`. At its heart, `apply` is just a `for` loop with extra convenience.
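As a sketch of that equivalence, the per-file column means can also be computed with `sapply`. To keep the example runnable without the lesson's data, it writes three small hypothetical CSV files to a temporary directory. The file names, the 5 x 4 size, and the helper name `col_means` are assumptions, not part of the lesson.

```r
# Write three small CSV files to a temporary directory (stand-ins for the data)
tmp_dir <- file.path(tempdir(), "loops-demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (k in 1:3) {
  demo <- matrix(k * (1:20), nrow = 5, ncol = 4)
  write.table(demo, file = file.path(tmp_dir, sprintf("demo-%02d.csv", k)),
              sep = ",", row.names = FALSE, col.names = FALSE)
}
demo_files <- list.files(path = tmp_dir, pattern = "^demo-[0-9]{2}\\.csv$",
                         full.names = TRUE)

# The body of the loop, written as a function of one file name
col_means <- function(filename) {
  fdata <- read.csv(file = filename, header = FALSE)
  apply(fdata, 2, mean)
}

# sapply collects the per-file vectors into a matrix, one column per file
avg_sapply <- sapply(demo_files, FUN = col_means)
dim(avg_sapply)

# The equivalent preallocated for loop
avg_loop <- matrix(nrow = 4, ncol = length(demo_files))
for (f in seq_along(demo_files)) {
  avg_loop[, f] <- col_means(demo_files[f])
}

# Same numbers either way (sapply additionally attaches dimnames)
all.equal(unname(avg_sapply), avg_loop)
```

The `sapply` version needs no knowledge of the result's dimensions up front, whereas the loop has to be told that each file yields 4 values.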

## Key Points

• Where possible, use vectorized operations instead of `for` loops to make code faster and more concise.

• Use functions such as `apply` instead of `for` loops to operate on the values in a data structure.