Programming with MATLAB

Working With Variables

Overview

Teaching: 60 min
Exercises: 5 min
Questions
  • How can I store values and do simple calculations with them?

Objectives
  • Navigate among important sections of the MATLAB environment.

  • Assign values to variables.

  • Identify what type of data is stored in MATLAB arrays.

  • Read tabular data from a file into a program.

Introduction to the MATLAB GUI

Before we can start programming, we need to know a little about the MATLAB interface. Using the default setup, the MATLAB desktop contains several important sections:

Working with variables

In this lesson we will learn how to manipulate the inflammation dataset with MATLAB. But before we discuss how to deal with many data points, we will show how to store a single value on the computer.

We can create a new variable by assigning a value to it using =:

>> weight_kg = 55;

At first glance nothing appears to have happened! We don’t get any output in the command window because we put a semi-colon after the variable assignment: this suppresses output, which is generally a good thing because it makes code run more quickly. Let’s run the command again without the semi-colon, and this time we have some output in the command window:

>> weight_kg = 55
weight_kg =
    55

A variable is just a name for a piece of data or value. Variable names must begin with a letter, and are case sensitive. They can contain also numbers or underscores. Examples of valid variable names are x, f_0 or current_temperature.

Once a variable has a value, we can print it using the disp function:

>> disp(weight_kg)
    55

or simply typing its name, followed by Enter

>> weight_kg
weight_kg =
    55

Storing single values is fine, but how can we store multiple values in the same variable? We can create an array using square brackets, separating each value with a comma:

>> a = [1, 2, 3]
a =
     1     2     3

In a similar way, we can create matrices using semi-colons to separate rows:

>> b = [a; 4, 5, 6]
b =
     1     2     3
     4     5     6

Something to bear in mind about arrays and matrices is that all values in an array must be of the same type e.g. all numbers or all strings. It is however possible to convert between data types e.g. num2str which converts numbers to a string representation.

>> num2str(a)
ans =
    '1  2  3'

So once we have a numeric value stored in a variable, we can do arithmetic with it:

>> weight_lb = 2.2 * weight_kg;
>> disp(['Weight in pounds: ', num2str(weight_lb)])
Weight in pounds: 121

That last command combines several new concepts, so let’s break it down:

The disp function takes a single argument — the value to print. So if we want to print more than one value on a single line, we can print an array of values (i.e. one argument), which we create using square brackets, and recall that an array must contain values all of the same type. In this case we convert the number to a string so that we can print an array of characters.

We can change the value of a variable by assigning it a new one:

>> weight_kg = 57.5
weight_kg =
    57.5

Assigning a value to one variable does not change the values of other variables.

For example, we just changed the value of weight_kg from 55 to 57.5, but weight_lb hasn’t changed:

>> weight_lb
weight_lb =
    121

Since weight_lb doesn’t “remember” where its value came from, it isn’t automatically updated when weight_kg changes. This is important to remember, and different from the way spreadsheets work.

Now that we know how to assign values to variables, let’s view a list of all the variables in our workspace:

>> who
Your variables are:

a  b  weight_kg  weight_lb

To remove a variable from MATLAB, use the clear command:

>> clear weight_lb
>> who
Your variables are:

a  b  weight_kg

Alternatively, we can look at the Workspace. The workspace contains all variable names and assigned values that we currently work with. As long as they pop up in the workspace, they are universally available. It’s generally a good idea to keep the workspace as clean as possible. To remove all variables from the workspace, execute the command clear on its own.

Predicting Variable Values

Predict what variables refer to what values after each statement in the following program:

>> mass = 47.5
>> age = 122
>> mass = mass * 2.0
>> age = age - 20

Solution

The first two lines assign the initial values to the variables, so mass = 47.5 and age = 122. The next line evaluates mass * 2.0 i.e. 47.5 * 2.0 = 95, then assigns the result to the variable mass. The last line evaulates age - 20 i.e. 122 - 20, then assigns the result to the variable age. So the final values are mass = 95, and age = 102.

The key point to understand here is that the expression to the right of the = sign is evaluated first, and the result is then assigned to the variable specified to the left of the = sign.

Good practices for project organisation

Before we get started, let’s create some directories to help organise this project.

Tip: Good Enough Practices for Scientific Computing

Good Enough Practices for Scientific Computing is a paper written by researchers involved with the Carpentries, which covers basic workflow skills for research computing. It recommends the following for project organization:

  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory, and files generated during clean-up and analysis in a results directory.
  4. Put source code for the project in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
  5. Name all files to reflect their content or function.

We already have a data directory in our matlab-novice-inflammation project directory, so we only need to create a results directory for this project. You can use your computer’s file browser to create this directory. We’ll save all our scripts and function files in the main project directory.

A final step is to set the current folder in MATLAB to our project folder. Use the Current Folder window in the MATLAB GUI to browse to your project folder (matlab-novice-inflammation).

In order to check the current directory, we can run pwd (print working directory). A second check we can do is to run the ls (list) command in the Command Window to list the contents of the working directory — we should get the following output:

data   results

Reading data from files and writing data to them are essential tasks in scientific computing, and admittedly, something that we’d rather not spend a lot of time thinking about. Fortunately, MATLAB comes with a number of high-level tools to do these things efficiently, sparing us the grisly detail.

If we know what our data looks like (in this case, we have a matrix stored as comma-separated values) and we’re unsure about what command we want to use, we can search the documentation. Type read matrix into the documentation toolbar. MATLAB suggests using readmatrix. If we have a closer look at the documentation, MATLAB also tells us, which in- and output arguments this function has.

To load the data from our CSV file into MATLAB, type the following command into the MATLAB command window, and press Enter:

>> patient_data = readmatrix('data/inflammation-01.csv');

This loads the data and assigns it to a variable, patient_data. This is a good example of when to use a semi-colon to suppress output — try re-running the command without the semi-colon to find out why. You should see a wall of numbers printed, which is the data from the file.

>> patient_data = readmatrix('data/inflammation-01.csv')

The expression readmatrix(...) is a function call. Functions generally need arguments to run. In the case of the readmatrix function, we need to provide a single argument: the name of the file we want to read data from. This argument needs to be a character string or string, so we put it in quotes.

Now that our data is in memory, we can start doing things with it. First, let’s find out its size:

>> size(patient_data)
ans =

    60 40

The output tells us that the variable patient_data refers to a table of values that has 60 rows and 40 columns.

MATLAB stores all data in the form of multi-dimensional arrays. For example:

Normally, MATLAB arrays can’t store elements of different data types. For instance, a MATLAB array can’t store both a float and a char. To do that, you have to use a Cell Array.

We can use the class function to find out what type of data lives inside an array:

>> class(patient_data)
ans =
    'double'

This output tells us that patient_data contains double precision floating-point numbers. This is the default numeric data type in MATLAB. If you want to store other numeric data types, you need to tell MATLAB explicitly. For example, the command,

>> x = int16(325);

assigns the value 325 to the name x, storing it as a 16-bit signed integer.

Key Points

  • MATLAB stores data in arrays.

  • Use readmatrix to read tabular CSV data into a program.


Arrays

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How can I access subsets of data?

Objectives
  • Select individual values and subsections from data.

Array indexing

Now that we understand what kind of data can be stored in an array, we need to learn the proper syntax for working with arrays in MATLAB. To do this we will begin by discussing array indexing, which is the method by which we can select one or more different elements of an array. A solid understanding of array indexing will greatly assist our ability to organize our data.

Let’s start by creating an 8-by-8 “magic” Matrix:

>> M = magic(8)
ans =

   64    2    3   61   60    6    7   57
    9   55   54   12   13   51   50   16
   17   47   46   20   21   43   42   24
   40   26   27   37   36   30   31   33
   32   34   35   29   28   38   39   25
   41   23   22   44   45   19   18   48
   49   15   14   52   53   11   10   56
    8   58   59    5    4   62   63    1

We want to access a single value from the matrix:

Accessing a single value

To do that, we must provide its index in parentheses:

>> M(5, 6)
ans = 38

Indices are provided as (row, column). So the index (5, 6) selects the element on the fifth row and sixth column.

An index like (5, 6) selects a single element of an array, but we can also access sections of the matrix, or slices. To access a row of values:

Accessing a single value

we can do:

>> M(5, :)
ans =

   32   34   35   29   28   38   39   25

Providing : as the index for a dimension selects all elements along that dimension. So, the index (5, :) selects the elements on row 5, and all columns—effectively, the entire row. We can also select multiple rows,

Accessing multiple rows

>> M(1:4, :)
ans =

   64    2    3   61   60    6    7   57
    9   55   54   12   13   51   50   16
   17   47   46   20   21   43   42   24
   40   26   27   37   36   30   31   33

and columns:

Accessing multiple columns

>> M(:, 6:end)
ans =

    6    7   57
   51   50   16
   43   42   24
   30   31   33
   38   39   25
   19   18   48
   11   10   56
   62   63    1

To select a submatrix,

Accessing a submatrix

we have to take slices in both dimensions:

>> M(4:6, 5:7)
ans =

   36   30   31
   28   38   39
   45   19   18

We don’t have to take all the values in the slice—if we provide a stride. Let’s say we want to start with row 2, and subsequently select every third row:

Accessing strided columns

>> M(2:3:end, :)
ans =

    9   55   54   12   13   51   50   16
   32   34   35   29   28   38   39   25
    8   58   59    5    4   62   63    1

And we can also select values in a “checkerboard”,

Accessing strided rows and columns

by taking appropriate strides in both dimensions:

>> M(1:3:end, 2:2:end)
ans =

    2   61    6   57
   26   37   30   33
   15   52   11   56

Slicing

A subsection of an array is called a slice. We can take slices of character strings as well:

>> element = 'oxygen';
>> disp(['first three characters: ', element(1:3)])
>> disp(['last three characters: ', element(4:6)])
first three characters: oxy
last three characters: gen
  1. What is the value of element(4:end)? What about element(1:2:end)? Or element(2:end - 1)?

  2. For any size array, MATLAB allows us to index with a single colon operator (:). This can have surprising effects. For instance, compare element with element(:). What is size(element) versus size(element(:))? Finally, try using the single colon on the matrix M above: M(:). What seems to be happening when we use the single colon operator for slicing?

Solution

  1. Exercises using slicing

     element(4:end)   % Select all elements from 4th to last
     ans =
         'gen'
     element(1:2:end) % Select every other element starting at first
     ans =
         'oye
     element(2:end-1) % Select elements starting with 2nd, until last-but-one
     ans =
         'xyge'
    
  2. The colon operator ‘flattens’ a vector or matrix into a column vector. The order of the elements in the resulting vector comes from appending each column of the original array in turn. Have a look at the order of the values in M(:) vs M

Key Points

  • M(row, column) indices are used to select data points

  • : is used to take slices of data


Plotting data

Overview

Teaching: 30 min
Exercises: 5 min
Questions
  • How can I process and visualize my data?

Objectives
  • Perform operations on arrays of data.

  • Display simple graphs.

Analyzing patient data

Now that we know how to access data we want to compute with, we’re ready to analyze patient_data. MATLAB knows how to perform common mathematical operations on arrays. If we want to find the average inflammation for all patients on all days, we can just ask for the mean of the array:

>> mean(patient_data(:))
ans = 6.1487

We couldn’t just do mean(patient_data) because, that would compute the mean of each column in our table, and return an array of mean values. The expression patient_data(:) flattens the table into a one-dimensional array.

To get details about what a function, like mean, does and how to use it, we can search the documentation, or use MATLAB’s help command.

>> help mean
mean   Average or mean value.
    S = mean(X) is the mean value of the elements in X if X is a vector. 
    For matrices, S is a row vector containing the mean value of each 
    column. 
    For N-D arrays, S is the mean value of the elements along the first 
    array dimension whose size does not equal 1.
 
    mean(X,DIM) takes the mean along the dimension DIM of X.
 
    S = mean(...,TYPE) specifies the type in which the mean is performed, 
    and the type of S. Available options are:
 
    'double'    -  S has class double for any input X
    'native'    -  S has the same class as X
    'default'   -  If X is floating point, that is double or single,
                   S has the same class as X. If X is not floating point, 
                   S has class double.
 
    S = mean(...,NANFLAG) specifies how NaN (Not-A-Number) values are 
    treated. The default is 'includenan':
 
    'includenan' - the mean of a vector containing NaN values is also NaN.
    'omitnan'    - the mean of a vector containing NaN values is the mean 
                   of all its non-NaN elements. If all elements are NaN,
                   the result is NaN.
 
    Example:
        X = [1 2 3; 3 3 6; 4 6 8; 4 7 7]
        mean(X,1)
        mean(X,2)
 
    Class support for input X:
       float: double, single
       integer: uint8, int8, uint16, int16, uint32,
                int32, uint64, int64
 
    See also median, std, min, max, var, cov, mode.

We can also compute other statistics, like the maximum, minimum and standard deviation.

>> disp('Maximum inflammation:')
>> disp(max(patient_data(:)))
>> disp('Minimum inflammation:')
>> disp(min(patient_data(:)))
>> disp('Standard deviation:')
>> disp(std(patient_data(:)))
Maximum inflammation:
20
Minimum inflammation:
0
Standard deviation:
4.6148

When analyzing data though, we often want to look at partial statistics, such as the maximum value per patient or the average value per day. One way to do this is to assign the data we want to a new temporary array, then ask it to do the calculation:

>> patient_1 = patient_data(1, :)
>> disp('Maximum inflation for patient 1:')
>> disp(max(patient_1))
Maximum inflation for patient 1:
18

We don’t actually need to store the row in a variable of its own. Instead, we can combine the selection and the function call:

>> max(patient_data(1, :))
ans = 18

What if we need the maximum inflammation for all patients, or the average for each day? As the diagram below shows, we want to perform the operation across an axis:

Operations Across Axes

To support this, MATLAB allows us to specify the dimension we want to work on. If we ask for the average across the dimension 1, we’re asking for one summary value per column, which is the average of all the rows. In other words, we’re asking for the average inflammation per day for all patients.

>> mean(patient_data, 1)
ans =
  Columns 1 through 10
         0    0.5833    0.9833    1.6667    2.5333    3.0667    3.4667    3.9000    5.2333    5.1833
  Columns 11 through 20
    6.2167    6.4000    7.2167    8.0500    8.8667    9.6333   10.1833   10.1500   10.4167    9.8667
  Columns 21 through 30
   12.2000   12.1000   11.1833   11.0333   10.0000    8.5167    8.3833    8.5167    8.1167    6.2500
  Columns 31 through 40
    5.7667    5.6000    5.1333    3.9000    3.7000    2.7833    2.5500    1.2833    0.9667    0.6000

If we average across axis 2, we’re asking for one summary value per row, which is the average of all the columns. In other words, we’re asking for the average inflammation per patient over all the days:

>> mean(patient_data, 2)
ans =

   5.4500
   5.4250
   6.1000
   5.9000
   5.5500
   6.2250
   5.9750
   6.6500
   6.6250
   6.5250
   6.7750
   5.8000
   6.2250
   5.7500
   5.2250
   6.3000
   6.5500
   5.7000
   5.8500
   6.5500
   5.7750
   5.8250
   6.1750
   6.1000
   5.8000
   6.4250
   6.0500
   6.0250
   6.1750
   6.5500
   6.1750
   6.3500
   6.7250
   6.1250
   7.0750
   5.7250
   5.9250
   6.1500
   6.0750
   5.7500
   5.9750
   5.7250
   6.3000
   5.9000
   6.7500
   5.9250
   7.2250
   6.1500
   5.9500
   6.2750
   5.7000
   6.1000
   6.8250
   5.9750
   6.7250
   5.7000
   6.2500
   6.4000
   7.0500
   5.9000

We can quickly check the size of this array:

>> size(mean(patient_data, 2))
ans =
    60    1

The size tells us we have a 60-by-1 vector, confirming that this is the average inflammation per patient over all days in the trial.

Plotting

The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of MATLAB here.

Let’s display a heat map of our data:

>> imagesc(patient_data)
>> title('Inflammation')
>> xlabel('Day of trial')
>> ylabel('Patient number')

Heat map

The imagesc function represents the matrix as a color image. Every value in the matrix is mapped to a color. Blue regions in this heat map are low values, while yellow shows high values. As we can see, inflammation rises and falls over a 40 day period.

It’s good practice to give the figure a title, and to label the axes using xlabel and ylabel so that other people can understand what it shows (including us if we return to this plot 6 months from now).

Let’s take a look at the average inflammation over time:

>> plot(mean(patient_data, 1))
>> title('Daily average inflammation')
>> xlabel('Day of trial')
>> ylabel('Inflammation')

Average inflammation

Here, we have calculated the average per day across all patients then used the plot function to display a line graph of those values. The result is roughly a linear rise and fall, which is suspicious: based on other studies, we expect a sharper rise and slower fall. Let’s have a look at two other statistics: the maximum and minimum inflammation per day across all patients.

>> plot(max(patient_data, [], 1))
>> title('Maximum inflammation per day')
>> ylabel('Inflammation')
>> xlabel('Day of trial')

Maximum inflammation

>> plot(min(patient_data, [], 1))
>> title('Minimum inflammation per day')
>> ylabel('Inflammation')
>> xlabel('Day of trial')

Minumum inflammation

Like mean(), the functions max() and min() can also operate across a specified dimension of the matrix. However, the syntax is slightly different. To see why, run a help on each of these functions.

From the figures, we see that the maximum value rises and falls perfectly smoothly, while the minimum seems to be a step function. Neither result seems particularly likely, so either there’s a mistake in our calculations or something is wrong with our data.

Plots

When we plot just one variable using the plot command e.g. plot(Y) or plot(mean(patient_data, 1)), what do the x-values represent?

Solution

The x-values are the indices of the y-data, so the first y-value is plotted against index 1, the second y-value against 2 etc.

Why are the vertical lines in our plot of the minimum inflammation per day not perfectly vertical?

Solution

MATLAB interpolates between the points on a 2D line plot.

Create a plot showing the standard deviation of the inflammation data for each day across all patients. Hint: search the documentation for standard deviation

Solution

>> plot(std(patient_data, 0, 1))
>> xlabel('Day of trial')
>> ylabel('Inflammation')
>> title('Standard deviation across all patients')

It is often convenient to combine multiple plots into one figure using the subplotcommand which plots our graphs in a grid pattern. The first two parameters describe the grid we want to use, while the third parameter indicates the placement on the grid.

>> subplot(1, 2, 1)
>> plot(max(patient_data, [], 1))
>> ylabel('max')
>> xlabel('day')

>> subplot(1, 2, 2)
>> plot(min(patient_data, [], 1))
>> ylabel('min')
>> xlabel('day')

Max Min subplot

Our work so far has convinced us that something is wrong with our first data file. We would like to check the other 11 the same way, but typing in the same commands repeatedly is tedious and error-prone. Since computers don’t get bored (that we know of), we should create a way to do a complete analysis with a single command, and then figure out how to repeat that step once for each file. These operations are the subjects of the next two lessons.

Key Points

  • Use plot to visualize data.


Writing MATLAB Scripts

Overview

Teaching: 35 min
Exercises: 0 min
Questions
  • How can I save and re-use my programs?

Objectives
  • Write and save MATLAB scripts.

  • Save MATLAB plots to disk.

So far, we’ve typed in commands one-by-one on the command line to get MATLAB to do things for us. But what if we want to repeat our analysis? Sure, it’s only a handful of commands, and typing them in shouldn’t take us more than a few minutes. But if we forget a step or make a mistake, we’ll waste time rewriting commands. Also, we’ll quickly find ourselves doing more complex analyses, and we’ll need our results to be more easily reproducible.

In addition to running MATLAB commands one-by-one on the command line, we can also write several commands in a script. A MATLAB script is just a text file with a .m extension. We’ve written commands to load data from a .csv file and display some plots of statistics about that data. Let’s put those commands in a script called plot_patient1.m, which we’ll save in our current directory,matlab-novice-inflammation.

To create a new script in the current directory, we use

edit plot_patient1.m

then we type the contents of the script:

patient_data = readmatrix('data/inflammation-01.csv');

% Plot average inflammation per day
figure
plot(mean(patient_data, 1))
title('Daily average inflammation')
xlabel('Day of trial')
ylabel('Inflammation')

Note that we are explicitly creating a new figure window using the figure command.

Try this on the command line:

figure

MATLAB’s plotting commands only create a new figure window if one doesn’t already exist: the default behaviour is to reuse the current figure window as we saw in the previous episode. Explicitly creating a new figure window in the script avoids any unexpected results from plotting on top of existing figures.

You can get MATLAB to run the commands in the script by typing in the name of the script (without the .m) in the MATLAB command line:

plot_patient1

The MATLAB path

MATLAB knows about files in the current directory, but if we want to run a script saved in a different location, we need to make sure that this file is visible to MATLAB. We do this by adding directories to the MATLAB path. The path is a list of directories MATLAB will search through to locate files.

To add a directory to the MATLAB path, we go to the Home tab, click on Set Path, and then on Add with Subfolders.... We navigate to the directory and add it to the path to tell MATLAB where to look for our files. When you refer to a file (either code or data), MATLAB will search all the directories in the path to find it. Alternatively, for data files, we can provide the relative or absolute file path.

GNU Octave

Octave has only recently gained a MATLAB-like user interface. To change the path in any version of Octave, including command-line-only installations, use addpath('path/to/directory')

In this script, let’s save the figures to disk as image files using the print command. In order to maintain an organised project we’ll save the images in the results directory:

% Plot average inflammation per day
figure
plot(mean(patient_data, 1))
title('Daily average inflammation')
xlabel('Day of trial')
ylabel('Inflammation')

% Save plot in 'results' folder as png image:
print('results/average','-dpng')

Help text

You might have noticed that we described what we want our code to do using the percent sign: %. This is another plus of writing scripts: you can comment your code to make it easier to understand when you come back to it after a while.

A comment can appear on any line, but be aware that the first line or block of comments in a script or function is used by MATLAB as the help text. When we use the help command, MATLAB returns the help text. The first help text line (known as the H1 line) typically includes the name of the program, and a brief description. The help command works in just the same way for our own programs as for built-in MATLAB functions. You should write help text for all of your own scripts and functions.

Let’s write an H1 line at the top of our script:

%PLOT_PATIENT1   Save plots of inflammation statistics to disk.

We can then get help for our script by running

help plot_patient1

Let’s modify our plot_patient1 script so that it creates and saves sub-plots, rather than individual plots. As before we’ll save the images in the results directory.

%PLOT_PATIENT1   Save plots of inflammation statistics to disk.

patient_data = readmatrix('data/inflammation-01.csv');

% Plot inflammation stats for first patient
figure
subplot(1, 3, 1)
plot(mean(patient_data, 1))
title('Average')
ylabel('Inflammation')
xlabel('Day')

subplot(1, 3, 2)
plot(max(patient_data, [], 1))
title('Max')
ylabel('Inflammation')
xlabel('Day')

subplot(1, 3, 3)
plot(min(patient_data, [], 1))
title('Min')
ylabel('Inflammation')
xlabel('Day')

% Save plot in 'results' directory as png image.
print('results/inflammation-01','-dpng')

When saving plots to disk, it’s sometimes useful to turn off their visibility as MATLAB plots them. For example, we might not want to view (or spend time closing) the figures in MATLAB, and not displaying the figures could make the script run faster.

Let’s add a couple of lines of code to do this:

%PLOT_PATIENT1   Save plots of inflammation statistics to disk.

patient_data = readmatrix('data/inflammation-01.csv');

% Plot inflammation stats for first patient
figure('visible', 'off')

subplot(1, 3, 1)
plot(mean(patient_data, 1))
title('Average')
ylabel('inflammation')
xlabel('Day')

subplot(1, 3, 2)
plot(max(patient_data, [], 1))
title('Max')
ylabel('Inflammation')
xlabel('Day')

subplot(1, 3, 3)
plot(min(patient_data, [], 1))
title('Min')
ylabel('Inflammation')
xlabel('Day')

% Save plot in 'results' directory as png image.
print('results/inflammation-01','-dpng')

close()

We can ask MATLAB to create an empty figure window without displaying it by setting its 'visible' property to 'off', like so:

figure('visible', 'off')

When we do this, we have to be careful to manually “close” the figure after we are doing plotting on it - the same as we would “close” an actual figure window if it were open:

close()

Key Points

  • Save MATLAB code in files with a .m suffix.


Repeating With Loops

Overview

Teaching: 40 min
Exercises: 10 min
Questions
  • How can I repeat the same operations on multiple values?

Objectives
  • Explain what a for loop does.

  • Correctly write for loops that repeat simple commands.

  • Trace changes to a loop variable as the loops runs.

  • Use a for loop to process multiple files

Recall that we have to do this analysis for every one of our dozen datasets, and we need a better way than typing out commands for each one, because we’ll find ourselves writing a lot of duplicate code. Remember, code that is repeated in two or more places will eventually be wrong in at least one. Also, if we make changes in the way we analyze our datasets, we have to introduce that change in every copy of our code. To avoid all of this repetition, we have to teach MATLAB to repeat our commands, and to do that, we have to learn how to write loops.

Suppose we want to print each character in the word “lead” on a line of its own. One way is to use four disp statements:

%LOOP_DEMO   Demo script to explain loops

word = 'lead';

disp(word(1))
disp(word(2))
disp(word(3))
disp(word(4))
l
e
a
d

But this is a bad approach for two reasons:

  1. It doesn’t scale: if we want to print the characters in a string that’s hundreds of letters long, we’d be better off typing them in.

  2. It’s fragile: if we change word to a longer string, it only prints part of the data, and if we change it to a shorter one, it produces an error, because we’re asking for characters that don’t exist.

%LOOP_DEMO   Demo script to explain loops

word = 'tin';

disp(word(1))
disp(word(2))
disp(word(3))
disp(word(4))
error: A(I): index out of bounds; value 4 out of bound 3

There’s a better approach:

%LOOP_DEMO   Demo script to explain loops

word = 'lead';

for letter = 1:4
    disp(word(letter))
end
l
e
a
d

This improved version uses a for loop to repeat an operation—in this case, printing to the screen—once for each element in an array.

The general form of a for loop is:

for variable = collection
    do things with variable
end

The for loop executes the commands in the loop body for every value in the array collection. This value is called the loop variable, and we can call it whatever we like. In our example, we gave it the name letter.

We have to terminate the loop body with the end keyword, and we can have as many commands as we like in the loop body. But, we have to remember that they will all be repeated as many times as there are values in collection.

Our for loop has made our code more scalable, and less fragile. There’s still one little thing about it that should bother us. For our loop to deal appropriately with shorter or longer words, we have to change the first line of our loop by hand:

%LOOP_DEMO   Demo script to explain loops

word = 'tin';

for letter = 1:3
    disp(word(letter))
end
t
i
n

Although this works, it’s not the best way to write our loop:

Fortunately, MATLAB provides us with a convenient function to write a better loop:

%LOOP_DEMO   Demo script to explain loops

word = 'aluminum';

for letter = 1:length(word)
    disp(word(letter))
end
a
l
u
m
i
n
u
m

This is much more robust code, as it can deal identically with words of arbitrary length. Loops are not only for working with strings, they allow us to do repetitive calculations regardless of data type. Here’s another loop that calculates the sum of all even numbers between 1 and 10:

%LOOP_DEMO   Demo script to explain loops

total = 0;
for even_number = 2 : 2 : 10
    total = total + even_number;
end

disp('The sum of all even numbers between 1 and 10 is:')
disp(total)

It’s worth tracing the execution of this little program step by step.

The debugger

We can use the MATLAB debugger to trace the execution of a program.

The first step is to set a break point by clicking just to the right of a line number on the - symbol. A red circle will appear — this is the break point, and when we run the script, MATLAB will pause execution at that line.

A green arrow appears, pointing to the next line to be run. To continue running the program one line at a time, we use the step button.

We can then inspect variables in the workspace or by hovering the cursor over where they appear in the code, or get MATLAB to evaluate expressions in the command window (notice the prompt changes to K>>).

This process is useful to check your understanding of a program, in order to correct mistakes.

This process is illustrated below: debugger-demo

Since we want to sum only even numbers, the loop index even_number starts at 2 and increases by 2 with every iteration. When we enter the loop, total is zero - the value assigned to it beforehand. The first time through, the loop body adds the value of the first even number (2) to the old value of total (0), and updates total to refer to that new value. On the next loop iteration, even_number is 4 and the initial value of total is 2, so the new value assigned to total is 6. After even_number reaches the final value (10), total is 30; since this is the end of the range for even_number the loop finishes and the disp statements give us the final answer.

Note that a loop variable is just a variable that’s being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:

>> disp(even_number)
10

Performing Exponentiation

MATLAB uses the caret (^) to perform exponentiation:

>> disp(5^3)
125

You can also use a loop to perform exponentiation. Remember that b^x is just b*b*b*x times.

Let a variable b be the base of the number and x the exponent. Write a loop to compute b^x. Check your result for b = 4 and x = 5.

Solution

% Loop to perform exponentiation
b = 4;    % base
x = 5;    % exponent

result=1;
for i = 1:x
    result = result * b;
end

disp([num2str(b), '^', num2str(x), ' = ', num2str(result)])

Incrementing with Loops

Write a loop that spells the word “aluminum,” adding one letter at a time:

a
al
alu
alum
alumi
alumin
aluminu
aluminum

Solution

% spell a string adding one letter at a time using a loop

word = 'aluminium';

for letter = 1:length(word)
    disp(word(1:letter))
end

Looping in Reverse

In MATLAB, the colon operator (:) accepts a stride or skip argument between the start and stop:

>> disp(1:3:11)
1 4 7 10
>> disp(11:-3:1)
11 8 5 2

Using this, write a loop to print the letters of “aluminum” in reverse order, one letter per line.

m
u
n
i
m
u
l
a

Solution

% Spell a string in reverse using a loop

word = 'aluminium';

for letter = length(word):-1:1
    disp(word(letter))
end

Analyzing patient data from multiple files

We now have almost everything we need to process multiple data files using a loop and the plotting code in our plot_patient1 script.

We still need to generate a list of data files to process, and then we can use a loop to repeat the analysis for each file.

We can use the dir command to return a structure array containing the names of the files in the data directory. Each element in this structure array is a structure, containing information about a single file in the form of named fields.

>> files = dir('data/inflammation-*.csv')
files =
  12×1 struct array with fields:
    name
    folder
    date
    bytes
    isdir
    datenum

To access the name field of the first file, we can use the following syntax:

>> filename = files(1).name;
>> disp(filename)
inflammation-01.csv

To get the modification date of the third file, we can do:

>> mod_date = files(3).date;
>> disp(mod_date)
26-Jul-2015 22:24:31

A good first step towards processing multiple files is to write a loop which prints the name of each of our files. Let’s write this in a script plot_all.m which we will then develop further:

%PLOT_ALL	Developing code to automate inflammation analysis

files = dir('data/inflammation-*.csv');

for i = 1:length(files)
	file_name = files(i).name;
	disp(file_name)
end
>> plot_all
inflammation-01.csv
inflammation-02.csv
inflammation-03.csv
inflammation-04.csv
inflammation-05.csv
inflammation-06.csv
inflammation-07.csv
inflammation-08.csv
inflammation-09.csv
inflammation-10.csv
inflammation-11.csv
inflammation-12.csv

Another task is to generate the file names for the figures we’re going to save. Let’s name the output file after the data file used to generate the figure. So for the data set inflammation-01.csv we will call the figure inflammation-01.png. We can use the replace command for this purpose.

The syntax for the replace command is like this:

NEWSTR = replace(STR, OLD, NEW)

So for example if we have the string big_shark and want to get the string terror_shark, we can execute the following command:

>> new_string = replace('big_shark', 'big', 'terror');
>> disp(new_string)
terror_shark

GNU Octave

In Octave, the replace function doesn’t exist, but the strrep function is a direct replacement. The above example becomes

>> new_string = strep('big_shark', 'big', 'terror')
terror_shark

Recall that we’re saving our figures to the results directory. The best way to generate a path to a file in MATLAB is by using the fullfile command. This generates a file path with the correct separators for the platform you’re using (i.e. forward slash for Linux and macOS, and backslash for Windows). This makes your code more portable which is great for collaboration.

Putting these concepts together, we can now generate the paths for the data files, and the image files we want to save:

%PLOT_ALL	Developing code to automate inflammation analysis

files = dir('data/inflammation-*.csv');

for i = 1:length(files)
    file_name = files(i).name;

    % Generate string for image name
    img_name = replace(file_name, '.csv', '.png');

    % Generate path to data file and image file
    file_name = fullfile('data', file_name);
    img_name = fullfile('results',img_name);

    disp(file_name)
    disp(img_name)
end
data/inflammation-01.csv
results/inflammation-01.png
data/inflammation-02.csv
results/inflammation-02.png
data/inflammation-03.csv
results/inflammation-03.png
data/inflammation-04.csv
results/inflammation-04.png
data/inflammation-05.csv
results/inflammation-05.png
data/inflammation-06.csv
results/inflammation-06.png
data/inflammation-07.csv
results/inflammation-07.png
data/inflammation-08.csv
results/inflammation-08.png
data/inflammation-09.csv
results/inflammation-09.png
data/inflammation-10.csv
results/inflammation-10.png
data/inflammation-11.csv
results/inflammation-11.png
data/inflammation-12.csv
results/inflammation-12.png

We’re now ready to modify plot_all.m to actually process multiple data files:

%PLOT_ALL   Print statistics for all patients.
%           Save plots of statistics to disk.

files = dir('data/inflammation-*.csv');

% Process each file in turn
for i = 1:length(files)
    file_name = files(i).name;

    % Generate strings for image names:
    img_name  = replace(file_name, '.csv', '.png');

    % Generate path to data file and image file
    file_name = fullfile('data', file_name);
    img_name  = fullfile('results', img_name);

    patient_data = readmatrix(file_name);

    % Create figures
    figure('visible', 'off')

    subplot(2, 2, 1)
    plot(mean(patient_data, 1))
    title('Average')
    ylabel('Inflammation')
    xlabel('Day')

    subplot(2, 2, 2)
    plot(max(patient_data, [], 1))
    title('Max')
    ylabel('Inflammation')
    xlabel('Day')

    subplot(2, 2, 3)
    plot(min(patient_data, [], 1))
    title('Min')
    ylabel('Inflammation')
    xlabel('Day')

    print(img_name, '-dpng')
    close()
end

We run the modified script using its name in the Command Window:

>> plot_all

The first three figures output to the results directory are as shown below:

Sure enough, the maxima of these data sets show exactly the same ramp as the first, and their minima show the same staircase structure.

We’ve now automated the analysis and have confirmed that all the data files we have looked at show the same artifact. This is what we set out to test, and now we can just call one script to do it. With minor modifications, this script could be re-used to check all our future data files.

Key Points

  • Use for to create a loop that repeats one or more operations.


Making Choices

Overview

Teaching: 35 min
Exercises: 5 min
Questions
  • How can programs do different things for different data values?

Objectives
  • Construct a conditional statement using if, elseif, and else

  • Test for equality within a conditional statement

  • Combine conditional tests using AND and OR

  • Build a nested loop

Our previous lessons have shown us how to manipulate data and repeat things. However, the programs we have written so far always do the same things, regardless of what data they’re given. We want programs to make choices based on the values they are manipulating.

The tool that MATLAB gives us for doing this is called a conditional statement, and it looks like this:

num = 37;

if num > 100
    disp('greater')
else
    disp('not greater')
end

disp('done')
not greater
done

The second line of this code uses the keyword if to tell MATLAB that we want to make a choice. If the test that follows is true, the body of the if (i.e., the lines between if and else) are executed. If the test is false, the body of the else (i.e., the lines between else and end) are executed instead. Only one or the other is ever executed.

Conditional statements don’t have to have an else block. If there isn’t one, MATLAB simply doesn’t do anything if the test is false:

num = 53;
disp('before conditional...')

if num > 100
    disp('53 is greater than 100')
end

disp('...after conditional')
before conditional...
...after conditional

We can also chain several tests together using elseif. This makes it simple to write a script that gives the sign of a number:

%CONDITIONAL_DEMO   Demo script to illustrate use of conditionals

num = 53;

if num > 0
    disp('num is positive')
elseif num == 0
    disp('num is zero')
else
    disp('num is negative')
end

One important thing to notice in the code above is that we use a double equals sign == to test for equality rather than a single equals sign. This is because the latter is used to mean assignment. In our test, we want to check for the equality of num and 0, not assign 0 to num. This convention was inherited from C, and it does take a bit of getting used to…

During a conditional statement, if one of the conditions is true, this marks the end of the test: no subsequent conditions will be tested and execution jumps to the end of the conditional.

Let’s demonstrate this by adding another condition which is true.

% Demo script to illustrate use of conditionals
num = 53;

if num > 0
    disp('num is positive')
elseif num == 0
    disp('num is zero')
elseif num > 50
    % This block will never be executed
    disp('num is greater than 50')
else
    disp('num is negative')
end

We can also combine tests, using && (and) and || (or). && is true if both tests are true:

if ((1 > 0) && (-1 > 0))
    disp('both parts are true')
else
    disp('one part is not true')
end
one part is not true

|| is true if either test is true:

if (1 < 0) || (3 < 4)
    disp('at least one part is true')
end
at least one part is true

In this case, “either” means “either or both”, not “either one or the other but not both”.

True and False Statements

The conditions we have tested above evaluate to a logical value: true or false. However these numerical comparison tests aren’t the only values which are true or false in MATLAB. For example, 1 is considered true and 0 is considered false. In fact, any value can be used in a conditional statement.

Run the code below in order to discover which values are considered true and which are considered false.

if ''
    disp('empty string is true')
end

if 'foo'
    disp('non empty string is true')
end

if []
    disp ('empty array is true')
end

if [22.5, 1.0]
    disp ('non empty array is true')
end

if [0, 0]
    disp ('array of zeros is true')
end

if true
    disp('true is true')
end

Close Enough

Write a script called near that performs a test on two variables, and displays 1 when the first variable is within 10% of the other and 0 otherwise. Compare your implementation with your partner’s: do you get the same answer for all possible pairs of numbers?

Solution

%NEAR   Display 1 if variable a is within 10% of variable b
%       and display 0 otherwise
a = 1.1;
b = 1.2;

if a/b >= 0.9 && a/b <= 1.1
    disp(1)
else
    disp(0)
end

Another thing to realize is that if statements can also be combined with loops. For example, if we want to sum the positive numbers in a list, we can write this:

numbers = [-5, 3, 2, -1, 9, 6];
total = 0;

for n = numbers
    if n >= 0
        total = total + n;
    end
end

disp(['sum of positive values: ', num2str(total)])
sum of positive values: 20

With a little extra effort, we can calculate the positive and negative sums in a loop:

pos_total = 0;
neg_total = 0;

for n = numbers
    if n >= 0
        pos_total = pos_total + n;
    else
        neg_total = neg_total + n;
    end
end

disp(['sum of positive values: ', num2str(pos_total)])
disp(['sum of negative values: ', num2str(neg_total)])
sum of positive values: 26
sum of negative values: -6

We can even put one loop inside another:

for number = 1:3
    for letter = 'ab'
        disp([num2str(number), letter])
    end
end
1a
1b
2a
2b
3a
3b

Nesting

Will changing the order of nesting in the above loop change the output? Why? Write down the output you might expect from changing the order of the loops, then rewrite the code to test your hypothesis.

Solution

for letter = 'ab'
    for number = 1:3
        disp([num2str(number), letter])
    end
end

Reordering the nested loops changes the output. In the new code, the number loop happens within the letter loop, so while letter = a, number takes the values 1, 2, and 3 in turn.

Currently, our script plot_all.m reads in data, analyzes it, and saves plots of the results. If we would rather display the plots interactively, we would have to remove (or comment out) the following code:

print(img_name,'-dpng')
close()

And, we’d also have to change this line of code, from:

figure('visible', 'off')

to:

figure('visible', 'on')
% or equivalently: figure()

This is not a lot of code to change every time, but it’s still work that’s easily avoided using conditionals. Here’s our script re-written to use conditionals to switch between saving plots as images and plotting them interactively:

%PLOT_ALL  Save plots of statistics to disk.
%          Use variable plot_switch to control interactive plotting
%          vs saving images to disk.
%            plot_switch = 0: show plots interactively
%            plot_switch = 1: save plots to disk

plot_switch = 0;

files = dir('data/inflammation-*.csv');

% Process each file in turn
for i = 1:length(files)
    file_name = files(i).name;

    % Generate strings for image names:
    img_name = replace(file_name, '.csv', '.png');

    % Generate path to data file and image file
    file_name = fullfile('data', filename);
    img_name  = fullfile('results', img_name);

    patient_data = readmatrix(file_name);

    % Create figures
    if plot_switch == 1
        figure('visible', 'off')
    else
        figure('visible', 'on')
    end

    subplot(2, 2, 1)
    plot(mean(patient_data, 1))
    title('Average')
    ylabel('Inflammation')
    xlabel('Day')

    subplot(2, 2, 2)
    plot(max(patient_data, [], 1))
    title('Max')
    ylabel('Inflammation')
    xlabel('Day')

    subplot(2, 2, 3)
    plot(min(patient_data, [], 1))
    title('Min')
    ylabel('Inflammation')
    xlabel('Day')

    if plot_switch == 1
        print(img_name, '-dpng')
        close()
    end

end

Key Points

  • Use if and else to make choices based on values in your program.


Creating Functions

Overview

Teaching: 45 min
Exercises: 20 min
Questions
  • How can I teach MATLAB how to do new things?

Objectives
  • Compare and contrast MATLAB function files with MATLAB scripts.

  • Define a function that takes arguments.

  • Test a function.

  • Recognize why we should divide programs into small, single-purpose functions.

If we only had one data set to analyze, it would probably be faster to load the file into a spreadsheet and use that to plot some simple statistics. But we have twelve files to check, and may have more in future. In this lesson, we’ll learn how to write a function so that we can repeat several operations with a single command.

Let’s start by defining a function fahr_to_kelvin that converts temperatures from Fahrenheit to Kelvin:

function ktemp = fahr_to_kelvin(ftemp)
    %FAHR_TO_KELVIN   Convert Fahrenheit to Kelvin

    ktemp = ((ftemp - 32) * (5/9)) + 273.15;
end

A MATLAB function must be saved in a text file with a .m extension. The name of that file must be the same as the function defined inside it. The name must start with a letter and cannot contain spaces. So, you will need to save the above code in a file called fahr_to_kelvin.m. Remember to save your m-files in the current directory.

The first line of our function is called the function definition, and it declares that we’re writing a function named fahr_to_kelvin, that has a single input parameter,ftemp, and a single output parameter, ktemp. Anything following the function definition line is called the body of the function. The keyword end marks the end of the function body, and the function won’t know about any code after end.

function-definition

A function can have multiple input and output parameters if required, but isn’t required to have any of either. The general form of a function is shown in the pseudo-code below:

function [out1, out2] = function_name(in1, in2)
    %FUNCTION_NAME   Function description

    % This section below is called the body of the function
    out1 = something calculated;
    out2 = something else;
end

Just as we saw with scripts, functions must be visible to MATLAB, i.e., a file containing a function has to be placed in a directory that MATLAB knows about. The most convenient of those directories is the current working directory.

GNU Octave

In common with MATLAB, Octave searches the current working directory and the path for functions called from the command line.

We can call our function from the command line like any other MATLAB function:

>> fahr_to_kelvin(32)
ans = 273.15

When we pass a value, like 32, to the function, the value is assigned to the variable ftemp so that it can be used inside the function. If we want to return a value from the function, we must assign that value to a variable named ktemp—in the first line of our function, we promised that the output of our function would be named ktemp.

Outside of the function, the variables ftemp and ktemp aren’t visible; they are only used by the function body to refer to the input and output values.

This is one of the major differences between scripts and functions: a script can be thought of as automating the command line, with full access to all variables in the base workspace, whereas a function can only read and write variables from the calling workspace if they are passed as arguments — i.e. a function has its own separate workspace.

Now that we’ve seen how to convert Fahrenheit to Kelvin, it’s easy to convert Kelvin to Celsius.

function ctemp = kelvin_to_celsius(ktemp)
    %KELVIN_TO_CELSIUS   Convert from Kelvin to Celcius

    ctemp = ktemp - 273.15;
end

Again, we can call this function like any other:

>> kelvin_to_celsius(0.0)
ans = -273.15

What about converting Fahrenheit to Celsius? We could write out the formula, but we don’t need to. Instead, we can compose the two functions we have already created:

function ctemp = fahr_to_celsius(ftemp)
    %FAHR_TO_CELSIUS   Convert Fahrenheit to Celcius

    ktemp = fahr_to_kelvin(ftemp);
    ctemp = kelvin_to_celsius(ktemp);
end

Calling this function,

>> fahr_to_celsius(32.0)

we get, as expected:

ans = 0

This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here—typically half a dozen to a few dozen lines—but they shouldn’t ever be much longer than that, or the next person who reads it won’t be able to understand what’s going on.

Concatenating in a Function

In MATLAB, we concatenate strings by putting them into an array or using the strcat function:

>> disp(['abra', 'cad', 'abra'])
abracadabra
>> disp(strcat('a', 'b'))
ab

Write a function called fence that has two parameters, original and wrapper and adds wrapper before and after original:

>> disp(fence('name', '*'))
*name*

Solution

function wrapped = fence(original, wrapper)
    %FENCE   Return original string, with wrapper prepended and appended

    wrapped = strcat(wrapper, original, wrapper);
end

Getting the Outside

If the variable s refers to a string, then s(1) is the string’s first character and s(end) is its last. Write a function called outer that returns a string made up of just the first and last characters of its input:

>> disp(outer('helium'))
hm

Solution

function ends = outer(s)
    %OUTER   Return first and last characters from a string

    ends = strcat(s(1), s(end));
end

Variables Inside and Outside Functions

Consider our function fahr_to_kelvin from earlier in the episode:

function ktemp = fahr_to_kelvin(ftemp)
  %FAHR_TO_KELVIN   Convert Fahrenheit to Kelvin
  ktemp = ((ftemp-32)*(5.0/9.0)) + 273.15;
end

What does the following code display when run — and why?

ftemp = 0
ktemp = 0

disp(fahr_to_kelvin(8))
disp(fahr_to_kelvin(41))
disp(fahr_to_kelvin(32))

disp(ktemp)

Solution

259.8167
278.1500
273.1500
0

ktemp is 0 because the function fahr_to_kelvin has no knowledge of the variable ktemp which exists outside of the function.

Once we start putting things in functions so that we can re-use them, we need to start testing that those functions are working correctly. To see how to do this, let’s write a function to center a dataset around a particular value:

function out = center(data, desired)
    out = (data - mean(data(:))) + desired;
end

We could test this on our actual data, but since we don’t know what the values ought to be, it will be hard to tell if the result was correct, Instead, let’s create a matrix of 0’s, and then center that around 3:

>> z = zeros(2,2);
>> center(z, 3)
ans =

   3   3
   3   3

That looks right, so let’s try out center function on our real data:

>> data = readmatrix('data/inflammation-01.csv');
>> centered = center(data(:), 0)

It’s hard to tell from the default output whether the result is correct–this is often the case when working with fairly large arrays–but, there are a few simple tests that will reassure us.

Let’s calculate some simple statistics:

>> disp([min(data(:)), mean(data(:)), max(data(:))])
0.00000    6.14875   20.00000

And let’s do the same after applying our center function to the data:

>> disp([min(centered(:)), mean(centered(:)), max(centered(:))])
   -6.1487   -0.0000   13.8513

That seems almost right: the original mean was about 6.1, so the lower bound from zero is now about -6.1. The mean of the centered data isn’t quite zero–we’ll explore why not in the challenges–but it’s pretty close. We can even go further and check that the standard deviation hasn’t changed:

>> std(data(:)) - std(centered(:))
5.3291e-15

The difference is very small. It’s still possible that our function is wrong, but it seems unlikely enough that we should probably get back to doing our analysis. We have one more task first, though: we should write some documentation for our function to remind ourselves later what it’s for and how to use it.

function out = center(data, desired)
    %CENTER   Center data around a desired value.
    %
    %       center(DATA, DESIRED)
    %
    %   Returns a new array containing the values in
    %   DATA centered around the value.

    out = (data  - mean(data(:))) + desired;
end

Comment lines immediately below the function definition line are called “help text”. Typing help function_name brings up the help text for that function:

>> help center
Center   Center data around a desired value.

    center(DATA, DESIRED)

Returns a new array containing the values in
DATA centered around the value.

Testing a Function

  1. Write a function called normalise that takes an array as input and returns an array of the same shape with its values scaled to lie in the range 0.0 to 1.0. (If L and H are the lowest and highest values in the input array, respectively, then the function should map a value v to (v - L)/(H - L).) Be sure to give the function a comment block explaining its use.

  2. Run help linspace to see how to use linspace to generate regularly-spaced values. Use arrays like this to test your normalise function.

Solution

  1. function out = normalise(in)
        %NORMALISE   Return original array, normalised so that the
        %            new values lie in the range 0 to 1.
    
        H = max(max(in));
        L = min(min(in));
        out = (in-L)/(H-L);
    end
    
  2. a = linspace(1, 10);   % Create evenly-spaced vector
    norm_a = normalise(a); % Normalise vector
    plot(a, norm_a)        % Visually check normalisation
    

Convert a script into a function

Write a function called plot_dataset which plots the three summary graphs (max, min, std) for a given inflammation data file.

The function should operate on a single data file, and should have two parameters: file_name and plot_switch. When called, the function should create the three graphs produced in the previous lesson. Whether they are displayed or saved to the results directory should be controlled by the value of plot_switch i.e. plot_dataset('data/inflammation-01.csv', 0) should display the corresponding graphs for the first data set; plot_dataset('data/inflammation-02.csv', 1) should save the figures for the second dataset to the results directory.

You should mostly be reusing code from the plot_all script.

Be sure to give your function help text.

Solution

function plot_dataset(file_name, plot_switch)
    %PLOT_DATASET    Perform analysis for named data file.
    %   Create figures to show average, max and min inflammation.
    %   Display plots in GUI using plot_switch = 0,
    %   or save to disk using plot_switch = 1.
    %
    %   Example:
    %       plot_dataset('data/inflammation-01.csv', 0)
    
    % Generate string for image name:
    img_name = replace(file_name, '.csv', '.png');
    img_name = replace(img_name, 'data', 'results');

    patient_data = readmatrix(file_name);
    
    if plot_switch == 1
    	figure('visible', 'off')
    else
    	figure('visible', 'on')
    end
    
    subplot(2, 2, 1)
    plot(mean(patient_data, 1))
    ylabel('average')
    
    subplot(2, 2, 2)
    plot(max(patient_data, [], 1))
    ylabel('max')
    
    subplot(2, 2, 3)
    plot(min(patient_data, [], 1))
    ylabel('min')
    
    if plot_switch == 1
        print(img_name, '-dpng')
        close()
    end
 end

Automate the analysis for all files

Modify the plot_all script so that as it loops over the data files, it calls the function plot_dataset for each file in turn. Your script should save the image files to the ‘results’ directory rather than displaying the figures in the MATLAB GUI.

Solution

%PLOT_ALL    Analyse all inflammation datasets
%   Create figures to show average, max and min inflammation.
%   Save figures to 'results' directory.

files = dir('data/inflammation-*.csv');

for i = 1:length(files)
    file_name = files(i).name;
    file_name = fullfile('data', file_name);

    % Process each data set, saving figures to disk.
    plot_dataset(file_name, 1);
end

We have now solved our original problem: we can analyze any number of data files with a single command. More importantly, we have met two of the most important ideas in programming:

  1. Use arrays to store related values, and loops to repeat operations on them.

  2. Use functions to make code easier to re-use and easier to understand.

Key Points

  • Break programs up into short, single-purpose functions with meaningful names.

  • Define functions using the function keyword.


Defensive Programming

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • How can I make sure my programs are correct?

Objectives
  • Explain what an assertion is.

  • Add assertions to programs that correctly check the program’s state.

  • Correctly add precondition and postcondition assertions to functions.

  • Explain what test-driven development is, and use it when creating new functions.

  • Explain why variables should be initialized using actual data values rather than arbitrary constants.

  • Debug code containing an error systematically.

Our previous lessons have introduced the basic tools of programming: variables and lists, file I/O, loops, conditionals, and functions. What they haven’t done is show us how to tell whether a program is getting the right answer, and how to tell if it’s still getting the right answer as we make changes to it.

To achieve that, we need to:

The good news is, doing these things will speed up our programming, not slow it down.

The first step toward getting the right answers from our programs is to assume that mistakes will happen and to guard against them. This is called defensive programming, and the most common way to do it is to add assertions to our code so that it checks itself as it runs. An assertion is simply a statement that something must be true at a certain point in a program. When MATLAB sees one, it checks the assertion’s condition. If it’s true, MATLAB does nothing, but if it’s false, MATLAB halts the program immediately and prints an error message. For example, this piece of code halts as soon as the loop encounters a value that isn’t positive:

numbers = [1.5, 2.3, 0.7, -0.001, 4.4];
total = 0;

for n = numbers
    assert(n >= 0)
    total = total + n;
end
error: assert (n >= 0) failed

Programs like the Firefox browser are full of assertions: 10-20% of the code they contain are there to check that the other 80-90% are working correctly. Broadly speaking, assertions fall into three categories:

For example, suppose we are representing rectangles using an array of four coordinates (x0, y0, x1, y1), such that (x0,y0) are the bottom left coordinates, and (x1,y1) are the top right coordinates. In order to do some calculations, we need to normalize the rectangle so that it is at the origin, measures 1.0 units on its longest axis, and is oriented so the longest axis is the y axis. Here is a function that does that, but checks that its input is correctly formatted and that its result makes sense:

function normalized = normalize_rectangle(rect)
    %NORMALIZE_RECTANGLE
    % Normalizes a rectangle so that it is at the origin
    % measures 1.0 units on its longest axis
    % and is oriented with the longest axis in the y direction:

    assert(length(rect) == 4, 'Rectangle must contain 4 coordinates')

    x0 = rect(1);
    y0 = rect(2);
    x1 = rect(3);
    y1 = rect(4);

    assert(x0 < x1, 'Invalid X coordinates')
    assert(y0 < y1, 'Invalid Y coordinates')

    dx = x1 - x0;
    dy = y1 - y0;

    if dx > dy
        scaled = dx/dy;
        upper_x = scaled;
        upper_y = 1.0;
    else
        scaled = dx/dy;
        upper_x = scaled;
        upper_y = 1.0;
    end

    assert ((0 < upper_x) && (upper_x <= 1.0), 'Calculated upper X coordinate invalid');
    assert ((0 < upper_y) && (upper_y <= 1.0), 'Calculated upper Y coordinate invalid');

    normalized = [0, 0, upper_x, upper_y];
end

The three preconditions catch invalid inputs:

>> normalize_rectangle([0, 0, 1])
Error using normalize_rectangle (line 6)
Rectangle must contain 4 coordinates
>> normalize_rectangle([1,0,0,0,0])
Error: Rectangle must contain 4 coordinates
>> normalize_rectangle([1, 0, 0, 2]);
error: Invalid X coordinates

The post-conditions help us catch bugs by telling us when our calculations cannot have been correct. For example, if we normalize a rectangle that is taller than it is wide, everything seems OK:

>> normalize_rectangle([0, 0, 1.0, 5.0]);
0.00000   0.00000   0.20000   1.00000

but if we normalize one that’s wider than it is tall, the assertion is triggered:

>> normalize_rectangle([0.0, 0.0, 5.0, 1.0])
error: Calculated upper X coordinate invalid

Re-reading our function, we realize that line 19 should divide dy by dx. If we had left out the assertion at the end of the function, we would have created and returned something that had the right shape as a valid answer, but wasn’t. Detecting and debugging that would almost certainly have taken more time in the long run than writing the assertion.

But assertions aren’t just about catching errors: they also help people understand programs. Each assertion gives the person reading the program a chance to check (consciously or otherwise) that their understanding matches what the code is doing.

Most good programmers follow two rules when adding assertions to their code. The first is, fail early, fail often. The greater the distance between when and where an error occurs and when it’s noticed, the harder the error will be to debug, so good code catches mistakes as early as possible.

The second rule is, turn bugs into assertions or tests. If you made a mistake in a piece of code, the odds are good that you have made other mistakes nearby, or will make the same mistake (or a related one) the next time you change it. Writing assertions to check that you haven’t regressed (i.e., haven’t re-introduced an old problem) can save a lot of time in the long run, and helps to warn people who are reading the code (including your future self) that this bit is tricky.

Finding Conditions

Suppose you are writing a function called average that calculates the average of the numbers in a list. What pre-conditions and post-conditions would you write for it? Compare your answer to your neighbor’s: can you think of an input array that will pass your tests but not hers or vice versa?

An assertion checks that something is true at a particular point in the program. The next step is to check the overall behavior of a piece of code, i.e., to make sure that it produces the right output when it’s given a particular input. For example, suppose we need to find where two or more time series overlap. The range of each time series is represented as a pair of numbers, which are the time the interval started and ended. The output is the largest range that they all include:

Overlapping Ranges

Most novice programmers would solve this problem like this:

  1. Write a function range_overlap.
  2. Call it interactively on two or three different inputs.
  3. If it produces the wrong answer, fix the function and re-run that test.

This clearly works–after all, thousands of scientists are doing it right now–but there’s a better way:

  1. Write a short function for each test.
  2. Write a range_overlap function that should pass those tests.
  3. If range_overlap produces any wrong answers, fix it and re-run the test functions.

Writing the tests before writing the function they exercise is called test-driven development (TDD). Its advocates believe it produces better code faster because:

  1. If people write tests after writing the thing to be tested, they are subject to confirmation bias, i.e., they subconsciously write tests to show that their code is correct, rather than to find errors.

  2. Writing tests helps programmers figure out what the function is actually supposed to do.

Below are three test functions for range_overlap, but first we need a brief aside to explain some MATLAB behaviour.

  1. Argument format required for assert

    assert requires a scalar logical input argument (i.e. true (1) or false(0)).

  2. Testing for equality between matrices

    The syntax matA == matB returns a matrix of logical values. If we just want a true or false answer for the whole matrix (e.g. to use with assert), we need to use isequal instead.

Looking For Help

For a more detailed explanation, search the MATLAB help files (or type doc eq; doc assert at the command prompt).

%TEST_RANGE_OVERLAP   Test script for range_overlap function.

% assert(isnan(range_overlap([0.0, 1.0], [5.0, 6.0])))
% assert(isnan(range_overlap([0.0, 1.0], [1.0, 2.0])))
assert(isequal(range_overlap([0, 1.0]),[0, 1.0]))
assert(isequal(range_overlap([2.0, 3.0], [2.0, 4.0]),[2.0, 3.0]))
assert(isequal(range_overlap([0.0, 1.0], [0.0, 2.0], [-1.0, 1.0]),[0.0, 1.0]))
test_range_overlap
Undefined function or variable 'range_overlap'.

Error in test_range_overlap (line 6)
assert(isequal(range_overlap([0, 1.0]),[0, 1.0]))

The error is actually reassuring: we haven’t written range_overlap yet, so if the tests passed, it would be a sign that someone else had and that we were accidentally using their function.

And as a bonus of writing these tests, we’ve implicitly defined what our inputs and output look like: we expect two or more input arguments, each of which is a vector with length = 2, and we return a single vector as output.

Given that range_overlap will have to accept varying numbers of input arguments, we need to learn how to deal with an unknown number of input arguments before we can write range_overlap. Consider the example below from the MATLAB documentation:

function varlist(varargin)
   %VARLIST Demonstrate variable number of input arguments

   fprintf('Number of arguments: %d\n',nargin)
   celldisp(varargin)
varlist(ones(3),'some text',pi)
Number of arguments: 3

varargin{1} =
     1     1     1
     1     1     1
     1     1     1
 
varargin{2} =
some text

varargin{3} =
    3.1416

MATLAB has a special variable varargin which can be used as the last parameter in a function definition to represent a variable number of inputs. Within the function varargin is a cell array containing the input arguments, and the variable nargin gives the number of input arguments used during the function call. A cell array is a MATLAB data type with indexed data containers called cells. Each cell can contain any type of data, and is indexed using braces, or “curly brackets” {}.

Variable number of input arguments

Using what we have just learned about varargin and nargin fill in the blanks below to complete the first draft of our range_overlap function.

function overlap = range_overlap(________)
    %RANGE_OVERLAP Return common overlap among a set of [low, high] ranges.

    lowest  = -inf;
    highest = +inf;

    % Loop over input arguments
    for i = 1:______
        % Set range equal to each input argument
        range   = ________{i};
        low     = range(1);
        high    = range(2);
        lowest  = max(lowest, low);
        highest = min(highest, high);
    end

    overlap = [lowest, highest];

Solution

function overlap = range_overlap(varargin)
    %RANGE_OVERLAP Return common overlap among a set of [low, high] ranges.

    lowest  = -inf;
    highest = +inf;

    % Loop over input arguments
    for i = 1:nargin
        % Set range equal to each input argument
        range   = varargin{i};
        low     = range(1);
        high    = range(2);
        lowest  = max(lowest, low);
        highest = min(highest, high);
    end

    overlap = [lowest, highest];

Now that we have written the function range_overlap, when we run the tests:

test_range_overlap

we shouldn’t see an error.

Something important is missing, though. We don’t have any tests for the case where the ranges don’t overlap at all, or for the case where the ranges overlap at a point. We’ll leave this as a final assignment.

Fixing Code Through Testing

Fix range_overlap. Uncomment the two commented lines in test_range_overlap. You’ll see that our objective is to return a special value: NaN (Not a Number), for the following cases:

  • The ranges don’t overlap.
  • The ranges overlap at their endpoints.

As you make change to the code, run test_range_overlap regularly to make sure you aren’t breaking anything. Once you’re done, running test_range_overlap shouldn’t raise any errors.

Hint: read about the isnan function in the help files to make sure you understand what these first two lines are doing.

Solution

function overlap = range_overlap(varargin)
    %RANGE_OVERLAP Return common overlap among a set of [low, high] ranges.

    lowest = -inf;
    highest = +inf;

    % Loop over input arguments
    for i = 1:nargin
        % Set range equal to each input argument
        range   = varargin{i};
        low     = range(1);
        high    = range(2);
        lowest  = max(lowest, low);
        highest = min(highest, high);
        
        % Catch non-overlapping ranges
	    if low >= highest || high<=lowest
            output_range = NaN;
            return
	    end
    end

    overlap = [lowest, highest];

Debugging strategies

Once testing has uncovered problems, the next step is to fix them. Many novices do this by making more-or-less random changes to their code until it seems to produce the right answer, but that’s very inefficient (and the result is usually only correct for the one case they’re testing). The more experienced a programmer is, the more systematically they debug, and most follow some variation on the rules explained below.

The first step in debugging something is to know what it’s supposed to do. “My program doesn’t work” isn’t good enough: in order to diagnose and fix problems, we need to be able to tell correct output from incorrect. If we can write a test case for the failing case—i.e., if we can assert that with these inputs, the function should produce that result—then we’re ready to start debugging. If we can’t, then we need to figure out how we’re going to know when we’ve fixed things.

But writing test cases for scientific software is frequently harder than writing test cases for commercial applications, because if we knew what the output of the scientific code was supposed to be, we wouldn’t be running the software: we’d be writing up our results and moving on to the next program. In practice, scientists tend to do the following:

  1. Test with simplified data. Before doing statistics on a real data set, we should try calculating statistics for a single record, for two identical records, for two records whose values are one step apart, or for some other case where we can calculate the right answer by hand.

  2. Test a simplified case. If our program is supposed to simulate magnetic eddies in rapidly-rotating blobs of supercooled helium, our first test should be a blob of helium that isn’t rotating, and isn’t being subjected to any external electromagnetic fields. Similarly, if we’re looking at the effects of climate change on speciation, our first test should hold temperature, precipitation, and other factors constant.

  3. Compare to an oracle. A test oracle is something—experimental data, an older program whose results are trusted, or even a human expert—against which we can compare the results of our new program. If we have a test oracle, we should store its output for particular cases so that we can compare it with our new results as often as we like without re-running that program.

  4. Check conservation laws. Mass, energy, and other quantities are conserved in physical systems, so they should be in programs as well. Similarly, if we are analyzing patient data, the number of records should either stay the same or decrease as we move from one analysis to the next (since we might throw away outliers or records with missing values). If “new” patients start appearing out of nowhere as we move through our pipeline, it’s probably a sign that something is wrong.

  5. Visualize. Data analysts frequently use simple visualizations to check both the science they’re doing and the correctness of their code (just as we did in the opening lesson of this tutorial). This should only be used for debugging as a last resort, though, since it’s very hard to compare two visualizations automatically.

We can only debug something when it fails, so the second step is always to find a test case that makes it fail every time. The “every time” part is important because few things are more frustrating than debugging an intermittent problem: if we have to call a function a dozen times to get a single failure, the odds are good that we’ll scroll past the failure when it actually occurs.

As part of this, it’s always important to check that our code is “plugged in”, i.e., that we’re actually exercising the problem that we think we are. Every programmer has spent hours chasing a bug, only to realize that they were actually calling their code on the wrong data set or with the wrong configuration settings, or are using the wrong version of the software entirely. Mistakes like these are particularly likely to happen when we’re tired, frustrated, and up against a deadline, which is one of the reasons late-night (or overnight) coding sessions are almost never worthwhile.

If it takes 20 minutes for the bug to surface, we can only do three experiments an hour. That doesn’t mean we’ll get less data in more time: we’re also more likely to be distracted by other things as we wait for our program to fail, which means the time we are spending on the problem is less focused. It’s therefore critical to make it fail fast.

As well as making the program fail fast in time, we want to make it fail fast in space, i.e., we want to localize the failure to the smallest possible region of code:

  1. The smaller the gap between cause and effect, the easier the connection is to find. Many programmers therefore use a divide and conquer strategy to find bugs, i.e., if the output of a function is wrong, they check whether things are OK in the middle, then concentrate on either the first or second half, and so on.

  2. N things can interact in N2 different ways, so every line of code that isn’t run as part of a test means more than one thing we don’t need to worry about.

Replacing random chunks of code is unlikely to do much good. (After all, if you got it wrong the first time, you’ll probably get it wrong the second and third as well.) Good programmers therefore change one thing at a time, for a reason. They are either trying to gather more information (“is the bug still there if we change the order of the loops?”) or test a fix (“can we make the bug go away by sorting our data before processing it?”).

Every time we make a change, however small, we should re-run our tests immediately, because the more things we change at once, the harder it is to know what’s responsible for what (those N2 interactions again). And we should re-run all of our tests: more than half of fixes made to code introduce (or re-introduce) bugs, so re-running all of our tests tells us whether we have regressed.

Good scientists keep track of what they’ve done so that they can reproduce their work, and so that they don’t waste time repeating the same experiments or running ones whose results won’t be interesting. Similarly, debugging works best when we keep track of what we’ve done and how well it worked. If we find ourselves asking, “Did left followed by right with an odd number of lines cause the crash? Or was it right followed by left? Or was I using an even number of lines?” then it’s time to step away from the computer, take a deep breath, and start working more systematically.

Records are particularly useful when the time comes to ask for help. People are more likely to listen to us when we can explain clearly what we did, and we’re better able to give them the information they need to be useful.

Version control is often used to reset software to a known state during debugging, and to explore recent changes to code that might be responsible for bugs. In particular, most version control systems have a blame command that will show who last changed particular lines of code…

And speaking of help: if we can’t find a bug in 10 minutes, we should be humble and ask for help. Just explaining the problem aloud is often useful, since hearing what we’re thinking helps us spot inconsistencies and hidden assumptions.

Asking for help also helps alleviate confirmation bias. If we have just spent an hour writing a complicated program, we want it to work, so we’re likely to keep telling ourselves why it should, rather than searching for the reason it doesn’t. People who aren’t emotionally invested in the code can be more objective, which is why they’re often able to spot the simple mistakes we have overlooked.

Part of being humble is learning from our mistakes. Programmers tend to get the same things wrong over and over: either they don’t understand the language and libraries they’re working with, or their model of how things work is wrong. In either case, taking note of why the error occurred and checking for it next time quickly turns into not making the mistake at all.

And that is what makes us most productive in the long run. As the saying goes, A week of hard work can sometimes save you an hour of thought. If we train ourselves to avoid making some kinds of mistakes, to break our code into modular, testable chunks, and to turn every assumption (or mistake) into an assertion, it will actually take us less time to produce working programs, not more.

Key Points

  • Use assertions to catch errors, and to document what behavior is expected.

  • Write tests before code.