12 Repetitive Jobs

Objective: To learn tools that let you repeat instructions without copying and pasting the same code

We will cover:
  • how to use functions (review)
  • making your own function with function(){}
  • loops
  • vectorization
  • the apply family (apply(), lapply())

12.1 Functions

Recall: A quick note on functions

  • Functions are a group of statements that together perform a specific task

  • Functions have a name, e.g. setwd, getwd

  • We can “call” functions (also called commands in this context) in R for convenience (so we don’t have to rewrite the code)

  • Use ? or help() to find information about a particular function

?print
## starting httpd help server ... done
help(print)
  • Functions have the format: function_name( arguments )
  • “Arguments” are required or optional parameters used by the function to accomplish the action
  • Functions can “return” a value (or not)
# Type a comma and press tab to see arguments (function parameters) in RStudio
print(x = "Hello world")
## [1] "Hello world"
print(x = 33.9431249, digits = 4)
## [1] 33.94
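
Arguments can be matched by position or by name. As a small sketch, the built-in seq() function (not used elsewhere in this chapter; its arguments from, to and by are chosen here purely for illustration) gives the same result either way:

```r
# Positional matching: values are assigned to from, to, by in that order
by_position <- seq(1, 10, 3)
by_position
## [1]  1  4  7 10

# Named matching: with names, the order no longer matters
by_name <- seq(by = 3, to = 10, from = 1)
by_name
## [1]  1  4  7 10
```

Naming arguments tends to make calls easier to read, especially for functions with many optional parameters.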

How to define a function:

name_of_function <- function(list_of_arguments) {
  # code to do what function wants
}

  • arguments/args: names of the variables you will use in your function (separated by commas, within parentheses)
  • return value: the value that is returned by a function (optional) - use return()

Make a function using function()

Make a function that adds 2 values, x and y

add <- function(x, y) { # arguments within parentheses
  x.y.sum <- x + y    # body of function within {}
  return(x.y.sum)     # return value specified in return()
}

Call your function the same way you use any other.

add(4,5) #returns 9; if not saved to a variable, it will print
## [1] 9
# add(4) # Error in add(4) : argument "y" is missing, with no default

Automatic returns: In R, an explicit return statement is not required. A function automatically returns the value of the last expression evaluated in its body.
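
A minimal sketch of a function that relies on the automatic return (the name add_implicit is made up for this illustration):

```r
# Same behaviour as add(), but without an explicit return()
add_implicit <- function(x, y) {
  x + y   # the value of the last evaluated expression is returned
}

add_implicit(4, 5)
## [1] 9
```

An explicit return() is still useful when you want to leave a function early, or simply to make the returned value obvious to a reader.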

Default arguments: Give an argument a default value, which is used when the user does not specify it.

# Define a function
add <- function(x, y=0) { # x is required argument, but if y is not specified, it is 0
  x.y.sum <- x + y    
  return(x.y.sum)    
}
# Use function
add(4,5)
## [1] 9
add(4)
## [1] 4

Define a function lb.to.kg that converts weights from pounds (lb) to kilograms (kg):

# for an approximate result, divide the mass value by 2.205
lb.to.kg <- function(wt_lb) { # argument within parentheses
  wt_kg <- wt_lb/2.205      # body of function within {}
  wt_kg <- round(wt_kg, digits = 3)
  return(wt_kg)           # return value specified in return()
}

# Let's try running our function. Calling our own function is no different from calling any other function:
lb.to.kg(32)
## [1] 14.512
lb.to.kg(212)
## [1] 96.145

Summary:

  • Define a function using name <- function(args…) { body… }
  • Call a function using name(values…)
  • This gives a taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want.
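
As a rough sketch of combining basic operations into a larger one: kg.to.g and lb.to.g below are hypothetical helpers invented for this illustration, and lb.to.kg is redefined (in a shorter form) so the block runs on its own.

```r
# Basic operation from this section: pounds to kilograms
lb.to.kg <- function(wt_lb) {
  round(wt_lb / 2.205, digits = 3)
}

# Hypothetical second basic operation: kilograms to grams
kg.to.g <- function(wt_kg) {
  wt_kg * 1000
}

# A larger chunk built by combining the two smaller functions
lb.to.g <- function(wt_lb) {
  kg.to.g(lb.to.kg(wt_lb))
}

lb.to.g(32)
## [1] 14512
```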

12.2 Loops

Based on following tutorial: https://swcarpentry.github.io/r-novice-inflammation/03-loops-R/index.html

Suppose we want to print each word in a sentence. One way is to use one print statement per word, here three:

# Make a character vector
sentence1 <-  c("Try", "printing", "this")

# Make a function to print each value in sentence vector
print_sentence <- function(sentence) {
  print(sentence[1])
  print(sentence[2])
  print(sentence[3])
}
# Call function
print_sentence(sentence1)
## [1] "Try"
## [1] "printing"
## [1] "this"

However, this function is fragile and doesn’t scale: if the sentence argument has more than 3 values, only the first 3 are printed.

# Make another sentence vector
sentence2 <- c("Let", "the", "computer", "do", "the", "work")
# Call function
print_sentence(sentence2)
## [1] "Let"
## [1] "the"
## [1] "computer"

Or, if the sentence is shorter, NAs (missing values) are printed for the positions that do not exist.

# Make another sentence vector
sentence3 <- c("Try", "this")
# Call function
print_sentence(sentence3)
## [1] "Try"
## [1] "this"
## [1] NA

Loop - All modern programming languages provide special constructs that allow for the repetition of instructions or blocks of instructions.

  • There are 3 main types of loops in R: for, while, repeat
  • You can use break to exit a loop early and next to skip the current iteration, usually depending on whether a test (i.e. a condition) is passed

Read more: https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r

The ‘for’ loop construct

for (variable in collection) {
  # do things with variable
}

E.g. make a for loop

# Make a for loop to print out each element in sentence1
for (word in sentence1){ # word will take on the value of "sentence1" element by element until there are no elements left
  # print the current word
  print(word)
}
## [1] "Try"
## [1] "printing"
## [1] "this"

Add a condition

# Make a for loop to print out each element in sentence1
for (word in sentence1){ # word will take on the value of "sentence1" element by element until there are no elements left
  # Skip current iteration if the current word is equal to  "printing"
  if(word == "printing"){
    next;
  }
  # print the current word
  print(word)
}
## [1] "Try"
## [1] "this"
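
While next skips only the current iteration, break stops the loop altogether. A small sketch, assuming we want to stop at the word "do" (the vector printed is added only to keep a record of what was printed):

```r
sentence2 <- c("Let", "the", "computer", "do", "the", "work")
printed <- c()   # keeps a record of the words we printed

for (word in sentence2) {
  # Stop the whole loop as soon as we reach "do"
  if (word == "do") {
    break
  }
  print(word)
  printed <- c(printed, word)
}
## [1] "Let"
## [1] "the"
## [1] "computer"
```

The words after "do" are never visited, because break ends the loop rather than moving on to the next element.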

E.g. make a while loop

# Initialize an index counter
i <- 1 

# Make a while loop to print out each element in sentence1
# Note: length() is the number of elements in a vector or list
while (i <= length(sentence1)){ # this condition is tested before each iteration; the code within {} runs until it is FALSE
  # Define word variable by indexing sentence1
  word <- sentence1[i]
  # print the current word
  print(word)
  # Update counter by 1 - i.e. move to the next index - must do this or you'll end up with an infinite loop!
  i <- i+1
}
## [1] "Try"
## [1] "printing"
## [1] "this"
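
The third loop type, repeat, has no condition of its own, so it must be stopped with break. A minimal sketch that prints the same words as the while loop:

```r
sentence1 <- c("Try", "printing", "this")
i <- 1

repeat {
  # Without this break the loop would never end
  if (i > length(sentence1)) {
    break
  }
  print(sentence1[i])
  i <- i + 1
}
## [1] "Try"
## [1] "printing"
## [1] "this"
```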

Applications:

  • processing/analyzing multiple files (see the linked tutorial)
  • plotting graphs from columns in the same data frame

# eg. Use a for loop to apply same preprocessing steps to multiple csv files
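for (filename in list.files(pattern = "\\.csv$")){ # pattern is a regular expression; this one matches names ending in .csv
  # Read csv file into a variable called df
  df <- read.csv(filename)
  
  # Skip files that contain no data rows
  if(nrow(df) == 0){
    next
  }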
  
  # downstream processing df
}
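
To try the file-processing pattern without real data, here is a self-contained sketch. The folder name, file names and values are all made up for illustration; the loop records the number of rows in each file.

```r
# Create two small example files in a temporary folder (illustrative data only)
example_dir <- file.path(tempdir(), "csv_loop_example")
dir.create(example_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
          file.path(example_dir, "sample_a.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:2, value = c(1.0, 9.9)),
          file.path(example_dir, "sample_b.csv"), row.names = FALSE)

# An empty list to collect one result per file
row_counts <- list()

# full.names = TRUE returns paths that read.csv() can open directly
for (filename in list.files(example_dir, pattern = "\\.csv$", full.names = TRUE)) {
  df <- read.csv(filename)
  # Store the number of rows under the file's base name
  row_counts[[basename(filename)]] <- nrow(df)
}

row_counts
```

Collecting results in a list keyed by file name is one possible design; it keeps each result traceable to the file it came from.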

12.3 Vectorization

  • Alternative to looping
  • Vectorization is a feature of R that lets you apply an operation to all elements of a vector (or other data structure) at once
  • This makes code shorter, more readable, and usually faster than an explicit loop
  • Many functions in R are vectorized, such as colSums(), rowMeans(), and even basic arithmetic operators (e.g. +, -, *)

e.g. Power of vectorization: Add 2 to all values

# Define a numeric vector called values
values <- c(3,5,6,10)

# In many programming languages you may have to run a loop to do this
for(value in values){
  new_value <- value + 2
  print(new_value)
}
## [1] 5
## [1] 7
## [1] 8
## [1] 12
# But R lets you add 2 to all values at once - this is called vectorization
values + 2
## [1]  5  7  8 12
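
Vectorization also applies to operations between two vectors and to functions such as colSums() and rowMeans(). A small sketch; other_values and the matrix m are made-up examples:

```r
values <- c(3, 5, 6, 10)
other_values <- c(1, 2, 3, 4)   # hypothetical second vector

# Arithmetic between two vectors works element by element
values * other_values
## [1]  3 10 18 40

# colSums() and rowMeans() work on a whole matrix at once
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column by column
colSums(m)
## [1]  3  7 11
rowMeans(m)
## [1] 3 4
```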

12.4 Intro to apply

  • apply-family functions build on the same idea: they apply a function to every element (or row/column) of a data structure without an explicit loop

  • functions: apply(), sapply(), lapply(), mapply(), rapply(), tapply(), vapply()

  • within these commands, you may apply a function or operation to the values

  • put the name of the function into FUN (no parentheses) or define your own function

  • most frequently used are:

  1. lapply (“list” apply)
  • applies a function on 1D data - list or vector
  • returns a list
  • note: use unlist() to convert the resulting list to vector
# e.g. print vector elements
lapply(sentence1, FUN = print)
## [1] "Try"
## [1] "printing"
## [1] "this"
## [[1]]
## [1] "Try"
## 
## [[2]]
## [1] "printing"
## 
## [[3]]
## [1] "this"
# e.g. add 2 to each element by making your own function
lapply(values, FUN = function(value){
  value + 2
})
## [[1]]
## [1] 5
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] 8
## 
## [[4]]
## [1] 12
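
To get a vector back instead of a list, you can wrap the result in unlist(), or use sapply(), which simplifies the output when it can. A minimal sketch reusing the same values:

```r
values <- c(3, 5, 6, 10)

# lapply() returns a list ...
result_list <- lapply(values, FUN = function(value) {
  value + 2
})

# ... which unlist() flattens into a vector
result_vector <- unlist(result_list)
result_vector
## [1]  5  7  8 12

# sapply() returns a vector directly when each result has length 1
sapply(values, FUN = function(value) {
  value + 2
})
## [1]  5  7  8 12
```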
  2. apply
  • applies a function to 2D data - matrix or data frame
  • specify whether the function should be applied to rows or columns (using the MARGIN argument: 1 = rows, 2 = columns)
# Read input table
df <- read.delim(file = "07-Casey-mamm-mouse-proteome-sample.txt", row.names = 1)

# Get standard deviation of columns
apply(df, MARGIN = 2, FUN = sd)
##     BC_E_1     BC_E_2    BC_EP_1    BC_EP_2     LP_E_1     LP_E_2    LP_EP_1 
##  216497329  250258435  175483149  189537235  359922127  271978322  212250113 
##    LP_EP_2     LM_E_1     LM_E_2    LM_EP_1    LM_EP_2 
##  238676624 1550773037 1494197893  687720907  625922913
# Transform data to z-scores (which account for sd of each row)
z_scores <- apply(df, MARGIN = 1, FUN = function(x){ # in this function, x is a numeric vector of each row
  (x-mean(x))/sd(x)
})
# Look at first 6 rows and 6 columns
z_scores[1:6, 1:6]
##              Abcb10      Abcg2       Abhd4      Abhd5       Acox1        Ada
## BC_E_1  -0.88084429 -0.8165888  0.02080524 -0.6218765  0.34686146 -0.8860501
## BC_E_2  -0.88084429 -0.8165888  0.05777697 -0.6218765  0.50786325 -0.9420582
## BC_EP_1 -0.88084429 -0.3873079  1.86485582 -0.6218765  0.01997860 -1.0392117
## BC_EP_2 -0.88084429 -0.6085926  1.26884306 -0.6218765 -0.05682232 -0.9485052
## LP_E_1  -0.05139152  1.7598003 -1.06841075 -0.5106032 -0.59603606 -0.3825228
## LP_E_2  -0.34103905  0.7103968 -1.06841075 -0.2380836 -0.13543827  0.1221950
# Note: apply() stores each row's result as a column, so the output is transposed relative to df; transpose back using t()
z_scores <- t(z_scores)
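
The transposition is easier to see on a tiny example. The matrix below is made up for illustration (two hypothetical genes in rows, three samples in columns) and is not taken from the proteome file:

```r
# Hypothetical 2 x 3 matrix: rows are genes, columns are samples
m <- matrix(c(1, 2, 3,
              10, 20, 30),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Row-wise z-scores: each row's result becomes a COLUMN of the output
z <- apply(m, MARGIN = 1, FUN = function(x) {
  (x - mean(x)) / sd(x)
})
dim(z)
## [1] 3 2

# Transpose to get back the original orientation (genes in rows)
z <- t(z)
z
##       s1 s2 s3
## geneA -1  0  1
## geneB -1  0  1
```

Both rows give the same z-scores even though their raw values differ by a factor of ten, because each row is centered on its own mean and scaled by its own standard deviation.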

12.5 Practice

Converting cm to in

  a) Define a function called “cm_to_in”. This function takes in a numeric variable and converts it from centimeters to inches by dividing the length value by 2.54. Round to 2 decimal places using round(). Return the result.
  b) Test your function on any number.
  c) Make a numeric vector of 5 different values called “cm_measurements”.
  d) In a loop:
     - convert each value of the cm vector to inches using your function from a) and save it to a variable called “in_inches”
     - print “in_inches” to the console using print()
  e) Use the lapply() function to apply your function to each element.
  f) Pass your whole vector into your function. What’s the result? This is called vectorization.

Solution

# a) make a function
cm_to_in <- function(value){
  # Divide value by 2.54
  result <- value/2.54
  # Round to 2 decimal places
  result <- round(result, digits = 2)
  return(result)
}
# b) pass in single numeric value to function
cm_to_in(45)
## [1] 17.72
# c) make a numeric vector
cm_measurements <- c(35,63,53,67,32)
# d)  make for loop
for(in_cm in cm_measurements){
  # Save output of function to variable
  in_inches <- cm_to_in(in_cm)
  # Print
  print(in_inches)
}
## [1] 13.78
## [1] 24.8
## [1] 20.87
## [1] 26.38
## [1] 12.6
# e) use lapply() on vector
lapply(cm_measurements, FUN = cm_to_in)
## [[1]]
## [1] 13.78
## 
## [[2]]
## [1] 24.8
## 
## [[3]]
## [1] 20.87
## 
## [[4]]
## [1] 26.38
## 
## [[5]]
## [1] 12.6
# f) pass in whole vector to function
cm_to_in(cm_measurements)
## [1] 13.78 24.80 20.87 26.38 12.60