12 Repetitive Jobs
Objective: To learn about tools that let you repeat instructions without copying and pasting the same code
We will cover:
- how to use functions (review)
- make your own function(){}
- loops
- vectorization
- apply family (apply(), lapply())
12.1 Functions
Recall: A quick note on functions
Functions are a group of statements that together perform a specific task
Functions have a name, e.g. setwd, getwd
We can “call” functions (also called commands in this context) in R for convenience (so we don’t have to rewrite the code)
Use ? or help() to find information about a particular function
?print
## starting httpd help server ... done
help(print)
- Functions have the format: function_name( arguments )
- “Arguments” are required or optional parameters used by the function to accomplish the action
- Functions can “return” a value (or not)
# Type a comma and press tab to see arguments (function parameters) in RStudio
print(x = "Hello world")
## [1] "Hello world"
print(x = 33.9431249, digits = 4)
## [1] 33.94
How to define a function:
name_of_function <- function(list_of_arguments){
  # code to do what function wants
}
- arguments/args: names of variables you will use in your function (separated by commas, within parentheses)
- return value: the value that is returned by a function (optional) - use return()
Make a function using function()
Make a function that adds 2 values, x and y
add <- function(x, y) { # arguments within parentheses
  x.y.sum <- x + y # body of function within {}
  return(x.y.sum) # return value specified in return()
}
Call your function the same way you use any other.
add(4,5) #returns 9; if not saved to a variable, it will print
## [1] 9
# add(4) # Error in add(4) : argument "y" is missing, with no default
Automatic Returns - In R, it is not necessary to include the return statement. A function automatically returns the value of the last expression evaluated in its body.
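As a minimal sketch of automatic returns, the two functions below are hypothetical helpers (add_implicit and safe_divide are not part of the original lesson). The first relies on the last expression being returned; the second shows that return() is still handy for leaving a function early.

```r
# Hypothetical helper: the value of the last expression is returned automatically
add_implicit <- function(x, y) {
  x + y # no return() needed
}

# Hypothetical helper: return() exits the function early
safe_divide <- function(x, y) {
  if (y == 0) {
    return(NA) # stop here when dividing by zero
  }
  x / y # otherwise the last expression is returned
}

add_implicit(4, 5)
safe_divide(10, 2)
safe_divide(10, 0)
```

The three calls give 9, 5 and NA respectively.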
Default arguments - Set a default value for an argument; it is used when the user does not specify that argument.
# Define a function
add <- function(x, y=0) { # x is required argument, but if y is not specified, it is 0
  x.y.sum <- x + y
  return(x.y.sum)
}
# Use function
add(4,5)
## [1] 9
add(4)
## [1] 4
Define a function lb.to.kg that converts weights from pounds (lb) to kilogram (kg):
# for an approximate result, divide the mass value by 2.205
lb.to.kg <- function(wt_lb) { # argument within parentheses
  wt_kg <- wt_lb/2.205 # body of function within {}
  wt_kg <- round(wt_kg, digits = 3)
  return(wt_kg) # return value specified in return()
}
# Let's try running our function. Calling our own function is no different from calling any other function:
lb.to.kg(32)
## [1] 14.512
lb.to.kg(212)
## [1] 96.145
Summary:
- Define a function using name <- function(args…) { body… }
- Call a function using name(values…)
- This gives a taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want.
12.2 Loops
Based on following tutorial: https://swcarpentry.github.io/r-novice-inflammation/03-loops-R/index.html
Suppose we want to print each word in a sentence. One way is to use one print statement per word:
# Make a character vector
sentence1 <- c("Try", "printing", "this")
# Make a function to print each value in sentence vector
print_sentence <- function(sentence) {
  print(sentence[1])
  print(sentence[2])
  print(sentence[3])
}
# Call function
print_sentence(sentence1)
## [1] "Try"
## [1] "printing"
## [1] "this"
However, this function is fragile and does not scale: if the sentence argument has more than 3 values, only the first 3 are printed.
# Make another sentence vector
sentence2 <- c("Let", "the", "computer", "do", "the", "work")
# Call function
print_sentence(sentence2)
## [1] "Let"
## [1] "the"
## [1] "computer"
Or if the sentence is shorter, NAs (missing values) are introduced.
# Make another sentence vector
sentence3 <- c("Try", "this")
# Call function
print_sentence(sentence3)
## [1] "Try"
## [1] "this"
## [1] NA
Loop - All modern programming languages provide special constructs that allow for the repetition of instructions or blocks of instructions.
- There are 3 main types of loops in R: for, while, repeat
- You can add break to exit a loop early, or next to skip the current iteration when a condition is met
Read more: https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r
The ‘for’ loop construct
for (variable in collection) {
# do things with variable
}
Eg. make a for loop
# Make a for loop to print out each element in sentence1
for (word in sentence1){ # word will take on the value of "sentence1" element by element until there are no elements left
# print the current word
print(word)
}
## [1] "Try"
## [1] "printing"
## [1] "this"
Add a condition
# Make a for loop to print out each element in sentence1, except "printing"
for (word in sentence1){ # word will take on the value of "sentence1" element by element until there are no elements left
  # Skip current iteration if the current word is equal to "printing"
  if(word == "printing"){
    next
  }
  # print the current word
  print(word)
}
## [1] "Try"
## [1] "this"
Eg. make a while loop
# Initialize an index counter
i <- 1
# Make a while loop to print out each element in sentence1
# Note: length() is the number of elements in a vector or list
while (i <= length(sentence1)){ # this condition is tested before each pass; the code within {} runs until it is FALSE
  # Define word variable by indexing sentence1
  word <- sentence1[i]
  # print the current word
  print(word)
  # Update counter by 1 - i.e. move to the next index - must do this or you'll end up with an infinite loop!
  i <- i+1
}
## [1] "Try"
## [1] "printing"
## [1] "this"
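The third loop type listed earlier, repeat, has no condition of its own, so it depends on break to stop. The sketch below is one possible way to print the same words with repeat; it redefines sentence1 so that it runs on its own.

```r
# Same character vector as in the examples above
sentence1 <- c("Try", "printing", "this")

# Initialize an index counter
i <- 1

repeat {
  # Leave the loop once every element has been printed
  if (i > length(sentence1)) {
    break
  }
  print(sentence1[i])
  # Move to the next index
  i <- i + 1
}
```

This prints the same three words as the while loop. A while loop is usually easier to read when the stopping condition can be written up front.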
Applications:
- processing/analyzing multiple files (look at linked tutorial)
- plotting graphs from columns in same dataframe
# e.g. Use a for loop to apply the same preprocessing steps to multiple csv files
for (filename in list.files(pattern = "\\.csv$")){ # file names ending in .csv
  # Read csv file into a variable called df
  df <- read.csv(filename)
  # Skip files that contain no data rows
  if(nrow(df) == 0){
    next
  }
  # downstream processing of df
}
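The second application, plotting graphs from columns in the same data frame, could be sketched as follows. The data frame measurements and its values are made up for illustration, and the plots are written to a temporary PDF file (an assumption made so that the example runs without a graphics window).

```r
# Hypothetical data frame, made up for illustration
measurements <- data.frame(
  height_cm = c(150, 160, 170, 180),
  weight_kg = c(55, 62, 70, 81)
)

# Send plots to a temporary PDF file
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

# Loop over the column names and draw one histogram per column
for (column_name in names(measurements)) {
  # [[ ]] extracts a single column as a vector
  hist(measurements[[column_name]], main = column_name, xlab = column_name)
}

# Close the file so the plots are written to disk
invisible(dev.off())
```

In an interactive RStudio session the pdf() and dev.off() lines can be dropped, and the histograms appear in the Plots pane.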
12.3 Vectorization
- Alternative to looping
- Vectorization is a feature of R that allows you to apply an operation to all elements of a vector at once
- This makes code shorter and more readable
- Many functions in R are vectorized internally, such as colSums(), rowMeans(), and even basic arithmetic operations (e.g. +, -, *)
e.g. Power of vectorization: Add 2 to all values
# Define a numeric vector called values
values <- c(3,5,6,10)
# In many programming languages you may have to run a loop to do this
for(value in values){
  new_value <- value + 2
  print(new_value)
}
## [1] 5
## [1] 7
## [1] 8
## [1] 12
# But R lets you add 2 to all values at once - this is called vectorization
values + 2
## [1] 5 7 8 12
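Vectorization also works between two vectors and inside functions such as colSums() and rowMeans(). A small sketch, with vectors and a matrix made up for illustration:

```r
# Hypothetical measurements, made up for illustration
lengths_cm <- c(10, 20, 30)
widths_cm <- c(2, 4, 6)

# Element-by-element multiplication, no loop needed
areas <- lengths_cm * widths_cm
areas

# A 2 x 3 matrix filled column by column: rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

# Sum of each column and mean of each row, computed in one call each
colSums(m)
rowMeans(m)
```

Here areas is 20 80 180, colSums(m) is 3 7 11 and rowMeans(m) is 3 4.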
12.4 Intro to apply
apply family functions utilize the concept of vectorization
functions: apply(), sapply(), lapply(), mapply(), rapply(), tapply(), vapply()
within these commands, you may apply a function or operation to each of the values
put the name of the function into FUN (no parentheses) or make your own
most frequently used are:
- lapply (“list” apply)
- applies a function on 1D data - list or vector
- returns a list
- note: use unlist() to convert the resulting list to vector
# e.g. print vector elements
lapply(sentence1, FUN = print)
## [1] "Try"
## [1] "printing"
## [1] "this"
## [[1]]
## [1] "Try"
##
## [[2]]
## [1] "printing"
##
## [[3]]
## [1] "this"
# e.g. add 2 to each element by making your own function
lapply(values, FUN = function(value){
  value + 2
})
## [[1]]
## [1] 5
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] 8
##
## [[4]]
## [1] 12
- apply
- apply a function to 2D data - matrix or data frame
- specify whether the function should be applied on rows or columns (using MARGIN argument: 1 = row, 2 = column)
# Read input table
df <- read.delim(file = "07-Casey-mamm-mouse-proteome-sample.txt", row.names = 1)
# Get standard deviation of columns
apply(df, MARGIN = 2, FUN = sd)
## BC_E_1 BC_E_2 BC_EP_1 BC_EP_2 LP_E_1 LP_E_2 LP_EP_1
## 216497329 250258435 175483149 189537235 359922127 271978322 212250113
## LP_EP_2 LM_E_1 LM_E_2 LM_EP_1 LM_EP_2
## 238676624 1550773037 1494197893 687720907 625922913
# Transform data to z-scores (which account for sd of each row)
z_scores <- apply(df, MARGIN = 1, FUN = function(x){ # in this function, x is a numeric vector of each row
  (x - mean(x))/sd(x)
})
# Look at first 6 rows and 6 columns
z_scores[1:6, 1:6]
## Abcb10 Abcg2 Abhd4 Abhd5 Acox1 Ada
## BC_E_1 -0.88084429 -0.8165888 0.02080524 -0.6218765 0.34686146 -0.8860501
## BC_E_2 -0.88084429 -0.8165888 0.05777697 -0.6218765 0.50786325 -0.9420582
## BC_EP_1 -0.88084429 -0.3873079 1.86485582 -0.6218765 0.01997860 -1.0392117
## BC_EP_2 -0.88084429 -0.6085926 1.26884306 -0.6218765 -0.05682232 -0.9485052
## LP_E_1 -0.05139152 1.7598003 -1.06841075 -0.5106032 -0.59603606 -0.3825228
## LP_E_2 -0.34103905 0.7103968 -1.06841075 -0.2380836 -0.13543827 0.1221950
# Note apply() has transposed our original matrix; re-transpose using t()
z_scores <- t(z_scores)
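Returning to the lapply() note about unlist(): the sketch below shows one way to get a plain vector back instead of a list. It also tries sapply() and vapply() from the list of apply family functions above. The values vector is redefined so that the example runs on its own.

```r
# Same numeric vector as in the vectorization section
values <- c(3, 5, 6, 10)

# lapply() returns a list with one element per input value
as_list <- lapply(values, FUN = function(value) value + 2)

# unlist() flattens that list into a numeric vector
as_vector <- unlist(as_list)
as_vector

# sapply() simplifies the result to a vector when it can
with_sapply <- sapply(values, FUN = function(value) value + 2)

# vapply() does the same, but checks each result against FUN.VALUE
with_vapply <- vapply(values, FUN = function(value) value + 2, FUN.VALUE = numeric(1))
```

All three of as_vector, with_sapply and with_vapply hold 5 7 8 12.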
12.5 Practice
Converting cm to in
a) Define a function called “cm_to_in”. This function takes in a numeric variable and converts it from centimeters to inches by dividing the length value by 2.54. Round to 2 decimal places using round(). Return the result.
b) Test your function on any number.
c) Make a numeric vector of 5 different values called “cm_measurements”.
d) In a loop,
- convert each value of the cm vector to inches using your function in a) and save to a variable called “in_inches”
- print “in_inches” to console using print()
e) Use the lapply() function to apply your function to each element.
f) Pass your vector into your function. What’s the result? This is called vectorization.
Solution
# a) make a function
cm_to_in <- function(value){
  # Divide value by 2.54
  result <- value/2.54
  # Round to 2 digits
  result <- round(result, digits = 2)
  return(result)
}
# b) pass in single numeric value to function
cm_to_in(45)
## [1] 17.72
# c) make a numeric vector
cm_measurements <- c(35,63,53,67,32)
# d) make for loop
for(in_cm in cm_measurements){
  # Save output of function to variable
  in_inches <- cm_to_in(in_cm)
  # Print
print(in_inches)
}
## [1] 13.78
## [1] 24.8
## [1] 20.87
## [1] 26.38
## [1] 12.6
# e) use lapply() on vector
lapply(cm_measurements, FUN = cm_to_in)
## [[1]]
## [1] 13.78
##
## [[2]]
## [1] 24.8
##
## [[3]]
## [1] 20.87
##
## [[4]]
## [1] 26.38
##
## [[5]]
## [1] 12.6
# f) pass in whole vector to function
cm_to_in(cm_measurements)
## [1] 13.78 24.80 20.87 26.38 12.60