4 Data Structures
Objective: Learn how to work with vectors and data frames in R.
We will cover:
- vector creation using c()
- Positive and negative indexing using []
- Vector functions (mean, sd, sort, max, min, etc)
- Data frames: data.frame(), how to access data [rows, cols], subset, and important functions
** Note: information about list(), factor(), and matrix() is included in this tutorial, but we will not cover these structures together.
Data structure:
- organization, management, and storage format for data
- enables efficient access and modification
Based on this tutorial: http://www.sthda.com/english/wiki/easy-r-programming-basics#basic-arithmetic-operations
4.1 Vectors
- A vector in R is a combination of multiple values of the same data type (numeric, character or logical) in the same object/variable
- each value is called an element
- also called “array”
- created using the function c() (for concatenate)
- We can give a name to the elements of a vector using the function names()
# Store cell types in a character vector
cell.types <- c("neutrophil", "NK", "macrophage", "B-cell") # create
cell.types # print
## [1] "neutrophil" "NK" "macrophage" "B-cell"
# Store the expression level in a numeric vector (arbitrary)
expr_lvls <- c(78,20,53,0)
expr_lvls
## [1] 78 20 53 0
# Store whether it is from the myeloid lineage
is_myeloid <- c(T, F, T, F)
is_myeloid
## [1] TRUE FALSE TRUE FALSE
# Name expression levels by cell type
names(expr_lvls) <- cell.types
expr_lvls
## neutrophil NK macrophage B-cell
## 78 20 53 0
# conversely, a named vector can be created as follows:
expr_lvls <- c(neutrophil = 78, NK = 20,
               macrophage = 53, B_cell = 0)
unname(expr_lvls) # remove the names
## [1] 78 20 53 0
# combine vectors
c(expr_lvls, cell.types) # converts mixed data types to a single type based on the hierarchy character > numeric > logical
##   neutrophil           NK   macrophage       B_cell
##         "78"         "20"         "53"          "0" "neutrophil"         "NK"
##
## "macrophage"     "B-cell"
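The coercion hierarchy mentioned in the comment above can be checked directly with class(). This is a minimal sketch; the vectors `mixed_num` and `mixed_chr` are made up for illustration and are not part of the tutorial data.

```r
# Combining types coerces every element to the most flexible type present:
# logical < numeric < character
mixed_num <- c(TRUE, FALSE, 3)    # logical + numeric -> numeric
mixed_chr <- c(1.5, TRUE, "NK")   # anything + character -> character

class(mixed_num)
class(mixed_chr)
mixed_num   # TRUE becomes 1, FALSE becomes 0
mixed_chr   # every element is now a string
```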
Case of missing values
- Missing values are represented by NA.
expr_lvls <- c(neutrophil = 78, NK = 20, macrophage = 53, B_cell = NA)
is.na(expr_lvls) # note: an example of vectorization, where you can apply a function to a vector as if it were just one value
## neutrophil NK macrophage B_cell
## FALSE FALSE FALSE TRUE
Other ways to create vectors:
* using : operator
* using seq() function (create a sequence)
# Make a numeric vector using :
1:4
## [1] 1 2 3 4
# Specify first and last element using seq()
seq(from = 1.2, to = 3, by=0.2) # specify step size
## [1] 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
seq(1, 5, length.out=4) # specify length of the vector
## [1] 1.000000 2.333333 3.666667 5.000000
Access elements of a vector
- Elements of a vector can be accessed using vector indexing
- the vector used for indexing can be logical, integer or character vector
Using integer vector as index
- Vector indexing in R starts at 1 (unlike many programming languages, where indexing starts at 0)
- We can use a vector of integers as index to access specific elements
- We can use negative integers to return all elements except those specified
x <- 11:20
x[3]
## [1] 13
x[c(2, 4)] # access 2nd and 4th element
## [1] 12 14
x[-1] # access all but 1st element
## [1] 12 13 14 15 16 17 18 19 20
# x[c(2, -4)] # ERROR: cannot mix positive and negative integers
x[c(2.4, 3.54)] # real numbers are truncated to integers
## [1] 12 13
NOTE: I am not saving my output into a variable so it does not modify the original “x” variable
Using logical vector as index
- When we use a logical vector for indexing, the elements at the positions where the logical vector is TRUE are returned
intensity_lvl <- c(322, 39, 234, 890)
intensity_lvl[c(TRUE, FALSE, FALSE, TRUE)] # the logical vector should have the same length as the vector being indexed
## [1] 322 890
THRESHOLD <- 300
intensity_lvl[intensity_lvl < THRESHOLD] # filtering vectors based on conditions
## [1] 39 234
intensity_lvl[intensity_lvl > 0]
## [1] 322 39 234 890
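Two details of logical indexing are easy to miss: a comparison with NA yields NA, and a logical index shorter than the vector is recycled. The sketch below illustrates both with a made-up vector `intensity`; which() is one way, among others, to keep only the positions where the condition is TRUE.

```r
# Hypothetical intensity values, one of them missing
intensity <- c(322, 39, NA, 890)

# A comparison with NA gives NA, and indexing with NA returns an NA element
intensity[intensity > 100]

# which() returns only the positions where the condition is TRUE
idx <- which(intensity > 100)
intensity[idx]

# A shorter logical index is recycled: this keeps positions 1 and 3
intensity[c(TRUE, FALSE)]
```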
Using character vector as index
- This type of indexing is useful when dealing with named vectors
expr_lvls <- c(neutrophil = 78, NK = 20, macrophage = 53, B_cell = NA)
expr_lvls["neutrophil"]
## neutrophil
## 78
expr_lvls[c("neutrophil", "NK")]
## neutrophil NK
## 78 20
Modify and delete vectors
- access specific elements and modify them using the assignment operator
- perform arithmetic and logical operations on the vector
- clear a vector by assigning NULL to it (rm() removes the variable itself)
x <- 1:4
# modify the first element
x[1] <- 15
x
## [1] 15 2 3 4
# arithmetic operator
x + 2
## [1] 17 4 5 6
x/x
## [1] 1 1 1 1
# logical operator
!c(T,F)
## [1] FALSE TRUE
# delete vector
x <- NULL
x[1]
## NULL
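Assigning NULL empties a vector but leaves the variable in place, whereas rm() removes the variable. A small sketch of the difference, using a throwaway variable `tmp_vec` that is not part of the tutorial:

```r
tmp_vec <- 1:4

tmp_vec <- NULL                     # the variable still exists, but holds NULL
still_there <- exists("tmp_vec")    # TRUE
len_after_null <- length(tmp_vec)   # 0

rm(tmp_vec)                         # rm() removes the variable itself
gone <- !exists("tmp_vec")          # TRUE
```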
Vector functions
- Some useful functions are:
x <- seq(10, 100, by = 10)
max(x) # Get the maximum value of x
## [1] 100
min(x) # Get the minimum value of x
## [1] 10
range(x) # Get the range of x (min, max)
## [1] 10 100
length(x) # Get the number of elements in x
## [1] 10
sum(x) # Get the total of the elements in x
## [1] 550
prod(x) # Get the product of the elements in x
## [1] 3.6288e+16
mean(x) # The mean value of the elements in x - sum(x)/length(x)
## [1] 55
sd(x) # Standard deviation of x
## [1] 30.2765
var(x) # Variance of x
## [1] 916.6667
sort(x) # Sort the element of x in ascending order
## [1] 10 20 30 40 50 60 70 80 90 100
Note: if you want to exclude NAs, most of these functions have a na.rm argument.
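As a brief illustration of the na.rm argument, the sketch below uses a made-up named vector `expr` with one missing value. Without na.rm = TRUE, a single NA makes the summary NA.

```r
# Hypothetical expression levels with one missing value
expr <- c(neutrophil = 78, NK = 20, macrophage = 53, B_cell = NA)

mean(expr)                 # NA: one missing value makes the result missing
mean(expr, na.rm = TRUE)   # mean of the three observed values
max(expr, na.rm = TRUE)    # largest observed value
sum(!is.na(expr))          # number of non-missing elements
```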
4.2 Data Frames
- A data frame is a table (matrix-like 2D object) where columns can have different vector types (numeric, character, logical)
- Arguably the most useful data structure in R
- Create a data frame using data.frame(), specifying columns
Some important functions
- dim(), nrow() and ncol() - return the dimensions, number of rows and columns
- summary(), str() - give you information like stats, dimensions, data types, etc in your data frame
- rownames() - retrieve or set row names of a matrix-like object
- colnames() - retrieve or set column names of a matrix-like object
- cbind() - combine R objects by columns
- rbind() - combine R objects by rows
- t() - transpose the matrix (columns become rows and vice-versa)
- rowSums() and colSums() functions: Compute the total of each row and the total of each column (when data frame is numeric)
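Several of the functions listed above are easiest to understand on an all-numeric data frame. The sketch below uses a small made-up data frame `markers` (the column and row names are hypothetical) so that rowSums(), colSums(), and t() all apply.

```r
# Hypothetical all-numeric data frame of marker intensities
markers <- data.frame(
  CD4 = c(10, 20, 30),
  CD8 = c(1, 2, 3),
  row.names = c("sample1", "sample2", "sample3")
)

dim(markers)       # rows, then columns
str(markers)       # column types and a preview of the values
summary(markers)   # per-column summary statistics
colSums(markers)   # one total per column
rowSums(markers)   # one total per row
t(markers)         # transposing a data frame returns a matrix
```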
cells_df <- data.frame(
  Name = cell.types,
  Expression = expr_lvls,
  myeloid_lineage = is_myeloid,
  Intensity1 = c(258, NA, 185, 290),
  stringsAsFactors = F # set this to F for now; factors are covered in section 4.5
)
cells_df
## Name Expression myeloid_lineage Intensity1
## neutrophil neutrophil 78 TRUE 258
## NK NK 20 FALSE NA
## macrophage macrophage 53 TRUE 185
## B_cell B-cell NA FALSE 290
# Check if it's a data frame
is.data.frame(cells_df)
## [1] TRUE
# Get dimensions
dim(cells_df)
## [1] 4 4
# Get column names
colnames(cells_df)
## [1] "Name" "Expression" "myeloid_lineage" "Intensity1"
# Rename rows
rownames(cells_df) <- paste("Cell", 1:nrow(cells_df), sep =".")
cells_df
## Name Expression myeloid_lineage Intensity1
## Cell.1 neutrophil 78 TRUE 258
## Cell.2 NK 20 FALSE NA
## Cell.3 macrophage 53 TRUE 185
## Cell.4 B-cell NA FALSE 290
# Add column/row # NOTE AGAIN: if I don't save the result into a variable, it does not modify the original data frame
cbind(cells_df, Intensity2 = c(3315, 458, 5643, 100))
## Name Expression myeloid_lineage Intensity1 Intensity2
## Cell.1 neutrophil 78 TRUE 258 3315
## Cell.2 NK 20 FALSE NA 458
## Cell.3 macrophage 53 TRUE 185 5643
## Cell.4 B-cell NA FALSE 290 100
rbind(cells_df, Cell.5 = c("mast cell", NA, T, 452))
## Name Expression myeloid_lineage Intensity1
## Cell.1 neutrophil 78 TRUE 258
## Cell.2 NK 20 FALSE <NA>
## Cell.3 macrophage 53 TRUE 185
## Cell.4 B-cell <NA> FALSE 290
## Cell.5 mast cell <NA> TRUE 452
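The `<NA>` entries in the output above indicate that the affected columns are no longer numeric: a row supplied as a vector mixing text and numbers is coerced to character before it is added. One possible way to keep the column types, sketched here on a made-up data frame `cells`, is to add the new row as a one-row data frame.

```r
cells <- data.frame(
  Name = c("neutrophil", "NK"),
  Expression = c(78, 20),
  stringsAsFactors = FALSE
)

# A character vector as the new row turns Expression into text
as_vector <- rbind(cells, c("mast cell", 45))
class(as_vector$Expression)

# A one-row data frame keeps each column's type
new_row <- data.frame(Name = "mast cell", Expression = 45,
                      stringsAsFactors = FALSE)
as_df <- rbind(cells, new_row)
class(as_df$Expression)
```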
Access and subset a data frame
Access columns and rows by indexing by name and by location
Format is dataframe[row,column] # think of your rows and columns as vectors
Access columns by dollar sign $
# Access the data in the 'Name' column
cells_df[, 1] # index by location
## [1] "neutrophil" "NK" "macrophage" "B-cell"
cells_df[, "Name"] # index by name of column
## [1] "neutrophil" "NK" "macrophage" "B-cell"
cells_df$Name # access using $
## [1] "neutrophil" "NK" "macrophage" "B-cell"
# Access the data for Cell 2
cells_df["Cell.2",]
## Name Expression myeloid_lineage Intensity1
## Cell.2 NK 20 FALSE NA
- Subset using logical expressions, positive indexing (specify which columns/rows to keep), and negative indexing (specify which columns/rows to exclude)
- Subset using the subset() function
- Selecting more than one column returns a data frame; selecting a single column with [rows, column] returns a vector (selecting rows, even a single one, returns a data frame)
- Modify the same way as vectors (specify which rows/columns value to access and use assignment to assign new values)
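The return type of a subset can be checked with is.data.frame(). The sketch below uses a made-up data frame `cells` and also shows the drop = FALSE argument of `[`, which keeps a single selected column as a data frame.

```r
cells <- data.frame(
  Name = c("neutrophil", "NK", "macrophage"),
  Expression = c(78, 20, 53),
  stringsAsFactors = FALSE
)

one_col  <- cells[, "Expression"]                 # single column: plain vector
two_cols <- cells[, c("Name", "Expression")]      # several columns: data frame
one_row  <- cells[2, ]                            # single row: still a data frame
kept_df  <- cells[, "Expression", drop = FALSE]   # drop = FALSE keeps the data frame

is.data.frame(one_col)
is.data.frame(two_cols)
is.data.frame(one_row)
is.data.frame(kept_df)
```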
# Subset by selecting first 3 rows (both lines of code do the same thing)
cells_df[c(1,2,3), ]
## Name Expression myeloid_lineage Intensity1
## Cell.1 neutrophil 78 TRUE 258
## Cell.2 NK 20 FALSE NA
## Cell.3 macrophage 53 TRUE 185
cells_df[1:3, ]
## Name Expression myeloid_lineage Intensity1
## Cell.1 neutrophil 78 TRUE 258
## Cell.2 NK 20 FALSE NA
## Cell.3 macrophage 53 TRUE 185
# Subset by using character vector
parameters <- c("Expression", "Intensity1")
cells_df2 <- cells_df[,parameters]
log2(cells_df2) # perform functions depending on data type
## Expression Intensity1
## Cell.1 6.285402 8.011227
## Cell.2 4.321928 NA
## Cell.3 5.727920 7.531381
## Cell.4 NA 8.179909
# log2(cells_df) #ERROR: log2 only works when all columns numeric
# Subset by selecting the rows that meet the condition (indexing keeps a row of NAs where the condition is NA; subset() drops it)
cells_df[cells_df$Expression >= 25, ]
## Name Expression myeloid_lineage Intensity1
## Cell.1 neutrophil 78 TRUE 258
## Cell.3 macrophage 53 TRUE 185
## NA <NA> NA NA NA
subset(cells_df, subset = Expression >= 25)
## Name Expression myeloid_lineage Intensity1
## Cell.1 neutrophil 78 TRUE 258
## Cell.3 macrophage 53 TRUE 185
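The extra NA row appears because the condition itself is NA for the row with a missing Expression value. A minimal sketch, on a made-up data frame `cells`, of excluding NA explicitly so that bracket indexing matches subset():

```r
cells <- data.frame(
  Name = c("neutrophil", "NK", "macrophage", "B-cell"),
  Expression = c(78, 20, 53, NA),
  stringsAsFactors = FALSE
)

# [ ] keeps a row of NAs wherever the condition itself is NA
with_na_row <- cells[cells$Expression >= 25, ]
nrow(with_na_row)

# Excluding NA explicitly matches what subset() returns
keep <- !is.na(cells$Expression) & cells$Expression >= 25
no_na_row <- cells[keep, ]
nrow(no_na_row)
```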
# Reassign all NA to 0
cells_df[is.na(cells_df)] <- 0
# Can add and remove columns using $
cells_df$Intensity2 <- c(3315, 458, 5643, 100) # add column, alternative to cbind()
cells_df
## Name Expression myeloid_lineage Intensity1 Intensity2
## Cell.1 neutrophil 78 TRUE 258 3315
## Cell.2 NK 20 FALSE 0 458
## Cell.3 macrophage 53 TRUE 185 5643
## Cell.4 B-cell 0 FALSE 290 100
cells_df$Intensity2 <- NULL # remove column
4.3 Practice
You are gathering information about your family members (alternatively, your friends or coworkers).
a) Make a vector of their names.
b) Who is the first person you wrote down? (i.e. Get the first element)
c) Make a vector of their ages (in same order as part a)).
d) Make a logical vector indicating whether each person is a kid (TRUE/FALSE).
e) Make a data frame of your family members with column names: Name, Age, Is_Kid.
f) Sort the data frame by name in alphabetical order. The output should be saved to a data frame variable. (Look this up if you don’t know how to!)
g) Subset your data frame so only rows of the members that are children show (do not save as variable).
h) Subset your data frame so only rows of the members that are older than 20 years old show (do not save as variable).
i) Add 1 to the ages of all your members in one line of code.
j) Remove the Is_Kid column.
Solution:
# a) Make a vector using c()
names <- c("Tinky Winky", "Dipsy", "Laa Laa", "Po")
# b) Use positive indexing
names[1]
## [1] "Tinky Winky"
# c) Make a vector using c()
ages <- c(21, 10, 10, 2)
# d) Make a logical vector using c()
is_kid <- c(F,T,T,T)
# e) Make a data frame using data.frame()
family <- data.frame(Name = names, Age = ages, Is_Kid = is_kid)
family
## Name Age Is_Kid
## 1 Tinky Winky 21 FALSE
## 2 Dipsy 10 TRUE
## 3 Laa Laa 10 TRUE
## 4 Po 2 TRUE
# f) https://www.r-bloggers.com/r-sorting-a-data-frame-by-the-contents-of-a-column/
sorted_family <- family[order(family$Name),]
# g) Logical vector or subset()
sorted_family[sorted_family$Is_Kid,]
## Name Age Is_Kid
## 2 Dipsy 10 TRUE
## 3 Laa Laa 10 TRUE
## 4 Po 2 TRUE
subset(sorted_family, subset = Is_Kid)
## Name Age Is_Kid
## 2 Dipsy 10 TRUE
## 3 Laa Laa 10 TRUE
## 4 Po 2 TRUE
# h) subset using logical vector
sorted_family[sorted_family$Age > 20,]
## Name Age Is_Kid
## 1 Tinky Winky 21 FALSE
subset(sorted_family, subset = Age > 20)
## Name Age Is_Kid
## 1 Tinky Winky 21 FALSE
# i) modify the Age column only
sorted_family$Age <- sorted_family$Age + 1
# j) Using negative indexing or assigning column to NULL
sorted_family$Is_Kid <- NULL
# Or sorted_family <- sorted_family[, -which(colnames(sorted_family) == "Is_Kid")]
4.4 Lists
- A list is another data structure
- It is an ordered collection of objects, which can be vectors, matrices, data frames, etc. In other words, a list can contain all kinds of R objects.
- List elements can be accessed by $Name_of_Element or [[index_of_element]]
# Create a list
# Elements can be any type and structure, including vectors and data frames
my_family <- list(
  mother = "Veronique",
  father = "Michel",
  sisters = c("Alicia", "Monica"),
  sister_age = c(12, 22)
)
# Print
my_family
## $mother
## [1] "Veronique"
##
## $father
## [1] "Michel"
##
## $sisters
## [1] "Alicia" "Monica"
##
## $sister_age
## [1] 12 22
# Names of elements in the list
names(my_family)
## [1] "mother" "father" "sisters" "sister_age"
# Number of elements in the list
length(my_family)
## [1] 4
# Subset a list - select element by its name or its index
# Select by name (1/2)
my_family$father
## [1] "Michel"
# Select by name (2/2)
my_family[["father"]]
## [1] "Michel"
# Select a specific element of a component
# select the first ([1]) element of my_family[[3]]
my_family[["sisters"]][1]
## [1] "Alicia"
# Add to list
my_family$brother <- "Toby"
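Single brackets also work on lists, but they return a smaller list rather than the element itself. The sketch below contrasts the two on a made-up list `my_list`, and also shows that assigning NULL to an element removes it.

```r
# Hypothetical list mixing element lengths
my_list <- list(mother = "Veronique", sisters = c("Alicia", "Monica"))

single <- my_list["sisters"]     # [ ] returns a list containing the element
double <- my_list[["sisters"]]   # [[ ]] returns the element itself

class(single)
class(double)
length(single)   # one list element
length(double)   # two names in the character vector

my_list$mother <- NULL   # assigning NULL removes an element from a list
names(my_list)
```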
4.5 Factors
- Factor variables represent categories or groups in your data. The function factor() can be used to create a factor variable.
- R orders factor levels alphabetically, so if you want to redefine the order, do it in the factor() function call
# Create a factor variable
friend_groups <- factor(c(1, 2, 1, 2))
friend_groups
## [1] 1 2 1 2
## Levels: 1 2
# Get group names (or levels)
levels(friend_groups)
## [1] "1" "2"
# Change levels
levels(friend_groups) <- c("best_friend", "not_best_friend")
friend_groups
## [1] best_friend not_best_friend best_friend not_best_friend
## Levels: best_friend not_best_friend
# Change the order of levels
friend_groups <- factor(friend_groups,
                        levels = c("not_best_friend", "best_friend"))
# Print
friend_groups
## [1] best_friend not_best_friend best_friend not_best_friend
## Levels: not_best_friend best_friend
# Check if friend_groups is a factor
is.factor(friend_groups)
## [1] TRUE
# Convert a character vector to a factor
as.factor(c("A", "B", "D"))
## [1] A B D
## Levels: A B D
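To illustrate the point about level order, the sketch below creates the same factor with the default alphabetical levels and with an explicit `levels` argument. The vector `dose` and its categories are made up for illustration.

```r
# Hypothetical dose categories
dose <- c("low", "high", "medium", "low")

default_order <- factor(dose)   # levels sorted alphabetically
custom_order  <- factor(dose, levels = c("low", "medium", "high"))

levels(default_order)
levels(custom_order)
table(custom_order)        # counts follow the level order
as.integer(custom_order)   # underlying integer codes
```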
4.6 Matrices
- A matrix is a table containing multiple rows and columns of vectors with the same type, which can be either numeric, character or logical.
- To create a matrix easily, use the function cbind() or rbind(); many of the functions used for data frames also work on matrices
- Convert to data frame using as.data.frame()
# Numeric vectors
col1 <- c(5, 6, 7, 8, 9)
col2 <- c(2, 4, 5, 9, 8)
col3 <- c(7, 3, 4, 8, 7)
# Combine the vectors by column
my_data <- cbind(col1, col2, col3)
my_data
## col1 col2 col3
## [1,] 5 2 7
## [2,] 6 4 3
## [3,] 7 5 4
## [4,] 8 9 8
## [5,] 9 8 7
# Change rownames
rownames(my_data) <- c("row1", "row2", "row3", "row4", "row5")
# Transpose
t(my_data)
## row1 row2 row3 row4 row5
## col1 5 6 7 8 9
## col2 2 4 5 9 8
## col3 7 3 4 8 7
# Dimensions
ncol(my_data) # Number of columns
## [1] 3
nrow(my_data) # Number of rows
## [1] 5
dim(my_data) # Number of rows and columns
## [1] 5 3
# Subset by positive indexing
my_data[2:4, ] # Select row number 2 to 4
## col1 col2 col3
## row2 6 4 3
## row3 7 5 4
## row4 8 9 8
my_data[c(2,4), ] # rows 2 and 4 but not 3
## col1 col2 col3
## row2 6 4 3
## row4 8 9 8
my_data[, "col2"] # Select column 2 by its name "col2"
## row1 row2 row3 row4 row5
## 2 4 5 9 8
# Exclude rows/columns by negative indexing
my_data[, -1] # Exclude column 1
## col2 col3
## row1 2 7
## row2 4 3
## row3 5 4
## row4 9 8
## row5 8 7
# Perform simple operations on a matrix
log2(my_data)
## col1 col2 col3
## row1 2.321928 1.000000 2.807355
## row2 2.584963 2.000000 1.584963
## row3 2.807355 2.321928 2.000000
## row4 3.000000 3.169925 3.000000
## row5 3.169925 3.000000 2.807355
my_data*3
## col1 col2 col3
## row1 15 6 21
## row2 18 12 9
## row3 21 15 12
## row4 24 27 24
## row5 27 24 21
- You may also construct a matrix using the matrix() function
mdat <- matrix(data = c(1,2,3, 11,12,13),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("row1", "row2"), c("C.1", "C.2", "C.3")))
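The byrow argument controls the order in which matrix() fills in the values. The sketch below compares both fill orders on the same values and then converts the result with as.data.frame(). The variable names are made up for illustration.

```r
values <- c(1, 2, 3, 11, 12, 13)

by_row <- matrix(values, nrow = 2, byrow = TRUE)   # fill row by row
by_col <- matrix(values, nrow = 2)                 # default: fill column by column

by_row
by_col

# Convert to a data frame and name the columns
mat_df <- as.data.frame(by_row)
colnames(mat_df) <- c("C.1", "C.2", "C.3")
mat_df
```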