Human Genomics

This section provides a brief introduction to fundamental R operations. Follow each step carefully, and ensure you understand what each command does and why it works as expected.

A calculator

You can use R as a powerful calculator to perform standard arithmetic operations using the following mathematical operators:

1 + 3   # Addition

## [1] 4

9 - 3   # Subtraction

## [1] 6

4 * 9   # Multiplication

## [1] 36

20 / 5  # Division

## [1] 4

R follows the conventional mathematical order of operations:

Parentheses,
Exponents,
Multiplication/Division,
Addition/Subtraction.

4 / 2^2 + 2 * 2

## [1] 5

Explanation:

The exponent is calculated first: 2^2 = 4.
Then, division and multiplication are performed from left to right: 4 / 4 = 1 and 2 * 2 = 4.
Finally, addition: 1 + 4 = 5.

Thus, the result of the expression is 5.

You can modify the default calculation order by using parentheses to explicitly define how the operations should be performed:

(((4/2)^2)+2)*2

## [1] 12

Breakdown of the calculation:

4 / 2 = 2 ,
2^2 = 4 ,
4 + 2 = 6 ,
6 * 2 = 12 .

Assigning objects to variables

In R, an object is a fundamental data structure that consists of two key components: data (the actual values stored) and a name (an identifier used to reference the data). Essentially, objects allow you to store, manipulate, and retrieve data efficiently within your R environment.

For example, you can create an object that holds the numbers from 1 to 5 and assign it a name for easy reference:

my_numbers <- c(1, 2, 3, 4, 5)

In this example:

my_numbers is the name of the object.
c(1, 2, 3, 4, 5) is the data, representing a sequence of numbers.

Once the object is created, you can reference it by its name to access or manipulate the stored data:

print(my_numbers)  # Outputs: 1 2 3 4 5

## [1] 1 2 3 4 5

To create an object in In R, you use the assignment operator <- (preferred) or = to assign values to a named variable. For example:

x <- 10  # Assigns the value 10 to the object named 'x'
y = 20   # Assigns the value 20 to the object named 'y' (alternative syntax)

The arrow (<-) is generally preferred for several reasons related to readability, convention, and potential confusion with function arguments.

Readability and Clarity
- The arrow (<-) visually indicates the flow of data from right to left, making it clear that a value is being assigned to an object.
Consistency with R Language Conventions
- The arrow (<-) is the standard assignment operator in R and is used consistently in official documentation, tutorials, and best practices.
- Experienced R programmers and packages predominantly use <-, making it the conventional and expected choice.
- Following this convention ensures consistency and avoids confusion when collaborating with others.
Avoiding Confusion with Function Arguments
- In function calls, = is used to specify argument values. If used for assignment, it can lead to unintended errors or confusion.
Flexibility and Pipe Operator Compatibility
- The tidyverse ecosystem (e.g., dplyr and ggplot2) uses %>% (the pipe operator) heavily, and <- provides better compatibility and readability in such pipelines, e.g., data <- mtcars %>% filter(mpg > 20)

R provides several types of objects to store different kinds of data, including: - Vectors – A sequence of elements of the same type (e.g., numeric, character). - Matrices – A two-dimensional collection of elements. - Data Frames – A table-like structure where columns can hold different data types. - Lists – A flexible container that can hold multiple types of elements.

vec <- c(1, 2, 3)               # Vector
mat <- matrix(1:9, nrow=3)      # Matrix
df <- data.frame(a=1:3, b=4:6)  # Data Frame
lst <- list(name="Alice", age=25, scores=c(80, 90, 85))  # List

Once an object is created, you can:

View its contents using functions like print() or simply typing its name.
Modify it by assigning new values.
Remove it from memory using rm(object_name).

Example of modifying an object:

my_numbers <- c(10, 20, 30)  # Redefining the object with new values
rm(my_numbers)               # Deleting the object

In R, the same result can often be achieved using multiple approaches. A common example is generating a vector containing the numbers from 1 to 10. R provides several ways to accomplish this, each with its own advantages in terms of readability and flexibility.

The c() function, which stands for combine, allows you to manually create a vector by specifying each individual number. This method provides explicit control over the elements in the vector but can be cumbersome for long sequences.

c(1,2,3,4,5,6,7,8,9,10)

##  [1]  1  2  3  4  5  6  7  8  9 10

The : operator provides a concise way to generate a sequence of consecutive integers by specifying the starting and ending values.

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

The seq() function provides greater flexibility for generating sequences by allowing customization of the start, end, and step size.

seq(from=1, to=10, by=1)

##  [1]  1  2  3  4  5  6  7  8  9 10

You can adjust the step size to create sequences with increments other than 1, such as:

seq(from = 1, to = 10, by = 2)

## [1] 1 3 5 7 9

You can then do operations on the entire vector at once. For example, multiple by 2.

x <- 1:10

x*2

##  [1]  2  4  6  8 10 12 14 16 18 20

Introduction to object classes

In R, data is stored in objects, and each object belongs to a specific class, which determines how the data is structured and what operations can be performed on it. Understanding object classes is crucial for efficient data manipulation and analysis in R. The different object types in R have unique ways to access their elements. Understanding these indexing methods allows efficient data retrieval and modification.

Vectors

A vector is the most basic data structure in R, representing a sequence of elements of the same type. Vectors can hold numeric, character, logical, or complex values.

numeric_vec <- c(1, 2, 3.5, 4)
class(numeric_vec)

char_vec <- c("apple", "banana", "cherry")
class(char_vec)  # "character"

Vectors are one-dimensional sequences of elements, and elements can be accessed using square brackets [ ] with an index number.

v <- c(10, 20, 30, 40, 50)

# Access the second element
v[2]  # 20

# Access multiple elements
v[c(1, 3, 5)]  # 10 30 50

# Access elements using logical indexing
v[v > 25]  # 30 40 50

# Exclude elements using negative indexing
v[-3]  # 10 20 40 50

Factors

Factors are categorical variables used to represent discrete data with a fixed number of possible values (levels). Factors are useful for handling data that falls into distinct categories, such as gender, colors, or responses in surveys.

colors <- factor(c("red", "blue", "red", "green"))
class(colors)  # "factor"
levels(colors)  # "blue" "green" "red"

You can manually specify the levels and their order when creating a factor.

education <- factor(c("High School", "Bachelor", "Master", "PhD", "Bachelor"),
                    levels = c("High School", "Bachelor", "Master", "PhD"))

# Print the factor and check its levels
print(education)

## [1] High School Bachelor    Master      PhD         Bachelor   
## Levels: High School Bachelor Master PhD

levels(education)

## [1] "High School" "Bachelor"    "Master"      "PhD"

When the categories have a logical order (e.g., low to high), you can create ordered factors using the ordered = TRUE argument.

satisfaction <- factor(c("Low", "Medium", "High", "Medium", "Low"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

print(satisfaction)

## [1] Low    Medium High   Medium Low   
## Levels: Low < Medium < High

Matrices

A matrix is a two-dimensional array where all elements must be of the same type.

matrix_obj <- matrix(1:9, nrow=3, ncol=3)
class(matrix_obj)  # "matrix"

Elements in a matrix can be accessed using row and column indices within [ , ]. Use [row, column] to access specific values.

m <- matrix(1:9, nrow = 3, ncol = 3)

# Access element in the 2nd row, 3rd column
m[2, 3]  # 6

## [1] 8

# Access entire 1st row
m[1, ]  # 1 4 7

## [1] 1 4 7

# Access entire 2nd column
m[, 2]  # 4 5 6

## [1] 4 5 6

Data frames

A data frame is a table-like structure where each column can contain different types of data (numeric, character, etc.). It is one of the most commonly used data structures for handling datasets in R.

df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Score = c(95.5, 89))
class(df)  # "data.frame"

## [1] "data.frame"

Data frames are similar to tables and can be indexed using row and column indices, column names, row names, or the $ operator.

df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Score = c(95, 85))

# Access a single column by name
df$Age  # 25 30

## [1] 25 30

# Access the first row, second column
df[1, 2]  # 25

## [1] 25

# Access all rows for 'Score' column
df[, "Score"]  # 95 85

## [1] 95 85

# Filter rows based on condition
df[df$Age > 26, ]  # Returns Bob's data

##   Name Age Score
## 2  Bob  30    85

Lists

A list is a flexible data structure that can store elements of different types, including vectors, matrices, data frames, and even other lists.

my_list <- list(name = "Alice", age = 25, scores = c(90, 85, 88))
class(my_list)  # "list"

Elements in lists are accessed using double square brackets [[ ]] or the dollar sign $.

lst <- list(name = "Alice", age = 25, scores = c(90, 80, 85))

# Access by index
lst[[1]]  # "Alice"

## [1] "Alice"

# Access by name
lst$age  # 25

## [1] 25

# Access nested elements
lst$scores[2]  # 80

## [1] 80

Arrays

Arrays are similar to matrices but can have more than two dimensions. They are used for multi-dimensional data storage.

array_obj <- array(1:12, dim = c(2, 3, 2))
class(array_obj)  # "array"

Arrays are multi-dimensional objects, and elements can be accessed using multiple indices within [ , , ].

a <- array(1:12, dim = c(2, 3, 2))

# Access element from the 1st row, 2nd column, 1st matrix
a[1, 2, 1]  # 3

## [1] 3

# Access the entire 2nd matrix
a[ , , 2]

##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

Introduction to mathematical operations on matrices

Just like vectors, matrices in R allow for straightforward mathematical operations. Since matrices are essentially two-dimensional arrays of numbers, you can perform various element-wise and matrix-specific operations easily. Let’s first create a matrix using the matrix() function.

# Create a 3x3 matrix with numbers 1 to 9
m <- matrix(1:9, nrow = 3, ncol = 3)

# Print the matrix
print(m)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Operations like addition, subtraction, multiplication, and division can be applied element-wise in matrices, just as in vectors.

# Element-wise addition
m + 2  

# Element-wise subtraction
m - 3  

# Element-wise multiplication
m * 2

# Element-wise division
m / 2

##      [,1] [,2] [,3]
## [1,]  0.5  2.0  3.5
## [2,]  1.0  2.5  4.0
## [3,]  1.5  3.0  4.5

If you have a vector, R automatically applies operations to each row or column of the matrix.

v <- c(1, 2, 3)

# Add vector to each column 
m + v

##      [,1] [,2] [,3]
## [1,]    2    5    8
## [2,]    4    7   10
## [3,]    6    9   12

# Multiply vector with each row
m * v

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    4   10   16
## [3,]    9   18   27

You can perform addition and subtraction on matrices of the same dimensions.

A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)

# Matrix addition
A + B

##      [,1] [,2]
## [1,]    6   10
## [2,]    8   12

Linear algebra can be extremely useful for improving the speed on a function. One common operation is matrix multiplication, which is done using the %*% operator, not the * operator, which performs element-wise multiplication.

A <- matrix(1:4,ncol=2,nrow=2,byrow=T)
print(A)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

A*A

##      [,1] [,2]
## [1,]    1    4
## [2,]    9   16

The true matrix multiplication is defined as: \[\begin{equation} \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ \end{bmatrix} \times \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^2 (A_{1,j}\times A_{j,1} ) & \sum_{j=1}^2 (A_{1,j}\times A_{j,2} ) \\ \sum_{j=1}^2 (A_{2,j}\times A_{j,1} ) & \sum_{j=1}^2 (A_{2,j}\times A_{j,2} ) \\ \end{bmatrix} \end{equation}\]

Thus, for the small example that would be: \[\begin{equation} \begin{bmatrix} 1 & 3 \\ 2 & 4 \\ \end{bmatrix} \times \begin{bmatrix} 1 & 3 \\ 2 & 4 \\ \end{bmatrix} = \begin{bmatrix} (1 \times 1)+(2 \times 3) & (1 \times 2)+(2 \times 4) \\ (3 \times 1)+(4 \times 3) & (3 \times 2)+(4 \times 4) \\ \end{bmatrix} = \begin{bmatrix} 7 & 10 \\ 15 & 22 \\ \end{bmatrix} \end{equation}\]

A%*%A

##      [,1] [,2]
## [1,]    7   10
## [2,]   15   22

The transpose of a matrix (switching rows and columns) is done using the t() function.

t(A)

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

You can compute the determinant (thus, if det(A) !=0 the matrix is invertible) and inverse of a square matrix using the det() and solve() functions, respectively.

# Determinant of matrix A
det(A)

## [1] -2

# Inverse of matrix A
solve(A)

##      [,1] [,2]
## [1,] -2.0  1.0
## [2,]  1.5 -0.5

You can compute row-wise and column-wise summaries easily using functions like rowSums() and colSums().

# Sum of each row
rowSums(m)

## [1] 12 15 18

# Sum of each column
colSums(m)

## [1]  6 15 24

Other functions like rowMeans() and colMeans() calculate averages.

You can apply custom functions to rows or columns using the apply() function,

# Calculate maximum value in each row
apply(m, 1, max)

## [1] 7 8 9

# Calculate sum of each column
apply(m, 2, sum)

## [1]  6 15 24

where the second argument specifies the direction (1 for rows, 2 for columns).

Using and writing functions

There exists many pre-defined functions within R, but you can also create your own. For example, you can calculate the mean (average) and standard deviation (SD) of a set of numbers using both built-in functions and custom user-defined functions.

Suppose that we have measured the height for $n=5$ individuals:

y <- c(173, 184, 168, 177,187) # vector of observations
n <- length(y)                 # store how many observations we have

The mean ($\bar{x}$) height is the sum of the values divided by the number of observations, \[\bar{x} = \frac{1}{n} \sum_{i=1}^n \textrm{y}_i = \frac{173 + 184 + 168 + 177 + 187}{5}=177.8\] We can use the buit-in function mean() to compute it:

# Calculate mean
mean_value <- mean(y)
print(mean_value)

## [1] 177.8

We can define our own function for calculating the mean of height:

calculate_mean <- function(x) {
  sum(x) / length(x)
}

# Test the function
calculate_mean(y)

## [1] 177.8

To have an idea how much the values of height vary around the mean we can use standard deviation, that is, the square root of variance. The variance is the average squared deviation from the mean. When variance is computed from a sample of size $n$, whose mean is first estimated from that same sample, then the denominator in variance calculation is $n-1$ rather than $n$, so \[\begin{align} \textrm{variance} &= \frac{1}{n-1}\sum_{i=1}^n(\textrm{y}_i-\bar{x})^2 \notag \\ &= \frac{(173-177.8)^2 + (184-177.8)^2 + (168-177.8)^2 + (177-177.8)^2 + (187-177.8)^2}{4} \notag \\ &= 60.7 \notag \\ \end{align}\]

This can be done as:

sum((y - mean(y))^2) / (n - 1)

## [1] 60.7

or using the built-in var()-function.

var(y)

## [1] 60.7

Making figures in R

R provides powerful tools for creating visualizations to explore and present data effectively. There are several ways to make plots in R, ranging from basic built-in plotting functions to advanced graphics packages like ggplot2.

Basic Plots Using Base R

Base R offers simple functions to create quick visualizations. The most commonly used plotting function is plot(), which can generate scatter plots, line plots, and more.

# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 12, 15)

# Create a scatter plot
plot(x, y, 
     main = "Basic Scatter Plot", 
     xlab = "X-axis", 
     ylab = "Y-axis", 
     col = "blue", 
     pch = 16)

Key Parameters:

main adds a title to the plot.
xlab and ylab labels the axes.
col sets color for points.
pch specifies point style (e.g., circles, squares).

Using type = "l" creates a line plot instead of points, where the argument lwd adjusts line width.

plot(x, y, type = "l", col = "red", lwd = 2, main = "Line Plot")

Other common type of plots include bar plot and histogram:

# Sample data
categories <- c("A", "B", "C", "D")
values <- c(10, 15, 7, 12)

# Create a bar plot
barplot(values, 
        names.arg = categories, 
        col = "lightblue", 
        main = "Bar Plot", 
        xlab = "Categories", 
        ylab = "Values")

# Generate random data
data <- rnorm(100)

# Create a histogram
hist(data, 
     col = "lightgreen", 
     main = "Histogram", 
     xlab = "Values", 
     breaks = 10)

The breaks-argument in hist() defines the number of bins.

Advanced plots using ggplot2

The ggplot2 package provides a more advanced and flexible way to create plots using the grammar of graphics approach. To use ggplot2, install and load the package first:

install.packages("ggplot2")  # Install package if not already installed; only do once
library(ggplot2)             # Load the package

# Sample data
df <- data.frame(x = 1:5, y = c(3, 7, 8, 12, 15))

# Create a scatter plot
ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 3) +
  ggtitle("ggplot2 Scatter Plot") +
  xlab("X-axis") + 
  ylab("Y-axis")

ggplot(df, aes(x = x, y = y)) +
  geom_line(color = "red", size = 1) +
  ggtitle("ggplot2 Line Plot")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

category_data <- data.frame(category = c("A", "B", "C", "D"),
                            value = c(10, 15, 7, 12))

ggplot(category_data, aes(x = category, y = value, fill = category)) +
  geom_bar(stat = "identity") +
  ggtitle("ggplot2 Bar Plot")

ggplot(data.frame(data), aes(x = data)) +
  geom_histogram(fill = "lightblue", bins = 15) +
  ggtitle("ggplot2 Histogram")

Both Base R and ggplot2 allow customizations.

ggplot(df, aes(x, y)) +
  geom_point(color = "darkgreen", size = 4) +
  theme_minimal() +
  labs(title = "Customized ggplot2 Plot", x = "X values", y = "Y values")