This section provides a brief introduction to fundamental R operations. Follow each step carefully, and ensure you understand what each command does and why it works as expected.
You can use R as a powerful calculator to perform standard arithmetic operations using the following mathematical operators:
1 + 3 # Addition
## [1] 4
9 - 3 # Subtraction
## [1] 6
4 * 9 # Multiplication
## [1] 36
20 / 5 # Division
## [1] 4
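Beyond the four operators above, base R also has operators for exponentiation, integer division, and remainders. The sketch below is an illustrative addition with arbitrarily chosen numbers; the variable names are hypothetical.

```r
# Additional arithmetic operators in base R (illustrative values)
power     <- 2^3       # exponentiation: 2 * 2 * 2
quotient  <- 17 %/% 5  # integer division: how many whole times 5 fits into 17
remainder <- 17 %% 5   # modulo: what is left over after integer division

print(c(power = power, quotient = quotient, remainder = remainder))
```

Here `power` is 8, `quotient` is 3, and `remainder` is 2.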
R follows the conventional mathematical order of operations:
4 / 2^2 + 2 * 2
## [1] 5
Explanation: 2^2 = 4, then 4 / 4 = 1 and 2 * 2 = 4, and finally 1 + 4 = 5. Thus, the result of the expression is 5.
You can modify the default calculation order by using parentheses to explicitly define how the operations should be performed:
(((4/2)^2)+2)*2
## [1] 12
Breakdown of the calculation: 4 / 2 = 2, 2^2 = 4, 4 + 2 = 6, 6 * 2 = 12.
In R, an object is a fundamental data structure that consists of two key components: data (the actual values stored) and a name (an identifier used to reference the data). Essentially, objects allow you to store, manipulate, and retrieve data efficiently within your R environment.
For example, you can create an object that holds the numbers from 1 to 5 and assign it a name for easy reference:
my_numbers <- c(1, 2, 3, 4, 5)
In this example:
- my_numbers is the name of the object.
- c(1, 2, 3, 4, 5) is the data, representing a sequence of numbers.
Once the object is created, you can reference it by its name to access or manipulate the stored data:
print(my_numbers) # Outputs: 1 2 3 4 5
## [1] 1 2 3 4 5
To create an object in R, you use the assignment operator <- (preferred) or = to assign values to a named variable. For example:
x <- 10 # Assigns the value 10 to the object named 'x'
y = 20 # Assigns the value 20 to the object named 'y' (alternative syntax)
The arrow (<-) is generally preferred for several reasons related to readability, convention, and potential confusion with function arguments:
- The arrow (<-) visually indicates the flow of data from right to left, making it clear that a value is being assigned to an object.
- The arrow (<-) is the standard assignment operator in R and is used consistently in official documentation, tutorials, and best practices.
- Most existing R code uses <-, making it the conventional and expected choice.
- In function calls, = is used to specify argument values. If used for assignment, it can lead to unintended errors or confusion.
- The tidyverse (e.g., dplyr and ggplot2) uses %>% (the pipe operator) heavily, and <- provides better compatibility and readability in such pipelines, e.g.,
library(dplyr) # provides %>% and filter()
data <- mtcars %>% filter(mpg > 20)
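The difference between = and <- inside a function call can be seen in a small experiment. The sketch below is an illustrative addition; `check_assignment` is a hypothetical helper, and it runs inside a function so that it does not depend on objects that already exist in your workspace.

```r
# Hypothetical helper: does an object named 'x' exist after each kind of call?
check_assignment <- function() {
  mean(x = c(2, 4, 6))                            # '=' only names the argument of mean()
  after_equals <- exists("x", inherits = FALSE)   # no local object 'x' was created

  mean(x <- c(2, 4, 6))                           # '<-' assigns first, then passes the value
  after_arrow <- exists("x", inherits = FALSE)    # a local object 'x' now exists

  c(after_equals = after_equals, after_arrow = after_arrow)
}

check_assignment()
```

The first element of the result is FALSE and the second is TRUE: inside a call, = labels an argument, while <- creates an object as a side effect.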
R provides several types of objects to store different kinds of data, including:
- Vectors: A sequence of elements of the same type (e.g., numeric, character).
- Matrices: A two-dimensional collection of elements.
- Data Frames: A table-like structure where columns can hold different data types.
- Lists: A flexible container that can hold multiple types of elements.
vec <- c(1, 2, 3) # Vector
mat <- matrix(1:9, nrow=3) # Matrix
df <- data.frame(a=1:3, b=4:6) # Data Frame
lst <- list(name="Alice", age=25, scores=c(80, 90, 85)) # List
Once an object is created, you can:
- display it with print() or by simply typing its name;
- modify it by assigning new values to its name;
- remove it with rm(object_name).
Example of modifying an object:
my_numbers <- c(10, 20, 30) # Redefining the object with new values
rm(my_numbers) # Deleting the object
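The full life cycle of an object (create, modify, inspect, remove) can be sketched as follows. This is an illustrative addition; the object name `temps` and its values are hypothetical.

```r
temps <- c(10, 20, 30)   # create the object
temps[2] <- 25           # replace only the second element
temps <- c(temps, 40)    # append a value by reassigning the name
print(temps)             # 10 25 30 40

rm(temps)                # remove the object from the environment
temps_exists <- exists("temps", inherits = FALSE)
print(temps_exists)      # FALSE: the name no longer refers to anything
```

After rm(), typing the name again would raise an "object not found" error, which is why the sketch checks with exists() instead.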
In R, the same result can often be achieved using multiple approaches. A common example is generating a vector containing the numbers from 1 to 10. R provides several ways to accomplish this, each with its own advantages in terms of readability and flexibility.
The c()
function, which stands for combine,
allows you to manually create a vector by specifying each individual
number. This method provides explicit control over the elements in the
vector but can be cumbersome for long sequences.
c(1,2,3,4,5,6,7,8,9,10)
## [1] 1 2 3 4 5 6 7 8 9 10
The :
operator provides a concise way to generate a
sequence of consecutive integers by specifying the starting and ending
values.
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
The seq()
function provides greater flexibility for
generating sequences by allowing customization of the start, end, and
step size.
seq(from=1, to=10, by=1)
## [1] 1 2 3 4 5 6 7 8 9 10
You can adjust the step size to create sequences with increments other than 1, such as:
seq(from = 1, to = 10, by = 2)
## [1] 1 3 5 7 9
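seq() can also be told how many values to produce instead of the step size, and it can count downwards. The following sketch is an illustrative addition with hypothetical variable names.

```r
# Fix the number of values rather than the step size
evenly_spaced <- seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00

# A negative step produces a decreasing sequence
countdown <- seq(from = 10, to = 1, by = -3)            # 10 7 4 1

# seq_len(n) is a safe way to write 1:n
indices <- seq_len(4)                                   # 1 2 3 4

print(evenly_spaced)
print(countdown)
print(indices)
```

seq_len() is often preferred over 1:n in loops because seq_len(0) is an empty vector, whereas 1:0 counts down and gives 1 0.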
You can then perform operations on the entire vector at once. For example, multiply every element by 2:
x <- 1:10
x*2
## [1] 2 4 6 8 10 12 14 16 18 20
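Other vectorized operations follow the same pattern. The sketch below is an illustrative addition (variable names are hypothetical) and also shows recycling, where a shorter vector is repeated to match the longer one.

```r
x <- 1:10

squares <- x^2              # every element squared
evens   <- x[x %% 2 == 0]   # keep only elements divisible by 2
total   <- sum(x)           # a single summary value: 55
shifted <- x + c(0, 100)    # c(0, 100) is recycled: 1 102 3 104 ...

print(squares)
print(evens)
print(total)
print(shifted)
```

Recycling is silent when the longer length is a multiple of the shorter one, as it is here, so it is worth checking vector lengths when results look surprising.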
In R, data is stored in objects, and each object belongs to
a specific class, which determines how the data is structured and what
operations can be performed on it. Understanding object classes is
crucial for efficient data manipulation and analysis in R. The different
object types in R have unique ways to access their elements.
Understanding these indexing methods allows efficient data retrieval and
modification.
A vector is the most basic data structure in R, representing a sequence of elements of the same type. Vectors can hold numeric, character, logical, or complex values.
numeric_vec <- c(1, 2, 3.5, 4)
class(numeric_vec) # "numeric"
char_vec <- c("apple", "banana", "cherry")
class(char_vec) # "character"
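Because all elements of a vector must share one type, R silently converts mixed input to the most general type. The sketch below is an illustrative addition with hypothetical variable names.

```r
mixed_text    <- c(1, "a")        # the number becomes the string "1"
mixed_logical <- c(1, TRUE)       # TRUE becomes 1
only_logical  <- c(TRUE, FALSE)
whole_numbers <- c(1L, 2L)        # the L suffix requests integers

class(mixed_text)     # "character"
class(mixed_logical)  # "numeric"
class(only_logical)   # "logical"
class(whole_numbers)  # "integer"
```

This coercion is a common source of surprises, for example when a numeric column contains a single text entry.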
Vectors are one-dimensional sequences of elements, and elements can
be accessed using square brackets [ ]
with an index
number.
v <- c(10, 20, 30, 40, 50)
# Access the second element
v[2] # 20
# Access multiple elements
v[c(1, 3, 5)] # 10 30 50
# Access elements using logical indexing
v[v > 25] # 30 40 50
# Exclude elements using negative indexing
v[-3] # 10 20 40 50
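The same bracket notation works for naming elements and for replacing values. This sketch is an illustrative addition; the element names are hypothetical labels.

```r
v <- c(10, 20, 30, 40, 50)
names(v) <- c("a", "b", "c", "d", "e")  # attach a name to each element

third <- v["c"]        # access by name: 30
v[v > 25] <- 0         # replace every element larger than 25
zero_positions <- which(v == 0)

print(third)
print(v)               # a = 10, b = 20, c = 0, d = 0, e = 0
print(zero_positions)  # positions 3, 4, and 5
```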
Factors are categorical variables used to represent discrete data with a fixed number of possible values (levels). Factors are useful for handling data that falls into distinct categories, such as gender, colors, or responses in surveys.
colors <- factor(c("red", "blue", "red", "green"))
class(colors) # "factor"
levels(colors) # "blue" "green" "red"
You can manually specify the levels and their order when creating a factor.
education <- factor(c("High School", "Bachelor", "Master", "PhD", "Bachelor"),
levels = c("High School", "Bachelor", "Master", "PhD"))
# Print the factor and check its levels
print(education)
## [1] High School Bachelor Master PhD Bachelor
## Levels: High School Bachelor Master PhD
levels(education)
## [1] "High School" "Bachelor" "Master" "PhD"
When the categories have a logical order (e.g., low to high), you can
create ordered factors using the ordered = TRUE
argument.
satisfaction <- factor(c("Low", "Medium", "High", "Medium", "Low"),
levels = c("Low", "Medium", "High"),
ordered = TRUE)
print(satisfaction)
## [1] Low Medium High Medium Low
## Levels: Low < Medium < High
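Ordered factors support counting and comparisons that respect the level order. The sketch below is an illustrative addition that recreates the satisfaction factor so it can run on its own.

```r
satisfaction <- factor(c("Low", "Medium", "High", "Medium", "Low"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

counts <- table(satisfaction)                 # Low 2, Medium 2, High 1
codes <- as.integer(satisfaction)             # underlying level codes: 1 2 3 2 1
at_least_medium <- satisfaction >= "Medium"   # FALSE TRUE TRUE TRUE FALSE

print(counts)
print(codes)
print(at_least_medium)
```

The comparison with >= is only defined for ordered factors; on an unordered factor it returns NA with a warning.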
A matrix is a two-dimensional array where all elements must be of the same type.
matrix_obj <- matrix(1:9, nrow=3, ncol=3)
class(matrix_obj) # "matrix" "array" (in R 4.0.0 and later)
Elements in a matrix can be accessed using row and column indices within [ , ]. Use [row, column] to access specific values.
m <- matrix(1:9, nrow = 3, ncol = 3)
# Access element in the 2nd row, 3rd column
m[2, 3] # 8
## [1] 8
# Access entire 1st row
m[1, ] # 1 4 7
## [1] 1 4 7
# Access entire 2nd column
m[, 2] # 4 5 6
## [1] 4 5 6
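Two details are easy to miss when indexing matrices: matrix() fills values column by column unless told otherwise, and extracting a single row or column drops the matrix structure. The sketch below is an illustrative addition with hypothetical variable names.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)            # filled column by column

sub_block <- m[1:2, 2:3]                        # rows 1-2 of columns 2-3
row_vec   <- m[1, ]                             # a plain vector: dimensions are dropped
row_mat   <- m[1, , drop = FALSE]               # stays a 1 x 3 matrix

by_row <- matrix(1:9, nrow = 3, byrow = TRUE)   # filled row by row instead

print(sub_block)
print(dim(row_mat))    # 1 3
print(m[2, 3])         # 8 with column-wise filling
print(by_row[2, 3])    # 6 with row-wise filling
```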
A data frame is a table-like structure where each column can contain different types of data (numeric, character, etc.). It is one of the most commonly used data structures for handling datasets in R.
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Score = c(95.5, 89))
class(df) # "data.frame"
## [1] "data.frame"
Data frames are similar to tables and can be indexed using row and
column indices, column names, row names, or the $
operator.
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Score = c(95, 85))
# Access a single column by name
df$Age # 25 30
## [1] 25 30
# Access the first row, second column
df[1, 2] # 25
## [1] 25
# Access all rows for 'Score' column
df[, "Score"] # 95 85
## [1] 95 85
# Filter rows based on condition
df[df$Age > 26, ] # Returns Bob's data
## Name Age Score
## 2 Bob 30 85
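Data frames can also be extended and summarized with the same indexing tools. The sketch below is an illustrative addition; the `Passed` column and its threshold of 90 are hypothetical.

```r
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Score = c(95, 85))

df$Passed <- df$Score >= 90                    # add a logical column
older <- df[df$Age > 26, c("Name", "Score")]   # rows by condition, columns by name
mean_age <- mean(df[["Age"]])                  # [[ ]] extracts a column as a vector

print(dim(df))      # 2 rows, 4 columns
print(older)
print(mean_age)     # 27.5
```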
A list is a flexible data structure that can store elements of different types, including vectors, matrices, data frames, and even other lists.
my_list <- list(name = "Alice", age = 25, scores = c(90, 85, 88))
class(my_list) # "list"
Elements in lists are accessed using double square brackets [[ ]] or the dollar sign $.
lst <- list(name = "Alice", age = 25, scores = c(90, 80, 85))
# Access by index
lst[[1]] # "Alice"
## [1] "Alice"
# Access by name
lst$age # 25
## [1] 25
# Access nested elements
lst$scores[2] # 80
## [1] 80
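Single brackets behave differently from double brackets on a list: [ ] returns a smaller list, while [[ ]] returns the element itself. The sketch below is an illustrative addition; the `city` element is a hypothetical value.

```r
lst <- list(name = "Alice", age = 25, scores = c(90, 80, 85))

single <- lst[1]      # a list of length 1 that contains the name
double <- lst[[1]]    # the character string itself

lst$city <- "Oslo"          # add a new element
lst[["scores"]][2] <- 82    # modify a value inside a nested vector

class(single)   # "list"
class(double)   # "character"
names(lst)      # "name" "age" "scores" "city"
lst$scores      # 90 82 85
```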
Arrays are similar to matrices but can have more than two dimensions. They are used for multi-dimensional data storage.
array_obj <- array(1:12, dim = c(2, 3, 2))
class(array_obj) # "array"
Arrays are multi-dimensional objects, and elements can be accessed using multiple indices within [ , , ].
a <- array(1:12, dim = c(2, 3, 2))
# Access element from the 1st row, 2nd column, 1st matrix
a[1, 2, 1] # 3
## [1] 3
# Access the entire 2nd matrix
a[ , , 2]
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
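Summaries along one dimension of an array can be computed with apply(). The sketch below is an illustrative addition with hypothetical variable names.

```r
a <- array(1:12, dim = c(2, 3, 2))

last_value   <- a[2, 3, 2]         # 2nd row, 3rd column, 2nd matrix: 12
slice_totals <- apply(a, 3, sum)   # one sum per matrix along the 3rd dimension

print(dim(a))          # 2 3 2
print(last_value)
print(slice_totals)    # 21 57
```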
Just like vectors, matrices in R allow for straightforward
mathematical operations. Since matrices are essentially two-dimensional
arrays of numbers, you can perform various element-wise and
matrix-specific operations easily. Let’s first create a matrix using the
matrix()
function.
# Create a 3x3 matrix with numbers 1 to 9
m <- matrix(1:9, nrow = 3, ncol = 3)
# Print the matrix
print(m)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Operations like addition, subtraction, multiplication, and division can be applied element-wise in matrices, just as in vectors.
# Element-wise addition
m + 2
# Element-wise subtraction
m - 3
# Element-wise multiplication
m * 2
# Element-wise division
m / 2
## [,1] [,2] [,3]
## [1,] 0.5 2.0 3.5
## [2,] 1.0 2.5 4.0
## [3,] 1.5 3.0 4.5
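When you combine a matrix with a vector, R recycles the vector down the columns of the matrix, because matrices are stored column by column. A vector whose length equals the number of rows is therefore applied to every column.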
v <- c(1, 2, 3)
# Add vector to each column
m + v
## [,1] [,2] [,3]
## [1,] 2 5 8
## [2,] 4 7 10
## [3,] 6 9 12
# Multiply each column element-wise by the vector (row i is scaled by v[i])
m * v
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 4 10 16
## [3,] 9 18 27
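In both outputs above, element v[i] is paired with row i of every column. If the goal is instead to pair v[j] with column j, one option is sweep(), another is to transpose twice. The sketch below is an illustrative addition with hypothetical variable names.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
v <- c(1, 2, 3)

# Add v[j] to every element of column j
added_by_sweep     <- sweep(m, 2, v, "+")
added_by_transpose <- t(t(m) + v)

print(added_by_sweep)
print(all.equal(added_by_sweep, added_by_transpose))   # TRUE
```

The first row of the result is 2 6 10, compared with 2 5 8 for the plain m + v shown earlier.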
You can perform addition and subtraction on matrices of the same dimensions.
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)
# Matrix addition
A + B
## [,1] [,2]
## [1,] 6 10
## [2,] 8 12
Linear algebra can be extremely useful for improving the speed of a function. One common operation is matrix multiplication, which is done using the %*% operator, not the * operator, which performs element-wise multiplication.
A <- matrix(1:4, ncol = 2, nrow = 2, byrow = TRUE)
print(A)
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
A*A
## [,1] [,2]
## [1,] 1 4
## [2,] 9 16
The true matrix multiplication is defined as: \[\begin{equation} \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ \end{bmatrix} \times \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^2 (A_{1,j}\times A_{j,1} ) & \sum_{j=1}^2 (A_{1,j}\times A_{j,2} ) \\ \sum_{j=1}^2 (A_{2,j}\times A_{j,1} ) & \sum_{j=1}^2 (A_{2,j}\times A_{j,2} ) \\ \end{bmatrix} \end{equation}\]
Thus, for the small example that would be: \[\begin{equation} \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} \times \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} = \begin{bmatrix} (1 \times 1)+(2 \times 3) & (1 \times 2)+(2 \times 4) \\ (3 \times 1)+(4 \times 3) & (3 \times 2)+(4 \times 4) \\ \end{bmatrix} = \begin{bmatrix} 7 & 10 \\ 15 & 22 \\ \end{bmatrix} \end{equation}\]
A%*%A
## [,1] [,2]
## [1,] 7 10
## [2,] 15 22
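The summation formula can be checked by computing each entry with an explicit loop. This is a minimal sketch for illustration only (in practice %*% is the tool to use); `product` is a hypothetical variable name.

```r
A <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
n <- nrow(A)

# Entry (i, j) is the sum over k of A[i, k] * A[k, j]
product <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    product[i, j] <- sum(A[i, ] * A[, j])
  }
}

print(product)
print(all.equal(product, A %*% A))   # TRUE
```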
The transpose of a matrix (switching rows and columns) is done using
the t()
function.
t(A)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
You can compute the determinant and the inverse of a square matrix using the det() and solve() functions, respectively. A matrix is invertible only if its determinant is non-zero (det(A) != 0).
# Determinant of matrix A
det(A)
## [1] -2
# Inverse of matrix A
solve(A)
## [,1] [,2]
## [1,] -2.0 1.0
## [2,] 1.5 -0.5
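A quick way to check an inverse is to multiply it with the original matrix, which should give the identity matrix. solve() can also solve a linear system directly when given a right-hand side. The sketch below is an illustrative addition; the vector `b` is a hypothetical right-hand side.

```r
A <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)

A_inv <- solve(A)
identity_check <- all.equal(A %*% A_inv, diag(2))   # TRUE up to rounding error

b <- c(5, 11)              # hypothetical right-hand side of A x = b
x_solution <- solve(A, b)  # solves the system without forming the inverse

print(identity_check)
print(x_solution)          # 1 2
```

all.equal() is used instead of == because floating-point results are rarely exactly equal.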
You can compute row-wise and column-wise summaries easily using functions like rowSums() and colSums().
# Sum of each row
rowSums(m)
## [1] 12 15 18
# Sum of each column
colSums(m)
## [1] 6 15 24
Other functions like rowMeans()
and
colMeans()
calculate averages.
You can apply custom functions to rows or columns using the
apply()
function,
# Calculate maximum value in each row
apply(m, 1, max)
## [1] 7 8 9
# Calculate sum of each column
apply(m, 2, sum)
## [1] 6 15 24
where the second argument specifies the direction (1
for
rows, 2
for columns).
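apply() also accepts a function you write yourself, including an anonymous one defined inline. The sketch below is an illustrative addition that computes the range (maximum minus minimum) of each row and column; the variable names are hypothetical.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

# Range of each row (direction 1) and each column (direction 2)
row_range <- apply(m, 1, function(r) max(r) - min(r))
col_range <- apply(m, 2, function(cl) max(cl) - min(cl))

print(row_range)   # 6 6 6
print(col_range)   # 2 2 2
```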
R has many predefined functions, but you can also create your own. For example, you can calculate the mean (average) and standard deviation (SD) of a set of numbers using both built-in functions and custom user-defined functions.
Suppose that we have measured the height for \(n=5\) individuals:
y <- c(173, 184, 168, 177,187) # vector of observations
n <- length(y) # store how many observations we have
The mean (\(\bar{x}\)) height is the sum of the values
divided by the number of observations, \[\bar{x} = \frac{1}{n} \sum_{i=1}^n \textrm{y}_i =
\frac{173 + 184 + 168 + 177 + 187}{5}=177.8\] We can use the
built-in function mean() to compute it:
# Calculate mean
mean_value <- mean(y)
print(mean_value)
## [1] 177.8
We can define our own function for calculating the mean of height:
calculate_mean <- function(x) {
sum(x) / length(x)
}
# Test the function
calculate_mean(y)
## [1] 177.8
To get an idea of how much the heights vary around the mean, we can use the standard deviation, that is, the square root of the variance. The variance is the average squared deviation from the mean. When the variance is computed from a sample of size \(n\) whose mean is first estimated from that same sample, the denominator in the variance calculation is \(n-1\) rather than \(n\), so \[\begin{align} \textrm{variance} &= \frac{1}{n-1}\sum_{i=1}^n(\textrm{y}_i-\bar{x})^2 \notag \\ &= \frac{(173-177.8)^2 + (184-177.8)^2 + (168-177.8)^2 + (177-177.8)^2 + (187-177.8)^2}{4} \notag \\ &= 60.7 \notag \\ \end{align}\]
This can be done as:
sum((y - mean(y))^2) / (n - 1)
## [1] 60.7
or using the built-in var() function.
var(y)
## [1] 60.7
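The standard deviation follows directly as the square root of the variance. The sketch below is an illustrative addition; `calculate_sd` is a hypothetical user-defined function written in the same style as calculate_mean, and it is compared with the built-in sd().

```r
y <- c(173, 184, 168, 177, 187)   # the height observations used above

# Sample standard deviation with the n - 1 denominator
calculate_sd <- function(x) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - 1))
}

custom_sd  <- calculate_sd(y)
builtin_sd <- sd(y)

print(custom_sd)                          # the square root of 60.7, about 7.79
print(all.equal(custom_sd, builtin_sd))   # TRUE
```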
R provides powerful tools for creating visualizations to explore and present data effectively. There are several ways to make plots in R, ranging from basic built-in plotting functions to advanced graphics packages like ggplot2.
Base R offers simple functions to create quick visualizations. The most commonly used plotting function is plot(), which can generate scatter plots, line plots, and more.
# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 12, 15)
# Create a scatter plot
plot(x, y,
main = "Basic Scatter Plot",
xlab = "X-axis",
ylab = "Y-axis",
col = "blue",
pch = 16)
Key Parameters:
- main adds a title to the plot.
- xlab and ylab label the axes.
- col sets the color of the points.
- pch specifies the point style (e.g., circles, squares).
Using type = "l" creates a line plot instead of points, where the argument lwd adjusts the line width.
plot(x, y, type = "l", col = "red", lwd = 2, main = "Line Plot")
Other common types of plots include bar plots and histograms:
# Sample data
categories <- c("A", "B", "C", "D")
values <- c(10, 15, 7, 12)
# Create a bar plot
barplot(values,
names.arg = categories,
col = "lightblue",
main = "Bar Plot",
xlab = "Categories",
ylab = "Values")
# Generate random data
data <- rnorm(100)
# Create a histogram
hist(data,
col = "lightgreen",
main = "Histogram",
xlab = "Values",
breaks = 10)
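When given a single number, the breaks argument in hist() sets the approximate number of bins. R treats this number as a suggestion and may choose a nearby value that gives round break points.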
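To see which bins hist() actually chose, you can ask it to return the histogram as an object without drawing it. The sketch below is an illustrative addition; the seed and the variable names are hypothetical choices made so that the example is reproducible.

```r
set.seed(42)            # hypothetical seed, only for reproducibility
values <- rnorm(100)

# plot = FALSE returns the computed histogram instead of drawing it
h <- hist(values, breaks = 10, plot = FALSE)

n_bins <- length(h$counts)   # number of bins R actually used
print(n_bins)
print(h$breaks)              # the bin boundaries (one more than the number of bins)
print(sum(h$counts))         # 100: every observation falls into exactly one bin
```

Comparing `n_bins` with the requested value of 10 shows how closely R followed the suggestion for a particular data set.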
The ggplot2 package provides a more advanced and flexible way to create plots using the grammar of graphics approach. To use ggplot2, install and load the package first:
install.packages("ggplot2") # Install package if not already installed; only do once
library(ggplot2) # Load the package
# Sample data
df <- data.frame(x = 1:5, y = c(3, 7, 8, 12, 15))
# Create a scatter plot
ggplot(df, aes(x = x, y = y)) +
geom_point(color = "blue", size = 3) +
ggtitle("ggplot2 Scatter Plot") +
xlab("X-axis") +
ylab("Y-axis")
ggplot(df, aes(x = x, y = y)) +
geom_line(color = "red", linewidth = 1) + # linewidth replaces size for lines in ggplot2 3.4.0 and later
ggtitle("ggplot2 Line Plot")
category_data <- data.frame(category = c("A", "B", "C", "D"),
value = c(10, 15, 7, 12))
ggplot(category_data, aes(x = category, y = value, fill = category)) +
geom_bar(stat = "identity") +
ggtitle("ggplot2 Bar Plot")
ggplot(data.frame(data), aes(x = data)) +
geom_histogram(fill = "lightblue", bins = 15) +
ggtitle("ggplot2 Histogram")
Both Base R and ggplot2
allow
customizations.
ggplot(df, aes(x, y)) +
geom_point(color = "darkgreen", size = 4) +
theme_minimal() +
labs(title = "Customized ggplot2 Plot", x = "X values", y = "Y values")