In these practical sessions, we will analyze quantitative traits in a mouse population. The dataset includes phenotypic measurements related to growth and obesity, such as body weight and blood glucose levels, along with genetic marker data.
The M16 mouse was developed as an outbred model for early-onset polygenic obesity and diabesity. This was achieved through 27 generations of selective breeding for weight gain between 3 and 6 weeks of age, starting from an outbred ICR base population. The breeding strategy involved selecting the male and female with the highest weight gain within each litter. A parallel ICR control line was maintained through random mating while preserving a similar effective population size. Compared to ICR mice, M16 mice are consistently larger at all ages and exhibit increased body fat percentage, fat cell size and number, as well as larger organ weights. They also display hyperphagia, moderate obesity, and metabolic abnormalities, including hyperglycemia, hyperinsulinemia, and hypercholesterolemia.
The ICR mouse is an albino strain originally derived from SWISS mice and selectively bred by Dr. Hauschka to establish a highly fertile line. Named after the Institute of Cancer Research (ICR) in the USA, where it was developed, this strain has been widely distributed for research. ICR mice are relatively large, docile, and grow well, making them a widely used general-purpose model in various research fields, including toxicity studies, pharmacology, drug efficacy, and immunology.
A large F2 population (n=1181) was established by crossing the M16 and ICR lines. Twelve F1 families resulted from six pair matings of M16 males x ICR females and six pair matings of the reciprocal cross. A total of 55 F1 dams were mated to 11 F1 sires in sets of five F1 full sisters mated to the same F1 sire. These same specific matings were repeated in three consecutive replicates. Thus, the F2 population consisted of 55 full-sib families of up to 24 individuals each and 11 sire families families of up to 120 individuals each. Actual numbers of mice within families varied slightly due to a small number of failed pregnancies. All litters were standardized at birth to eight pups, with approximately equal representation of males and females.
More information about the mouse data can be found in the following publications:
Allan, M.F., Eisen, E.J. and Pomp, D. (2004). The M16 Mouse: An Outbred Animal Model of Early Onset Polygenic Obesity and Diabesity. Obesity Research, 12: 1397-1407. https://doi.org/10.1038/oby.2004.176
Allan, M. F., Eisen E. J and Pomp, D. (2005). Genomic Mapping of Direct and Correlated Responses to Long-Term Selection for Rapid Growth Rate in Mice. Genetics, 170(4): 1863–1877. https://doi.org/10.1534/genetics.105.041319
In this practical we will estimate genetic parameters (heritability)
for quantitative traits observed in the F2 mouse population. We will be
using the REML method. This method allow for estimation of genetic
parameters using phenotypic information for individuals from a general
pedigree. REML is based on linear mixed model methodology and uses a
likelihood approach to estimate genetic parameters. The REML method also
require us to calculate an genetic relationship matrix using a recursive
algorithm. These methods and algorithms are implemented in the R package
qgg
.
This package provides an infrastructure for efficient processing of large-scale genetic and phenotypic data including core functions for:
We will also be using the qgg package for the remaining practicals.
You can install qgg from CRAN with:
install.packages("qgg")
library(qgg) # R package used for REML analysis
#install.packages("corrplot")
library(corrplot)
The mouse data set can be loaded using the following command:
mouse <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mouseqtl.rds"))
The mouse pedigree is loaded in a similar way using the following command:
pedigree <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousepedigree.rds"))
Use the str
function to get a fast overview of the
pedigree you are working.
Answer:
str(pedigree)
## 'data.frame': 1267 obs. of 6 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sire : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dam : int 0 0 0 0 0 0 0 0 0 0 ...
## $ family : Factor w/ 68 levels "0/0","1/2","11/12",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : chr "Male" "Female" "Male" "Female" ...
## $ generation: chr "M6" "IC" "IC" "M6" ...
Answer:
nrow(pedigree)
## [1] 1267
dim(pedigree)
## [1] 1267 6
Use the table
function on the generation variable.
Answer:
table(pedigree$generation)
##
## F1 F2 IC M6
## 66 1177 12 12
The REML analysis require us to calculate the genetic relationship matrix \(A\). This is done using information about the id, mother, and father which is available in our pedigree data file.
To illustrate this step we will first calculate it for a small part of the mouse pedigree. We are given the following pedigree and we want to compute the matrix \(A\).
family <- c(13,14,84,1244,1248)
pedigree[family,]
## id sire dam family sex generation
## 13 13 0 0 0/0 Male IC
## 14 14 0 0 0/0 Female M6
## 84 84 13 14 13/14 Female F1
## 1244 1244 78 84 78/84 Female F2
## 1248 1248 78 84 78/84 Male F2
The additive genetic relationship (\(A_{ij}\)) between the various sources (j) and the individual itself, i.e. the candidate to be evaluated (i), can be seen in the table below.
Relative | \(A_{i,j}\) |
---|---|
Self | 1.0 |
Unrelated | 0 |
Mother | 0.5 |
Father | 0.5 |
Grandparent | 0.25 |
Half-sib | 0.25 |
Full-sib | 0.5 |
Progeny | 0.5 |
Next we will compute the genetic relationship matrix for the entire
mouse pedigree. The matrix \(A\) can be
computed using a recursive algorithm implemented in the function
grm
from the qgg package. Use the command below to compute
the genetic relationship matrix for the mouse pedigree:
A <- grm(pedigree=pedigree)
Answer:
dim(A)
## [1] 1267 1267
The number of rows and columns should be equal to the number of individuals in the pedigree. Check the first 5 individuals in the matrix using the following command:
A[1:5,1:5]
## 1 2 3 4 5
## 1 1 0 0 0 0
## 2 0 1 0 0 0
## 3 0 0 1 0 0
## 4 0 0 0 1 0
## 5 0 0 0 0 1
Answer:
Previously we have determined the genetic relationship matrix for a small part of the mouse pedigree. We can extract the corresponding elements from the \(A\) matrix for the entire mouse pedigree using the following command:
ids <- c(13,14,84,1244,1248)
A[ids,ids]
## 13 14 84 1244 1248
## 13 1.00 0.00 0.5 0.25 0.25
## 14 0.00 1.00 0.5 0.25 0.25
## 84 0.50 0.50 1.0 0.50 0.50
## 1244 0.25 0.25 0.5 1.00 0.50
## 1248 0.25 0.25 0.5 0.50 1.00
Answer:
Make a plot of the genetic relationship matrix using the
corrplot
function from the corrplot R package:
corrplot(A, method="color", bg="white", col= colorRampPalette(c("white", "#62AFD7FF", "#233253FF"))(10) ,
tl.pos="n", outline=FALSE, xlab=FALSE, ylab=FALSE, is.corr=FALSE)
Answer:
The next step is to prepare the linear mixed model for the mouse
data. Recall that the linear mixed model contains the observation vector
for the trait(s) of interest (\(y\)),
the fixed effects
that explain systematic differences in
\(y\), and the
random genetic effects
\(a\) and random residual effects \(e\).
A matrix formulation of a general model equation is: \[\begin{align} y &= Xb + a + e \notag \end{align}\]
where \[\begin{align} y &: \text{is the vector of observed values of the trait,} \notag \\ b &: \text{is a vector of fixed effects,} \notag \\ a &: \text{is a vector of random genetic effects,} \notag \\ e &: \text{is a vector of random residual effects,} \notag \\ X &: \text{is a known design matrix that relates the elements of b to their corresponding element in y.} \notag \end{align}\]
In the statistical model (specified above) the random effects (\(a\) and \(e\)) and the phenotypes (\(y\)) are considered to be random variables
which follow a multivariate normal distribution: In general terms the
expectations of these random variables are:
\[\begin{align}
E(y) &= Xb \notag \\
E(a) &= 0 \notag \\
E(e) &= 0 \notag \\
\end{align}\] and the variance-covariance matrices are: \[\begin{align}
Var(a) &= A\sigma_a^2 \notag \\
Var(e) &= I\sigma_e^2 \notag \\
Var(y) &= A\sigma_a^2 + I\sigma_e^2 \notag
\end{align}\]
where \(A\sigma_a^2\), and \(I\sigma_e^2\) are square matrices of genetic and residual (co)variances among the individuals, respectively. In the previous section we have allready constructed the genetic relationship matrix \(A\).
In order to perform the REML analysis we need to construct \(y\) and \(X\) from the mouse data. Let us just have a quick look at the mouse data again:
str(mouse)
## 'data.frame': 1177 obs. of 8 variables:
## $ sire : Factor w/ 11 levels "25","28","34",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ dam : Factor w/ 55 levels "26","27","29",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 2 2 1 1 ...
## $ reps : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ Gl : num 187 136 115 125 112 190 169 159 111 89 ...
## $ BW : num 36.6 33.3 42.1 37.1 38.4 ...
## $ M227 : Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 1 2 1 2 2 ...
## $ M1139: Factor w/ 3 levels "AA","AB","BB": 3 NA 1 1 1 2 3 3 2 2 ...
Here we will estimate the heritability for body weight. The vector of observed trait values for body weight can be extracted from the mouse data as follows:
y <- mouse[,"BW"]
Let us explore the trait values using the head
,
tail
and summary
functions:
head(y)
## [1] 36.65 33.29 42.07 37.15 38.39 39.82
tail(y)
## [1] 39.67 39.35 44.80 52.23 47.63 54.10
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.04 34.06 38.32 38.72 43.40 60.28
To make the \(X\) matrix we need to decide which variables we should include as fixed effects in the model. We have sex, reps, sire, dam, M227 and M1139 in the mouse data frame.
Answer:
The model.matrix
function can be used to construct the
\(X\) matrix in the linear mixed model
specified above:
X <- model.matrix(BW ~ sex + reps, data=mouse)
We can use the head
and tail
functions to
look at the \(X\) matrix:
head(X)
## (Intercept) sexMale reps2 reps3
## 91 1 0 0 0
## 92 1 0 0 0
## 93 1 0 0 0
## 94 1 0 0 0
## 95 1 1 0 0
## 96 1 1 0 0
tail(X)
## (Intercept) sexMale reps2 reps3
## 1262 1 0 0 1
## 1263 1 0 0 1
## 1264 1 1 0 1
## 1265 1 1 0 1
## 1266 1 1 0 1
## 1267 1 1 0 1
The goal of the REML analysis to estimate the parameters (i.e. variance components \(\sigma_{a}^2\) and \(\sigma_{e}^2\)) in the linear mixed model specified above. In this analysis we find the set of parameters which maximizes the likelihood of the data, i.e., the probability of observations given the model and its parameter estimates: \(p(y|\hat{b}, \hat{\sigma}^2_{a}, \hat{\sigma}^2_{e})\).
The input required the vector of observed values of the trait (\(y\)), the deisgn matrix for the fixed effects (\(X\)), and the genetic relationship matrix (\(A\)). The \(A\) matrix calculated previously include genetic relationships for all individuals in the pedigree. However only a subset of the inviduals have phenotypes recorded for body weight. Therefore we need to subset the \(A\) matrix as shown in the R code below:
ids <- rownames(X)
A <- A[ids,ids]
The REML method is implemented in the greml
function
from the “qgg” package. The REML analysis is done using the following
command:
fit <- greml(y=y,X=X, GRM=list(A=A))
The fit object (i.e., output from the greml
function)
contains estimates of variance components, fixed and random effects,
first and second derivatives of log-likelihood, and the asymptotic
standard deviation of parameter estimates.
Our main interest is the variance components \(\sigma_a^2\) and \(\sigma_e^2\) which are in the
fit$theta
slot of the fit. The following commands extract
and makes a barplot of the estimates of the variance components:
fit$theta
## A E
## 6.569611 13.384147
barplot(fit$theta)
The first element in the theta
vector is the estimate of
the additive genetic variance (\(\hat{\sigma}^2_{a}\)) and the second
element is the estimate of the residual variance (\(\hat{\sigma}^2_{e}\)).
Va <- fit$theta[1]
Ve <- fit$theta[2]
Va
## A
## 6.569611
Ve
## E
## 13.38415
From the REML estimate of the variance components, the heritability can easily be computed by: \[\begin{align} h^2 &= \sigma^2_{a}/(\sigma^2_a+\sigma^2_e) \end{align}\]
Answer:
Va/(Va+Ve)
## A
## 0.3292418
In the experiment the mice were feed ad libitum. Now we want to perform a simlar experiment where mice are reared under restricted feed intake, We will record phenotypes for body weight and blood glucose levels and use mice from the same F2 population.
Answer:
Answer:
y <- mouse[,"Gl"]
X <- model.matrix(Gl ~ sex + reps, data=mouse)
ids <- rownames(X)
A <- grm(pedigree=pedigree)
A <- A[ids,ids]
fit <- greml(y=y,X=X, GRM=list(A=A))
Va <- fit$theta[1]
Ve <- fit$theta[2]
Va/(Va+Ve)
## A
## 0.4058474
In this practical we will estimate additive genetic values for
quantitative traits in the mouse population. We will be using the BLUP
method. This method allow for estimation of additive genetic values
using phenotypic information for individuals from a general pedigree.
BLUP is based on linear mixed model methodology and estimates of
additive genetic values can be obtained by solving the mixed model
equations. The BLUP method also require a genetic relationship matrix
and estimates of variance components (e.g., \(\sigma_a^2\) and \(\sigma_e^2\)). These methods and algorithms
are implemented in the R package qgg
introduced
previously.
Use the following code to load the qgg package:
library(qgg) # R package used for REML/BLUP analysis
The mouse data and pedigree set can be loaded using the following commands:
mouse <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mouseqtl.rds"))
pedigree <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousepedigree.rds"))
First let us have a quick look at the mouse data again. Use the
str
function to get a fast overview of the pedigree you are
working.
str(pedigree)
## 'data.frame': 1267 obs. of 6 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sire : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dam : int 0 0 0 0 0 0 0 0 0 0 ...
## $ family : Factor w/ 68 levels "0/0","1/2","11/12",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : chr "Male" "Female" "Male" "Female" ...
## $ generation: chr "M6" "IC" "IC" "M6" ...
The number of individuals and generations in the pedigree can be found using the following commands:
nrow(pedigree)
## [1] 1267
dim(pedigree)
## [1] 1267 6
table(pedigree$generation)
##
## F1 F2 IC M6
## 66 1177 12 12
The genetic relationship matrix \(A\) is used for estimating additive genetic
values. The matrix \(A\) can be
computed using the recursive algorithm implemented in the function
grm
from the qgg package. Use the command below to compute
the genetic relationship matrix for the mouse pedigree:
A <- grm(pedigree=pedigree)
The dimension of the genetic relationship matrix can be determined using the following command:
dim(A)
## [1] 1267 1267
The number of rows and columns should be equal to the number of individuals in the pedigree.
The next step is to prepare the linear mixed model for the mouse
data. Recall that the linear mixed model contains the observation vector
for the trait(s) of interest (\(y\)),
the fixed effects
that explain systematic differences in
\(y\), and the
random genetic effects
\(a\) and random residual effects \(e\).
A matrix formulation of a general model equation is: \[\begin{align} y &= Xb + a + e \notag \end{align}\]
where \[\begin{align} y &: \text{is the vector of observed values of the trait,} \notag \\ b &: \text{is a vector of fixed effects,} \notag \\ a &: \text{is a vector of random genetic effects,} \notag \\ e &: \text{is a vector of random residual effects,} \notag \\ X &: \text{is a known design matrix that relates the elements of b to their corresponding element in y.} \notag \end{align}\]
In the statistical model (specified above) the random effects (\(a\) and \(e\)) and the phenotypes (\(y\)) are considered to be random variables
which follow a multivariate normal distribution. In general terms the
expectations of these random variables are:
\[\begin{align}
a \sim MVN(0,A\sigma_a^2) \notag \\
e \sim MVN(0,I\sigma_e^2) \notag \\
y \sim MVN(Xb,V) \notag \\
\end{align}\] where \(A\sigma_a^2\), and \(I\sigma_e^2\) are square matrices of
genetic and residual (co)variances among the individuals, respectively,
and \(V=A\sigma_a^2+I\sigma_e^2\) is
the overall phenotypic covariance matrix. In the previous section we
have already constructed the genetic relationship matrix \(A\).
In order to specify the linear mixed model we need to construct \(y\) and \(X\) from the mouse data. Let us just have a quick look at the mouse data again:
str(mouse)
## 'data.frame': 1177 obs. of 8 variables:
## $ sire : Factor w/ 11 levels "25","28","34",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ dam : Factor w/ 55 levels "26","27","29",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 2 2 1 1 ...
## $ reps : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ Gl : num 187 136 115 125 112 190 169 159 111 89 ...
## $ BW : num 36.6 33.3 42.1 37.1 38.4 ...
## $ M227 : Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 1 2 1 2 2 ...
## $ M1139: Factor w/ 3 levels "AA","AB","BB": 3 NA 1 1 1 2 3 3 2 2 ...
Here we will estimate additive genetic values for body weight. The vector of observed trait values for body weight can be extracted from the mouse data as follows:
y <- mouse[,"BW"]
Let us explore the trait values using the head
,
tail
and summary
functions:
head(y)
## [1] 36.65 33.29 42.07 37.15 38.39 39.82
tail(y)
## [1] 39.67 39.35 44.80 52.23 47.63 54.10
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.04 34.06 38.32 38.72 43.40 60.28
To make the \(X\) matrix we need to
decide which variables we should include as fixed effects in the model.
Here we use the variables sex and reps. The model.matrix
function can be used to construct the \(X\) matrix in the linear mixed model
specified above:
X <- model.matrix(BW ~ sex + reps, data=mouse)
We can use the head
and tail
functions to
look at the \(X\) matrix:
head(X)
## (Intercept) sexMale reps2 reps3
## 91 1 0 0 0
## 92 1 0 0 0
## 93 1 0 0 0
## 94 1 0 0 0
## 95 1 1 0 0
## 96 1 1 0 0
tail(X)
## (Intercept) sexMale reps2 reps3
## 1262 1 0 0 1
## 1263 1 0 0 1
## 1264 1 1 0 1
## 1265 1 1 0 1
## 1266 1 1 0 1
## 1267 1 1 0 1
Answer:
The BLUP analysis is based on estimates of the variance components (i.e. \(\sigma_{a}^2\) and \(\sigma_{e}^2\)). The variance components are estimated using REML method. The input required the vector of observed values of the trait (\(y\)), the design matrix for the fixed effects (\(X\)), and the genetic relationship matrix (\(A\)).
The genetic relationship matrix \(A\) include relationships for all individuals in the pedigree. However only a subset of the individuals have phenotypes recorded for body weight and glucose levels in blood. Therefore we need to subset the \(A\) matrix:
ids <- rownames(X)
A <- A[ids,ids]
The REML analysis is done using the following command:
fit <- greml(y=y,X=X, GRM=list(A=A))
The fit object contains estimates of variance components, fixed and
random effects, first and second derivatives of log-likelihood, and the
asymptotic standard deviation of parameter estimates. Our main interest
is the variance components \(\sigma_a^2\) and \(\sigma_e^2\) which are in the
fit$theta
slot of the fit. The following commands extract
and makes a barplot of the estimates of the variance components:
fit$theta
## A E
## 6.569611 13.384147
Va <- fit$theta[1] # First element in theta is the additive genetic variance
Ve <- fit$theta[2] # Second element in theta is the residual variance
barplot(fit$theta)
The goal of the BLUP analysis is the estimate the fixed ,\(b\), and random genetic effects, \(a\), in the linear mixed model specified
above. This can be done using the BLUE
and `BLUP´ equations
shown below:
The best linear unbiased prediction (BLUP) of \(\hat{a}\) is:
\[\begin{equation} \hat{a} = A\sigma_a^2V^{-1}(y - X\hat{b}) \end{equation}\]
The best linear unbiased estimator (BLUE) of \(\hat{b}\) is:
\[\begin{equation} \hat{b} = (X'V^{-1} X)^{-1} X' V^{-1} y \end{equation}\]
The matrix \((X' V^{-1} X)^{-1}\) denotes the inverse of the matrix \((X' V^{-1} X)\).
We have already determined \(y\) and \(X\) and therefore just need to construct the phenotypic covariance matrix \(V\) (and it’s inverse). This can be done using the following lines of R code:
n <- nrow(X) # Number of individuals in the data set
I <- diag(1,n) # Identity matrix for residual effects
V <- A*Va + I*Ve # Phenotypic variance covariance matrix
Vi <- solve(V) # Inverse of phenotypic covariance matrix
The solution to the fixed effects, \(b\), can be found using the following R command:
bhat <- solve(t(X) %*% Vi %*% X)%*%t(X) %*% Vi %*% y
bhat
## [,1]
## (Intercept) 33.8873546
## sexMale 8.3453194
## reps2 -0.3684327
## reps3 2.6411388
The solution to the random genetic effects, \(a\), can be found using the following R command:
ahat <- (A*Va)%*% Vi %*% (y-X%*%bhat)
head(ahat)
## [,1]
## 91 -0.001564943
## 92 -0.663690910
## 93 1.066507303
## 94 0.096965707
## 95 -1.303217779
## 96 -1.021420120
tail(ahat)
## [,1]
## 1262 1.111041
## 1263 1.047981
## 1264 0.477426
## 1265 1.941591
## 1266 1.035109
## 1267 2.310096
Answer:
layout(matrix(1:2,ncol=2))
hist(y)
hist(ahat)
Answer:
plot(ahat,y)
cor(ahat,y)
## [,1]
## [1,] 0.5708277
In this practical we will estimate the additive genomic values for
quantitative traits in the mouse population. We will be using the GBLUP
method. This method allow for estimation of additive genomic values
using phenotypic and genotypic information for individuals from a
general pedigree. GBLUP is based on linear mixed model methodology and
estimates of additive genomic values can be obtained by solving the
mixed model equations. The GBLUP method also require a genomic
relationship matrix estimated from genetic marker data and estimates of
variance components (e.g., \(\sigma_a^2\) and \(\sigma_e^2\)). These methods are
implemented in the R package qgg
introduced previously.
Use the following code to load the qgg package:
library(qgg) # R package used for REML/BLUP analysis
The mouse phenotype data, pedigree and genotype data can be loaded using the following commands:
mouse <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mouseqtl.rds"))
pedigree <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousepedigree.rds"))
genotypes <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousegenotypes_imputed.rds"))
First let us have a quick look at the mouse genotype data. Use the
str
function to get a fast overview of the genotypes you
are working with.
The genotypes for each marker are coded as 0,1 or 2 corresponding to the number of copies of the minor allele. The number of individuals and number of genetic markers in the data can be found using the following commands:
nrow(genotypes)
## [1] 1267
dim(genotypes)
## [1] 1267 1813
The genomic relationship matrix \(G\) is used for estimating additive genomic
values. The matrix \(G\) can be
computed using genetic marker data. This is implemented in the function
grm
from the qgg
package. Use the command
below to compute the genomic relationship matrix for the mouse
population:
W <- scale(genotypes) # here we center and scale columns in genotypes (i.e., mean=0, sd=1)
G <- grm(W=W)
The dimension of the genomic relationship matrix can be determined using the following command:
dim(G)
## [1] 1267 1267
The number of rows and columns should be equal to the number of individuals in the genotype matrix.
Make a plot of the genomic relationship matrix using the
corrplot
function from the corrplot R package:
corrplot(G, method="color", bg="white", col= colorRampPalette(c("#B4362AFF", "#DA5A5AFF","white", "#62AFD7FF", "#233253FF"))(10) ,
tl.pos="n", outline=FALSE, xlab=FALSE, ylab=FALSE, is.corr=FALSE)
Answer:
To better compare the pedigree based genetic relationship matrix \(A\) and the genomic relationship matrix \(G\) we can make a scatter plot of the values from the two matrices:
A <- grm(pedigree=pedigree)
dim(A)
## [1] 1267 1267
plot(as.vector(A),as.vector(G))
Answer:
The next step is to prepare the linear mixed model for the mouse
data. The linear mixed model contains the observation vector for the
trait(s) of interest (\(y\)), the
fixed effects
that explain systematic differences in \(y\), and the
random genetic effects
\(a\) and random residual effects \(e\).
A matrix formulation of a general model equation is: \[\begin{align} y &= Xb + a + e \notag \end{align}\]
where \[\begin{align} y &: \text{is the vector of observed values of the trait,} \notag \\ b &: \text{is a vector of fixed effects,} \notag \\ a &: \text{is a vector of random genetic effects,} \notag \\ e &: \text{is a vector of random residual effects,} \notag \\ X &: \text{is a known design matrix that relates the elements of b to their corresponding element in y.} \notag \end{align}\]
In the statistical model (specified above) the random effects (\(a\) and \(e\)) and the phenotypes (\(y\)) are considered to be random variables
which follow a multivariate normal distribution. In general terms the
expectations of these random variables are:
\[\begin{align}
a \sim MVN(0,G\sigma_a^2) \notag \\
e \sim MVN(0,I\sigma_e^2) \notag \\
y \sim MVN(Xb,V) \notag \\
\end{align}\] where \(G\sigma_a^2\), and \(I\sigma_e^2\) are square matrices of
genetic and residual (co)variances among the individuals, respectively,
and \(V=G\sigma_a^2+I\sigma_e^2\) is
the overall phenotypic covariance matrix.
The main difference is that we use the genomic relationship matrix \(G\) instead of the pedigree based genetic relationship matrix \(A\).
In order to specify the linear mixed model we need to construct \(y\) and \(X\) from the mouse data. Let us just have a quick look at the mouse data again:
Here we will estimate additive genomic values for body weight. The vector of observed trait values for body weight can be extracted from the mouse data as follows:
y <- mouse[,"BW"]
To make the \(X\) matrix we need to
decide which variables we should include as fixed effects in the model.
Here we use the variables sex and reps. The model.matrix
function can be used to construct the \(X\) matrix in the linear mixed model
specified above:
X <- model.matrix(BW ~ sex + reps, data=mouse)
The GBLUP analysis is based on estimates of the variance components (i.e. \(\sigma_{a}^2\) and \(\sigma_{e}^2\)). The variance components are estimated using REML method. The input required the vector of observed values of the trait (\(y\)), the design matrix for the fixed effects (\(X\)), and the genetic relationship matrix (\(A\)).
The genetic relationship matrix \(A\) include relationships for all individuals in the pedigree. However only a subset of the individuals have phenotypes recorded for body weight. Therefore we need to subset the \(A\) matrix. The REML analysis is done using the following command:
ids <- rownames(X)
fit <- greml(y=y,X=X, GRM=list(A=A[ids,ids]))
The fit object contains estimates of variance components \(\sigma_a^2\) and \(\sigma_e^2\) which are in the
fit$theta
slot of the fit:
fit$theta
## A E
## 6.569611 13.384147
Va <- fit$theta[1] # First element in theta is the additive genetic variance
Ve <- fit$theta[2] # Second element in theta is the residual variance
The goal of the GBLUP analysis is the estimate the fixed ,\(b\), and random genetic effects, \(a\), in the linear mixed model specified
above. This can be done using the BLUE
and `BLUP´ equations
as shown previously. The best linear unbiased prediction (BLUP) of \(\hat{a}\) is:
\[\begin{equation} \hat{a} = G\sigma_a^2V^{-1}(y - X\hat{b}) \end{equation}\]
The best linear unbiased estimator (BLUE) of \(\hat{b}\) is:
\[\begin{equation} \hat{b} = (X'V^{-1} X)^{-1} X' V^{-1} y \end{equation}\].
We have allready determined \(y\) and \(X\) and therefore just need to construct the phenotypic covariance matrix \(V\) (and it’s inverse). This can be done using the following lines of R code:
ids <- rownames(X) # Individuals with phenotypes
n <- nrow(X) # Number of individuals in the data set
I <- diag(1,n) # Identity matrix for residual effects
V <- G[ids,ids]*Va + I*Ve # Phenotypic variance covariance matrix
Vi <- solve(V) # Inverse of phenotypic covariance matrix
The solution to the fixed effects, \(b\), can be found using the following R command:
bhat <- solve(t(X) %*% Vi %*% X)%*%t(X) %*% Vi %*% y
bhat
## [,1]
## (Intercept) 33.9244213
## sexMale 8.4114162
## reps2 -0.5009359
## reps3 2.6993539
The solution to the random genetic effects, \(a\), can be found using the following R command:
ahat_G <- (G[,ids]*Va)%*% Vi %*% (y-X%*%bhat) # Genomic based BLUP
V <- A[ids,ids]*Va + I*Ve # Phenotypic variance covariance matrix
Vi <- solve(V) # Inverse of phenotypic covariance matrix
bhat <- solve(t(X) %*% Vi %*% X)%*%t(X) %*% Vi %*% y
ahat_A <- (A[,ids]*Va)%*% Vi %*% (y-X%*%bhat) # Genomic based BLUP
Answer:
layout(matrix(1:2,ncol=2))
hist(ahat_A)
hist(ahat_G)
We want to compare the additive genomic values estimated using pedigree and genomic information. This can be done by computing the correlation or by making a scatter plot for the additive genetic/genomic values obtained using pedigree or genetic marker information.
Answer:
plot(ahat_G,ahat_A)
cor(ahat_G,ahat_A)
## [,1]
## [1,] 0.6996325