Introduction

In these practical sessions, we will analyze quantitative traits in a mouse population. The dataset includes phenotypic measurements related to growth and obesity, such as body weight and blood glucose levels, along with genetic marker data.

The M16 mouse was developed as an outbred model for early-onset polygenic obesity and diabesity. This was achieved through 27 generations of selective breeding for weight gain between 3 and 6 weeks of age, starting from an outbred ICR base population. The breeding strategy involved selecting the male and female with the highest weight gain within each litter. A parallel ICR control line was maintained through random mating while preserving a similar effective population size. Compared to ICR mice, M16 mice are consistently larger at all ages and exhibit increased body fat percentage, fat cell size and number, as well as larger organ weights. They also display hyperphagia, moderate obesity, and metabolic abnormalities, including hyperglycemia, hyperinsulinemia, and hypercholesterolemia.

The ICR mouse is an albino strain originally derived from SWISS mice and selectively bred by Dr. Hauschka to establish a highly fertile line. Named after the Institute of Cancer Research (ICR) in the USA, where it was developed, this strain has been widely distributed for research. ICR mice are relatively large, docile, and grow well, making them a widely used general-purpose model in various research fields, including toxicity studies, pharmacology, drug efficacy, and immunology.

A large F2 population (n=1181) was established by crossing the M16 and ICR lines. Twelve F1 families resulted from six pair matings of M16 males x ICR females and six pair matings of the reciprocal cross. A total of 55 F1 dams were mated to 11 F1 sires in sets of five F1 full sisters mated to the same F1 sire. These same specific matings were repeated in three consecutive replicates. Thus, the F2 population consisted of 55 full-sib families of up to 24 individuals each and 11 sire families families of up to 120 individuals each. Actual numbers of mice within families varied slightly due to a small number of failed pregnancies. All litters were standardized at birth to eight pups, with approximately equal representation of males and females.

More information about the mouse data can be found in the following publications:

Allan, M.F., Eisen, E.J. and Pomp, D. (2004). The M16 Mouse: An Outbred Animal Model of Early Onset Polygenic Obesity and Diabesity. Obesity Research, 12: 1397-1407. https://doi.org/10.1038/oby.2004.176

Allan, M. F., Eisen E. J and Pomp, D. (2005). Genomic Mapping of Direct and Correlated Responses to Long-Term Selection for Rapid Growth Rate in Mice. Genetics, 170(4): 1863–1877. https://doi.org/10.1534/genetics.105.041319



Practical 3: Estimation of Genetic Parameters

Introduction

In this practical we will estimate genetic parameters (heritability) for quantitative traits observed in the F2 mouse population. We will be using the REML method. This method allow for estimation of genetic parameters using phenotypic information for individuals from a general pedigree. REML is based on linear mixed model methodology and uses a likelihood approach to estimate genetic parameters. The REML method also require us to calculate an genetic relationship matrix using a recursive algorithm. These methods and algorithms are implemented in the R package qgg.

This package provides an infrastructure for efficient processing of large-scale genetic and phenotypic data including core functions for:

  • fitting linear mixed models
  • constructing genetic relationship matrices
  • estimating genetic parameters (heritability and correlation)
  • performing genomic prediction and genetic risk profiling
  • single or multi-marker association analyses

We will also be using the qgg package for the remaining practicals.

Installation of the R package qgg

You can install qgg from CRAN with:

install.packages("qgg")

Load R packages that will be used in this practical

library(qgg) # R package used for REML analysis
#install.packages("corrplot") 
library(corrplot)

Explore mouse pedigree data

The mouse data set can be loaded using the following command:

mouse <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mouseqtl.rds"))

The mouse pedigree is loaded in a similar way using the following command:

pedigree <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousepedigree.rds"))
Question 1: Which variables do we have in the pedigree?

Use the str function to get a fast overview of the pedigree you are working.

Answer:

str(pedigree)
## 'data.frame':    1267 obs. of  6 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sire      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dam       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ family    : Factor w/ 68 levels "0/0","1/2","11/12",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : chr  "Male" "Female" "Male" "Female" ...
##  $ generation: chr  "M6" "IC" "IC" "M6" ...
Question 2: How many individuals do we have in the pedigree?

Answer:

nrow(pedigree)
## [1] 1267
dim(pedigree)
## [1] 1267    6
Question 3: How many generations and number of mice in each generation do we have in the pedigree?

Use the table function on the generation variable.

Answer:

table(pedigree$generation)
## 
##   F1   F2   IC   M6 
##   66 1177   12   12

Computing genetic relationship matrix for the mouse pedigree:

The REML analysis require us to calculate the genetic relationship matrix \(A\). This is done using information about the id, mother, and father which is available in our pedigree data file.

To illustrate this step we will first calculate it for a small part of the mouse pedigree. We are given the following pedigree and we want to compute the matrix \(A\).

family <- c(13,14,84,1244,1248)
pedigree[family,]
##        id sire dam family    sex generation
## 13     13    0   0    0/0   Male         IC
## 14     14    0   0    0/0 Female         M6
## 84     84   13  14  13/14 Female         F1
## 1244 1244   78  84  78/84 Female         F2
## 1248 1248   78  84  78/84   Male         F2

The additive genetic relationship (\(A_{ij}\)) between the various sources (j) and the individual itself, i.e. the candidate to be evaluated (i), can be seen in the table below.

Relative \(A_{i,j}\)
Self 1.0
Unrelated 0
Mother 0.5
Father 0.5
Grandparent 0.25
Half-sib 0.25
Full-sib 0.5
Progeny 0.5

Next we will compute the genetic relationship matrix for the entire mouse pedigree. The matrix \(A\) can be computed using a recursive algorithm implemented in the function grm from the qgg package. Use the command below to compute the genetic relationship matrix for the mouse pedigree:

A <- grm(pedigree=pedigree) 
Question 4: What is the dimension of the genetic relationship matrix?

Answer:

dim(A)
## [1] 1267 1267

The number of rows and columns should be equal to the number of individuals in the pedigree. Check the first 5 individuals in the matrix using the following command:

A[1:5,1:5]
##   1 2 3 4 5
## 1 1 0 0 0 0
## 2 0 1 0 0 0
## 3 0 0 1 0 0
## 4 0 0 0 1 0
## 5 0 0 0 0 1
Question 6: How should we interpret this value?

Answer:

Previously we have determined the genetic relationship matrix for a small part of the mouse pedigree. We can extract the corresponding elements from the \(A\) matrix for the entire mouse pedigree using the following command:

ids <- c(13,14,84,1244,1248)
A[ids,ids]
##        13   14  84 1244 1248
## 13   1.00 0.00 0.5 0.25 0.25
## 14   0.00 1.00 0.5 0.25 0.25
## 84   0.50 0.50 1.0 0.50 0.50
## 1244 0.25 0.25 0.5 1.00 0.50
## 1248 0.25 0.25 0.5 0.50 1.00
Question 6: Are the values in this part of the genetic relationship matrix the same as you have found using the “manual” approach?

Answer:

Make a plot of the genetic relationship matrix using the corrplot function from the corrplot R package:

corrplot(A, method="color", bg="white", col= colorRampPalette(c("white", "#62AFD7FF", "#233253FF"))(10) ,
         tl.pos="n", outline=FALSE, xlab=FALSE, ylab=FALSE, is.corr=FALSE)

Question 7: Describe the plot you just made of the genetic relationship?

Answer:

Specifying the linear mixed model for the mouse data:

The next step is to prepare the linear mixed model for the mouse data. Recall that the linear mixed model contains the observation vector for the trait(s) of interest (\(y\)), the fixed effects that explain systematic differences in \(y\), and the random genetic effects \(a\) and random residual effects \(e\).

A matrix formulation of a general model equation is: \[\begin{align} y &= Xb + a + e \notag \end{align}\]

where \[\begin{align} y &: \text{is the vector of observed values of the trait,} \notag \\ b &: \text{is a vector of fixed effects,} \notag \\ a &: \text{is a vector of random genetic effects,} \notag \\ e &: \text{is a vector of random residual effects,} \notag \\ X &: \text{is a known design matrix that relates the elements of b to their corresponding element in y.} \notag \end{align}\]

In the statistical model (specified above) the random effects (\(a\) and \(e\)) and the phenotypes (\(y\)) are considered to be random variables which follow a multivariate normal distribution: In general terms the expectations of these random variables are:
\[\begin{align} E(y) &= Xb \notag \\ E(a) &= 0 \notag \\ E(e) &= 0 \notag \\ \end{align}\] and the variance-covariance matrices are: \[\begin{align} Var(a) &= A\sigma_a^2 \notag \\ Var(e) &= I\sigma_e^2 \notag \\ Var(y) &= A\sigma_a^2 + I\sigma_e^2 \notag \end{align}\]

where \(A\sigma_a^2\), and \(I\sigma_e^2\) are square matrices of genetic and residual (co)variances among the individuals, respectively. In the previous section we have allready constructed the genetic relationship matrix \(A\).

In order to perform the REML analysis we need to construct \(y\) and \(X\) from the mouse data. Let us just have a quick look at the mouse data again:

str(mouse)
## 'data.frame':    1177 obs. of  8 variables:
##  $ sire : Factor w/ 11 levels "25","28","34",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ dam  : Factor w/ 55 levels "26","27","29",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ sex  : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 2 2 1 1 ...
##  $ reps : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Gl   : num  187 136 115 125 112 190 169 159 111 89 ...
##  $ BW   : num  36.6 33.3 42.1 37.1 38.4 ...
##  $ M227 : Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 1 2 1 2 2 ...
##  $ M1139: Factor w/ 3 levels "AA","AB","BB": 3 NA 1 1 1 2 3 3 2 2 ...

Here we will estimate the heritability for body weight. The vector of observed trait values for body weight can be extracted from the mouse data as follows:

y <- mouse[,"BW"]

Let us explore the trait values using the head, tail and summary functions:

head(y)
## [1] 36.65 33.29 42.07 37.15 38.39 39.82
tail(y)
## [1] 39.67 39.35 44.80 52.23 47.63 54.10
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.04   34.06   38.32   38.72   43.40   60.28

To make the \(X\) matrix we need to decide which variables we should include as fixed effects in the model. We have sex, reps, sire, dam, M227 and M1139 in the mouse data frame.

Question 8: Which variables should we include as fixed effects in the model?

Answer:

The model.matrix function can be used to construct the \(X\) matrix in the linear mixed model specified above:

X <- model.matrix(BW ~ sex + reps, data=mouse)

We can use the head and tail functions to look at the \(X\) matrix:

head(X)
##    (Intercept) sexMale reps2 reps3
## 91           1       0     0     0
## 92           1       0     0     0
## 93           1       0     0     0
## 94           1       0     0     0
## 95           1       1     0     0
## 96           1       1     0     0
tail(X)
##      (Intercept) sexMale reps2 reps3
## 1262           1       0     0     1
## 1263           1       0     0     1
## 1264           1       1     0     1
## 1265           1       1     0     1
## 1266           1       1     0     1
## 1267           1       1     0     1

Estimating genetic parameters on the mouse data using REML

The goal of the REML analysis to estimate the parameters (i.e. variance components \(\sigma_{a}^2\) and \(\sigma_{e}^2\)) in the linear mixed model specified above. In this analysis we find the set of parameters which maximizes the likelihood of the data, i.e., the probability of observations given the model and its parameter estimates: \(p(y|\hat{b}, \hat{\sigma}^2_{a}, \hat{\sigma}^2_{e})\).

The input required the vector of observed values of the trait (\(y\)), the deisgn matrix for the fixed effects (\(X\)), and the genetic relationship matrix (\(A\)). The \(A\) matrix calculated previously include genetic relationships for all individuals in the pedigree. However only a subset of the inviduals have phenotypes recorded for body weight. Therefore we need to subset the \(A\) matrix as shown in the R code below:

ids <- rownames(X)
A <- A[ids,ids]

The REML method is implemented in the greml function from the “qgg” package. The REML analysis is done using the following command:

fit <- greml(y=y,X=X, GRM=list(A=A))

The fit object (i.e., output from the greml function) contains estimates of variance components, fixed and random effects, first and second derivatives of log-likelihood, and the asymptotic standard deviation of parameter estimates.

Our main interest is the variance components \(\sigma_a^2\) and \(\sigma_e^2\) which are in the fit$theta slot of the fit. The following commands extract and makes a barplot of the estimates of the variance components:

fit$theta
##         A         E 
##  6.569611 13.384147
barplot(fit$theta)

The first element in the theta vector is the estimate of the additive genetic variance (\(\hat{\sigma}^2_{a}\)) and the second element is the estimate of the residual variance (\(\hat{\sigma}^2_{e}\)).

Va <- fit$theta[1]
Ve <- fit$theta[2]
Va
##        A 
## 6.569611
Ve
##        E 
## 13.38415

From the REML estimate of the variance components, the heritability can easily be computed by: \[\begin{align} h^2 &= \sigma^2_{a}/(\sigma^2_a+\sigma^2_e) \end{align}\]

Question 9: What is the heritability for body weight?

Answer:

Va/(Va+Ve)
##         A 
## 0.3292418

In the experiment the mice were feed ad libitum. Now we want to perform a simlar experiment where mice are reared under restricted feed intake, We will record phenotypes for body weight and blood glucose levels and use mice from the same F2 population.

Question 10: Should we re-estimate the heritability?

Answer:

Question 11: What is the heritability for glucose levels in the blood?

Answer:

y <- mouse[,"Gl"]
X <- model.matrix(Gl ~ sex + reps, data=mouse)
ids <- rownames(X)
A <- grm(pedigree=pedigree) 
A <- A[ids,ids]
fit <- greml(y=y,X=X, GRM=list(A=A))
Va <- fit$theta[1]
Ve <- fit$theta[2]
Va/(Va+Ve)
##         A 
## 0.4058474

Practical 4: Estimation of additive genetic values

Introduction

In this practical we will estimate additive genetic values for quantitative traits in the mouse population. We will be using the BLUP method. This method allow for estimation of additive genetic values using phenotypic information for individuals from a general pedigree. BLUP is based on linear mixed model methodology and estimates of additive genetic values can be obtained by solving the mixed model equations. The BLUP method also require a genetic relationship matrix and estimates of variance components (e.g., \(\sigma_a^2\) and \(\sigma_e^2\)). These methods and algorithms are implemented in the R package qgg introduced previously.

Load R packages that will be used in this practical

Use the following code to load the qgg package:

library(qgg) # R package used for REML/BLUP analysis

Explore mouse pedigree data

The mouse data and pedigree set can be loaded using the following commands:

mouse <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mouseqtl.rds"))
pedigree <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousepedigree.rds"))

First let us have a quick look at the mouse data again. Use the str function to get a fast overview of the pedigree you are working.

str(pedigree)
## 'data.frame':    1267 obs. of  6 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sire      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dam       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ family    : Factor w/ 68 levels "0/0","1/2","11/12",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : chr  "Male" "Female" "Male" "Female" ...
##  $ generation: chr  "M6" "IC" "IC" "M6" ...

The number of individuals and generations in the pedigree can be found using the following commands:

nrow(pedigree)
## [1] 1267
dim(pedigree)
## [1] 1267    6
table(pedigree$generation)
## 
##   F1   F2   IC   M6 
##   66 1177   12   12

Computing genetic relationship matrix for the mouse pedigree

The genetic relationship matrix \(A\) is used for estimating additive genetic values. The matrix \(A\) can be computed using the recursive algorithm implemented in the function grm from the qgg package. Use the command below to compute the genetic relationship matrix for the mouse pedigree:

A <- grm(pedigree=pedigree) 

The dimension of the genetic relationship matrix can be determined using the following command:

dim(A)
## [1] 1267 1267

The number of rows and columns should be equal to the number of individuals in the pedigree.

Specifying the linear mixed model for the mouse data

The next step is to prepare the linear mixed model for the mouse data. Recall that the linear mixed model contains the observation vector for the trait(s) of interest (\(y\)), the fixed effects that explain systematic differences in \(y\), and the random genetic effects \(a\) and random residual effects \(e\).

A matrix formulation of a general model equation is: \[\begin{align} y &= Xb + a + e \notag \end{align}\]

where \[\begin{align} y &: \text{is the vector of observed values of the trait,} \notag \\ b &: \text{is a vector of fixed effects,} \notag \\ a &: \text{is a vector of random genetic effects,} \notag \\ e &: \text{is a vector of random residual effects,} \notag \\ X &: \text{is a known design matrix that relates the elements of b to their corresponding element in y.} \notag \end{align}\]

In the statistical model (specified above) the random effects (\(a\) and \(e\)) and the phenotypes (\(y\)) are considered to be random variables which follow a multivariate normal distribution. In general terms the expectations of these random variables are:
\[\begin{align} a \sim MVN(0,A\sigma_a^2) \notag \\ e \sim MVN(0,I\sigma_e^2) \notag \\ y \sim MVN(Xb,V) \notag \\ \end{align}\] where \(A\sigma_a^2\), and \(I\sigma_e^2\) are square matrices of genetic and residual (co)variances among the individuals, respectively, and \(V=A\sigma_a^2+I\sigma_e^2\) is the overall phenotypic covariance matrix. In the previous section we have already constructed the genetic relationship matrix \(A\).

In order to specify the linear mixed model we need to construct \(y\) and \(X\) from the mouse data. Let us just have a quick look at the mouse data again:

str(mouse)
## 'data.frame':    1177 obs. of  8 variables:
##  $ sire : Factor w/ 11 levels "25","28","34",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ dam  : Factor w/ 55 levels "26","27","29",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ sex  : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 2 2 1 1 ...
##  $ reps : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Gl   : num  187 136 115 125 112 190 169 159 111 89 ...
##  $ BW   : num  36.6 33.3 42.1 37.1 38.4 ...
##  $ M227 : Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 1 2 1 2 2 ...
##  $ M1139: Factor w/ 3 levels "AA","AB","BB": 3 NA 1 1 1 2 3 3 2 2 ...

Here we will estimate additive genetic values for body weight. The vector of observed trait values for body weight can be extracted from the mouse data as follows:

y <- mouse[,"BW"]

Let us explore the trait values using the head, tail and summary functions:

head(y)
## [1] 36.65 33.29 42.07 37.15 38.39 39.82
tail(y)
## [1] 39.67 39.35 44.80 52.23 47.63 54.10
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.04   34.06   38.32   38.72   43.40   60.28

To make the \(X\) matrix we need to decide which variables we should include as fixed effects in the model. Here we use the variables sex and reps. The model.matrix function can be used to construct the \(X\) matrix in the linear mixed model specified above:

X <- model.matrix(BW ~ sex + reps, data=mouse)

We can use the head and tail functions to look at the \(X\) matrix:

head(X)
##    (Intercept) sexMale reps2 reps3
## 91           1       0     0     0
## 92           1       0     0     0
## 93           1       0     0     0
## 94           1       0     0     0
## 95           1       1     0     0
## 96           1       1     0     0
tail(X)
##      (Intercept) sexMale reps2 reps3
## 1262           1       0     0     1
## 1263           1       0     0     1
## 1264           1       1     0     1
## 1265           1       1     0     1
## 1266           1       1     0     1
## 1267           1       1     0     1
Question 1: Why do we not include the effect of sire and dam in the model?

Answer:

Estimating genetic parameters on the mouse data using REML

The BLUP analysis is based on estimates of the variance components (i.e. \(\sigma_{a}^2\) and \(\sigma_{e}^2\)). The variance components are estimated using REML method. The input required the vector of observed values of the trait (\(y\)), the design matrix for the fixed effects (\(X\)), and the genetic relationship matrix (\(A\)).

The genetic relationship matrix \(A\) include relationships for all individuals in the pedigree. However only a subset of the individuals have phenotypes recorded for body weight and glucose levels in blood. Therefore we need to subset the \(A\) matrix:

ids <- rownames(X)
A <- A[ids,ids]

The REML analysis is done using the following command:

fit <- greml(y=y,X=X, GRM=list(A=A))

The fit object contains estimates of variance components, fixed and random effects, first and second derivatives of log-likelihood, and the asymptotic standard deviation of parameter estimates. Our main interest is the variance components \(\sigma_a^2\) and \(\sigma_e^2\) which are in the fit$theta slot of the fit. The following commands extract and makes a barplot of the estimates of the variance components:

fit$theta
##         A         E 
##  6.569611 13.384147
Va <- fit$theta[1] # First element in theta is the additive genetic variance
Ve <- fit$theta[2] # Second element in theta is the residual variance
barplot(fit$theta)

Estimating additive genetic values for traits in the mouse data using BLUP

The goal of the BLUP analysis is the estimate the fixed ,\(b\), and random genetic effects, \(a\), in the linear mixed model specified above. This can be done using the BLUE and `BLUP´ equations shown below:

The best linear unbiased prediction (BLUP) of \(\hat{a}\) is:

\[\begin{equation} \hat{a} = A\sigma_a^2V^{-1}(y - X\hat{b}) \end{equation}\]

The best linear unbiased estimator (BLUE) of \(\hat{b}\) is:

\[\begin{equation} \hat{b} = (X'V^{-1} X)^{-1} X' V^{-1} y \end{equation}\]

The matrix \((X' V^{-1} X)^{-1}\) denotes the inverse of the matrix \((X' V^{-1} X)\).

We have already determined \(y\) and \(X\) and therefore just need to construct the phenotypic covariance matrix \(V\) (and it’s inverse). This can be done using the following lines of R code:

n <- nrow(X)       # Number of individuals in the data set
I <- diag(1,n)     # Identity matrix for residual effects  
V <-  A*Va + I*Ve  # Phenotypic variance covariance matrix
Vi <- solve(V)     # Inverse of phenotypic covariance matrix

The solution to the fixed effects, \(b\), can be found using the following R command:

bhat <- solve(t(X) %*% Vi %*% X)%*%t(X) %*% Vi %*% y
bhat
##                   [,1]
## (Intercept) 33.8873546
## sexMale      8.3453194
## reps2       -0.3684327
## reps3        2.6411388

The solution to the random genetic effects, \(a\), can be found using the following R command:

ahat <- (A*Va)%*% Vi %*% (y-X%*%bhat)
head(ahat)
##            [,1]
## 91 -0.001564943
## 92 -0.663690910
## 93  1.066507303
## 94  0.096965707
## 95 -1.303217779
## 96 -1.021420120
tail(ahat)
##          [,1]
## 1262 1.111041
## 1263 1.047981
## 1264 0.477426
## 1265 1.941591
## 1266 1.035109
## 1267 2.310096
Question 2: Make histogram for $ and the estimated additive genetic values. What do you think about their distribution?

Answer:

layout(matrix(1:2,ncol=2))
hist(y)
hist(ahat)

Question 3: Make a scatter plot of y and the estimated additive genetic values. What do you think about their relationship?

Answer:

plot(ahat,y)

cor(ahat,y)
##           [,1]
## [1,] 0.5708277

Practical 5: Estimation of additive genomic values

Introduction

In this practical we will estimate the additive genomic values for quantitative traits in the mouse population. We will be using the GBLUP method. This method allow for estimation of additive genomic values using phenotypic and genotypic information for individuals from a general pedigree. GBLUP is based on linear mixed model methodology and estimates of additive genomic values can be obtained by solving the mixed model equations. The GBLUP method also require a genomic relationship matrix estimated from genetic marker data and estimates of variance components (e.g., \(\sigma_a^2\) and \(\sigma_e^2\)). These methods are implemented in the R package qgg introduced previously.

Load R packages that will be used in this practical

Use the following code to load the qgg package:

library(qgg) # R package used for REML/BLUP analysis

Explore mouse pedigree data

The mouse phenotype data, pedigree and genotype data can be loaded using the following commands:

mouse <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mouseqtl.rds"))
pedigree <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousepedigree.rds"))
genotypes <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousegenotypes_imputed.rds"))

First let us have a quick look at the mouse genotype data. Use the str function to get a fast overview of the genotypes you are working with.

The genotypes for each marker are coded as 0,1 or 2 corresponding to the number of copies of the minor allele. The number of individuals and number of genetic markers in the data can be found using the following commands:

nrow(genotypes)
## [1] 1267
dim(genotypes)
## [1] 1267 1813

Computing genomic relationship matrix using marker data

The genomic relationship matrix \(G\) is used for estimating additive genomic values. The matrix \(G\) can be computed using genetic marker data. This is implemented in the function grm from the qgg package. Use the command below to compute the genomic relationship matrix for the mouse population:

W <- scale(genotypes) # here we center and scale columns in genotypes (i.e., mean=0, sd=1)
G <- grm(W=W) 

The dimension of the genomic relationship matrix can be determined using the following command:

dim(G)
## [1] 1267 1267

The number of rows and columns should be equal to the number of individuals in the genotype matrix.

Make a plot of the genomic relationship matrix using the corrplot function from the corrplot R package:

corrplot(G, method="color", bg="white", col= colorRampPalette(c("#B4362AFF", "#DA5A5AFF","white", "#62AFD7FF", "#233253FF"))(10) ,
         tl.pos="n", outline=FALSE, xlab=FALSE, ylab=FALSE, is.corr=FALSE)

Question 1: Describe the plot you just made of the genomic relationship?

Answer:

To better compare the pedigree based genetic relationship matrix \(A\) and the genomic relationship matrix \(G\) we can make a scatter plot of the values from the two matrices:

A <- grm(pedigree=pedigree) 
dim(A)
## [1] 1267 1267
plot(as.vector(A),as.vector(G))

Question 2: Are the two relationship matrices similar?

Answer:

Specifying the linear mixed model for the mouse data:

The next step is to prepare the linear mixed model for the mouse data. The linear mixed model contains the observation vector for the trait(s) of interest (\(y\)), the fixed effects that explain systematic differences in \(y\), and the random genetic effects \(a\) and random residual effects \(e\).

A matrix formulation of a general model equation is: \[\begin{align} y &= Xb + a + e \notag \end{align}\]

where \[\begin{align} y &: \text{is the vector of observed values of the trait,} \notag \\ b &: \text{is a vector of fixed effects,} \notag \\ a &: \text{is a vector of random genetic effects,} \notag \\ e &: \text{is a vector of random residual effects,} \notag \\ X &: \text{is a known design matrix that relates the elements of b to their corresponding element in y.} \notag \end{align}\]

In the statistical model (specified above) the random effects (\(a\) and \(e\)) and the phenotypes (\(y\)) are considered to be random variables which follow a multivariate normal distribution. In general terms the expectations of these random variables are:
\[\begin{align} a \sim MVN(0,G\sigma_a^2) \notag \\ e \sim MVN(0,I\sigma_e^2) \notag \\ y \sim MVN(Xb,V) \notag \\ \end{align}\] where \(G\sigma_a^2\), and \(I\sigma_e^2\) are square matrices of genetic and residual (co)variances among the individuals, respectively, and \(V=G\sigma_a^2+I\sigma_e^2\) is the overall phenotypic covariance matrix.

The main difference is that we use the genomic relationship matrix \(G\) instead of the pedigree based genetic relationship matrix \(A\).

In order to specify the linear mixed model we need to construct \(y\) and \(X\) from the mouse data. Let us just have a quick look at the mouse data again:

Here we will estimate additive genomic values for body weight. The vector of observed trait values for body weight can be extracted from the mouse data as follows:

y <- mouse[,"BW"]

To make the \(X\) matrix we need to decide which variables we should include as fixed effects in the model. Here we use the variables sex and reps. The model.matrix function can be used to construct the \(X\) matrix in the linear mixed model specified above:

X <- model.matrix(BW ~ sex + reps, data=mouse)

Estimating genetic parameters on the mouse data using REML

The GBLUP analysis is based on estimates of the variance components (i.e. \(\sigma_{a}^2\) and \(\sigma_{e}^2\)). The variance components are estimated using REML method. The input required the vector of observed values of the trait (\(y\)), the design matrix for the fixed effects (\(X\)), and the genetic relationship matrix (\(A\)).

The genetic relationship matrix \(A\) include relationships for all individuals in the pedigree. However only a subset of the individuals have phenotypes recorded for body weight. Therefore we need to subset the \(A\) matrix. The REML analysis is done using the following command:

ids <- rownames(X)
fit <- greml(y=y,X=X, GRM=list(A=A[ids,ids]))

The fit object contains estimates of variance components \(\sigma_a^2\) and \(\sigma_e^2\) which are in the fit$theta slot of the fit:

fit$theta
##         A         E 
##  6.569611 13.384147
Va <- fit$theta[1] # First element in theta is the additive genetic variance
Ve <- fit$theta[2] # Second element in theta is the residual variance

Estimating additive genomic values for traits in the mouse data using GBLUP:

The goal of the GBLUP analysis is the estimate the fixed ,\(b\), and random genetic effects, \(a\), in the linear mixed model specified above. This can be done using the BLUE and `BLUP´ equations as shown previously. The best linear unbiased prediction (BLUP) of \(\hat{a}\) is:

\[\begin{equation} \hat{a} = G\sigma_a^2V^{-1}(y - X\hat{b}) \end{equation}\]

The best linear unbiased estimator (BLUE) of \(\hat{b}\) is:

\[\begin{equation} \hat{b} = (X'V^{-1} X)^{-1} X' V^{-1} y \end{equation}\].

We have allready determined \(y\) and \(X\) and therefore just need to construct the phenotypic covariance matrix \(V\) (and it’s inverse). This can be done using the following lines of R code:

ids <- rownames(X) # Individuals with phenotypes
n <- nrow(X)       # Number of individuals in the data set
I <- diag(1,n)     # Identity matrix for residual effects  
V <-  G[ids,ids]*Va + I*Ve  # Phenotypic variance covariance matrix
Vi <- solve(V)     # Inverse of phenotypic covariance matrix

The solution to the fixed effects, \(b\), can be found using the following R command:

bhat <- solve(t(X) %*% Vi %*% X)%*%t(X) %*% Vi %*% y
bhat
##                   [,1]
## (Intercept) 33.9244213
## sexMale      8.4114162
## reps2       -0.5009359
## reps3        2.6993539

The solution to the random genetic effects, \(a\), can be found using the following R command:

ahat_G <- (G[,ids]*Va)%*% Vi %*% (y-X%*%bhat)  # Genomic based BLUP
V <-  A[ids,ids]*Va + I*Ve  # Phenotypic variance covariance matrix
Vi <- solve(V)     # Inverse of phenotypic covariance matrix
bhat <- solve(t(X) %*% Vi %*% X)%*%t(X) %*% Vi %*% y
ahat_A <- (A[,ids]*Va)%*% Vi %*% (y-X%*%bhat)  # Genomic based BLUP
Question 3: Make histogram for the estimated additive genomic values. What do you think about their distribution?

Answer:

layout(matrix(1:2,ncol=2))
hist(ahat_A)
hist(ahat_G)

We want to compare the additive genomic values estimated using pedigree and genomic information. This can be done by computing the correlation or by making a scatter plot for the additive genetic/genomic values obtained using pedigree or genetic marker information.

Question 4: Make a scatter plot and compute correlation for the estimated additive genetic values and additive genomic values. What do you think about their relationship?

Answer:

plot(ahat_G,ahat_A)

cor(ahat_G,ahat_A)
##           [,1]
## [1,] 0.6996325