#install.packages("corrplot")
library(corrplot)
In this small exercise we will use the mouse population data you have used in previous exercises. The dataset includes phenotypic measurements related to growth and obesity, such as body weight and blood glucose levels, along with genetic marker data.
The M16 mouse was developed as an outbred model for early-onset polygenic obesity and diabesity. This was achieved through 27 generations of selective breeding for weight gain between 3 and 6 weeks of age, starting from an outbred ICR base population. The breeding strategy involved selecting the male and female with the highest weight gain within each litter. A parallel ICR control line was maintained through random mating while preserving a similar effective population size. Compared to ICR mice, M16 mice are consistently larger at all ages and exhibit increased body fat percentage, fat cell size and number, as well as larger organ weights. They also display hyperphagia, moderate obesity, and metabolic abnormalities, including hyperglycemia, hyperinsulinemia, and hypercholesterolemia.
The ICR mouse is an albino strain originally derived from SWISS mice and selectively bred by Dr. Hauschka to establish a highly fertile line. Named after the Institute of Cancer Research (ICR) in the USA, where it was developed, this strain has been widely distributed for research. ICR mice are relatively large, docile, and grow well, making them a widely used general-purpose model in various research fields, including toxicity studies, pharmacology, drug efficacy, and immunology.
A large F2 population (n=1181) was established by crossing the M16 and ICR lines. Twelve F1 families resulted from six pair matings of M16 males x ICR females and six pair matings of the reciprocal cross. A total of 55 F1 dams were mated to 11 F1 sires in sets of five F1 full sisters mated to the same F1 sire. These same specific matings were repeated in three consecutive replicates. Thus, the F2 population consisted of 55 full-sib families of up to 24 individuals each and 11 sire families families of up to 120 individuals each. Actual numbers of mice within families varied slightly due to a small number of failed pregnancies. All litters were standardized at birth to eight pups, with approximately equal representation of males and females.
More information about the mouse data can be found in the following publications:
Allan, M.F., Eisen, E.J. and Pomp, D. (2004). The M16 Mouse: An Outbred Animal Model of Early Onset Polygenic Obesity and Diabesity. Obesity Research, 12: 1397-1407. https://doi.org/10.1038/oby.2004.176
Allan, M. F., Eisen E. J and Pomp, D. (2005). Genomic Mapping of Direct and Correlated Responses to Long-Term Selection for Rapid Growth Rate in Mice. Genetics, 170(4): 1863–1877. https://doi.org/10.1534/genetics.105.041319
First, you should read the mouse phenotype and genotype data:
mouse <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mouseqtl.rds"))
ids <- rownames(mouse)
genotypes <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousegenotypes_imputed.rds"))
genotypes <- genotypes[ids,]
Answer:
ncol(genotypes)
## [1] 1813
Answer:
table(genotypes[,1])/nrow(genotypes)
##
## 0 1 2
## 0.82497876 0.14358539 0.03143585
Answer:
p.A <- (2*sum(genotypes[,1]==0) + sum(genotypes[,1]==1) ) / (2*nrow(genotypes))
print(p.A)
## [1] 0.8967715
p.a <- (2*sum(genotypes[,1]==2) + sum(genotypes[,1]==1) ) / (2*nrow(genotypes))
print(p.a)
## [1] 0.1032285
We can check whether we have computed the frequencies correct as we know they should sum to 1
(p.A + p.a) == 1
## [1] TRUE
Answer:
par(lwd=1.5)
freq <- apply(genotypes, 2, function(x){ (2*sum(x==0) + sum(x==1) ) / (2*nrow(genotypes)) } )
hist(freq, las=1, cex.axis=.8, xlab="Allele frequency of the first allele", main="", col="lightblue", breaks=20, xlim=c(0,1), ylab="No SNPs")
Answer:
min(freq)
## [1] 0.0480034
colnames(genotypes)[which.min(freq)]
## [1] "1031"
max(freq)
## [1] 0.9961767
colnames(genotypes)[which.max(freq)]
## [1] "1757"
Answer:
keep <- freq > 0.05 & freq < 0.95
genotypes <- genotypes[,keep]
par(lwd=1.5)
freq <- apply(genotypes, 2, function(x){ (2*sum(x==0) + sum(x==1) ) / (2*nrow(genotypes)) } )
hist(freq, las=1, cex.axis=.8, xlab="Allele frequency of the first allele", main="", col="lightblue", breaks=20, xlim=c(0,1), ylab="No SNPs")
abline(v=0.05, lwd=2,lty=2,col="darkred")
abline(v=0.95, lwd=2,lty=2,col="darkred")
Answer:
Answer:
dat <- cbind(mouse, genotypes[,1])
colnames(dat)[ncol(dat)] <- "SNP"
fit <- lm(BW ~ SNP + sex + reps, data=dat)
summary(fit)
##
## Call:
## lm(formula = BW ~ SNP + sex + reps, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3156 -2.7668 -0.0638 2.7032 21.6181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.8068 0.2481 136.280 <2e-16 ***
## SNP 0.5333 0.2632 2.026 0.043 *
## sexMale 8.3253 0.2505 33.230 <2e-16 ***
## reps2 -0.3964 0.2962 -1.338 0.181
## reps3 2.7817 0.3158 8.809 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.296 on 1172 degrees of freedom
## Multiple R-squared: 0.514, Adjusted R-squared: 0.5124
## F-statistic: 309.9 on 4 and 1172 DF, p-value: < 2.2e-16
plot(x=dat$SNP+runif(nrow(dat),-.3,.3), y=dat$BW, pch=21, bg="lightblue", ylab="Body weight", xlab="Genotype", xaxt="n",cex.axis=.8, las=1)
axis(side=1, at=c(0,1,2),cex.axis=.8)
abline(lm(BW ~ SNP, data=dat),col="darkorange", lwd=2,lty=2)
mtext(side=3, expression(paste(beta,"=0.53; ", italic(P), "=0.043")), line=.5)
Answer:
ass.res <- vector(length=ncol(genotypes),mode="list")
for(i in 1:ncol(genotypes)){
dat <- cbind(mouse, genotypes[,i])
colnames(dat)[ncol(dat)] <- "SNP"
fit <- lm(BW ~ SNP + sex + reps, data=dat)
ass.res[[i]] <- summary(fit)$coefficients["SNP",]
}
ass.res <- do.call(rbind,ass.res)
rownames(ass.res) <- colnames(genotypes)
head(ass.res)
## Estimate Std. Error t value Pr(>|t|)
## 1 0.5333053 0.2632393 2.0259338 0.04299776
## 2 -0.1754671 0.2091542 -0.8389365 0.40167603
## 3 -0.1175216 0.2017812 -0.5824209 0.56039524
## 4 -0.1627946 0.2000687 -0.8136935 0.41598581
## 5 -0.3741368 0.2009255 -1.8620675 0.06284370
## 6 0.3741368 0.2009255 1.8620675 0.06284370
summary(ass.res)
## Estimate Std. Error t value Pr(>|t|)
## Min. :-1.506333 Min. :0.1574 Min. :-8.581529 Min. :0.00000
## 1st Qu.:-0.274878 1st Qu.:0.1794 1st Qu.:-1.361647 1st Qu.:0.01457
## Median : 0.002049 Median :0.1874 Median : 0.009301 Median :0.14812
## Mean : 0.007404 Mean :0.2052 Mean : 0.057254 Mean :0.29004
## 3rd Qu.: 0.301327 3rd Qu.:0.2114 3rd Qu.: 1.505972 3rd Qu.:0.54749
## Max. : 1.751702 Max. :0.4087 Max. : 9.869937 Max. :0.99720
We can summarise the results from all SNP associations by plotting the \(-log_{10}(P)\)-value from each linear model.
plot(x=1:nrow(ass.res), y=-log10(ass.res[,"Pr(>|t|)"]), xlab=",labels=Genetic variants",
las=1, ylab="-log10(P)", bty="n", pch=21, cex=.8, xaxt="n")
axis(side=1, at=c(1,nrow(ass.res)),cex.axis=.8, labels=NA)
Answer:
The \(P\)-values must be adjusted for multiple testing to control the risk of false positives (Type I errors). When conducting multiple statistical tests, the likelihood of incorrectly rejecting at least one true null hypothesis increases. Adjustment methods, such as Bonferroni or FDR correction, help maintain the overall error rate, ensuring that significant results are not due to random chance. The most simple approach to account for multiple testing is the Bonferroni correction. The Bonferroni corrected significance threshold is obtained by dividing the normal significance level (\(\alpha\)=0.05) by the number of statistical tests performed.
In this example we performed 1177 linear regressions, thus, the new
significance level should be
0.05/nrow(genotypes)
=4.2480884^{-5}. We can modify the plot
from above to highlight SNPs that have a \(P\)-below the Bonferroni corrected
significance level.
cols <- c("white","lightblue")
plot(x=1:nrow(ass.res), y=-log10(ass.res[,"Pr(>|t|)"]), xlab="Genetic variants", las=1, ylab="-log10(P)", bty="n", pch=21, cex=.8,
bg=cols[1+(ass.res[,"Pr(>|t|)"]<0.05/nrow(genotypes))], xaxt="n")
axis(side=1, at=c(1,nrow(ass.res)),cex.axis=.8,labels=NA)
abline(h=-log10(0.05/nrow(genotypes)),lty=2,col="grey")
Answer:
We can quantify the degree of LD by computing the correlation among the SNPs.
r <- cor(genotypes)
r[1:5,1:5]
## 1 2 3 4 5
## 1 1.0000000 -0.67191970 0.31821664 0.31547023 0.2242645
## 2 -0.6719197 1.00000000 0.02638161 0.02615392 -0.4912642
## 3 0.3182166 0.02638161 1.00000000 0.99136935 0.4725584
## 4 0.3154702 0.02615392 0.99136935 1.00000000 0.4793118
## 5 0.2242645 -0.49126424 0.47255845 0.47931179 1.0000000
corrplot(r, method="color", bg="white", col= colorRampPalette(rev(c("#B4362AFF", "#DA5A5AFF","white", "#62AFD7FF", "#233253FF")))(10), tl.pos="n", outline=FALSE, xlab=FALSE, ylab=FALSE, is.corr=T,type = 'lower', diag = FALSE)
Answer:
plot(abs(r[,which.min(ass.res[,"Pr(>|t|)"])]),type="l", las=1, ylab="Absolute correlation", cex.axis=.8, xaxt="n", xlab="")
axis(side=1, at=c(1,which.min(ass.res[,"Pr(>|t|)"]),ncol(r)), labels=c("-241","index","+1513"))
Answer:
locus <- which.min(ass.res[,"Pr(>|t|)"])
locus.range <- c((locus-100):locus:(locus+100))
locus.zoom <- as.data.frame(ass.res[locus.range,])
locus.zoom$ld <- abs(r[locus,locus.range])
rbPal <- colorRampPalette(c("#FCDD23FF", "#F8B100FF", "#CA697CFF", "#AB74CFFF", "#48439BFF"))
locus.zoom$cols <- rbPal(10)[as.numeric(cut(locus.zoom$ld,breaks = 10))]
plot(x=locus.range, y=-log10(locus.zoom[,"Pr(>|t|)"]),xlab="Genetic variants", las=1, ylab="-log10(P)", bty="n", pch=21,
bg=locus.zoom$cols, xaxt="n")
legend("topright",title="LD (abs correlation)",legend=unique(cut(locus.zoom$ld,breaks = 10)),pt.bg =rbPal(10),pch=21,cex=.7)
axis(side=1, at=c((locus-100),locus,(locus+100)),cex.axis=.8,labels=c("","index", ""))
abline(h=-log10(0.05/nrow(genotypes)),lty=2,col="grey")
Based on the paper by Uffelmann et al. (2021) discuss in groups the following questions.