Human Genomics

Exercise 1

In this small exercise we will use the mouse population data you have used in previous exercises. The dataset includes phenotypic measurements related to growth and obesity, such as body weight and blood glucose levels, along with genetic marker data.

The M16 mouse was developed as an outbred model for early-onset polygenic obesity and diabesity. This was achieved through 27 generations of selective breeding for weight gain between 3 and 6 weeks of age, starting from an outbred ICR base population. The breeding strategy involved selecting the male and female with the highest weight gain within each litter. A parallel ICR control line was maintained through random mating while preserving a similar effective population size. Compared to ICR mice, M16 mice are consistently larger at all ages and exhibit increased body fat percentage, fat cell size and number, as well as larger organ weights. They also display hyperphagia, moderate obesity, and metabolic abnormalities, including hyperglycemia, hyperinsulinemia, and hypercholesterolemia.

The ICR mouse is an albino strain originally derived from SWISS mice and selectively bred by Dr. Hauschka to establish a highly fertile line. Named after the Institute of Cancer Research (ICR) in the USA, where it was developed, this strain has been widely distributed for research. ICR mice are relatively large, docile, and grow well, making them a widely used general-purpose model in various research fields, including toxicity studies, pharmacology, drug efficacy, and immunology.

A large F2 population (n=1181) was established by crossing the M16 and ICR lines. Twelve F1 families resulted from six pair matings of M16 males x ICR females and six pair matings of the reciprocal cross. A total of 55 F1 dams were mated to 11 F1 sires in sets of five F1 full sisters mated to the same F1 sire. These same specific matings were repeated in three consecutive replicates. Thus, the F2 population consisted of 55 full-sib families of up to 24 individuals each and 11 sire families families of up to 120 individuals each. Actual numbers of mice within families varied slightly due to a small number of failed pregnancies. All litters were standardized at birth to eight pups, with approximately equal representation of males and females.

More information about the mouse data can be found in the following publications:

Allan, M.F., Eisen, E.J. and Pomp, D. (2004). The M16 Mouse: An Outbred Animal Model of Early Onset Polygenic Obesity and Diabesity. Obesity Research, 12: 1397-1407. https://doi.org/10.1038/oby.2004.176

Allan, M. F., Eisen E. J and Pomp, D. (2005). Genomic Mapping of Direct and Correlated Responses to Long-Term Selection for Rapid Growth Rate in Mice. Genetics, 170(4): 1863–1877. https://doi.org/10.1534/genetics.105.041319

First, you should read the mouse phenotype and genotype data:

mouse <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mouseqtl.rds"))
ids <- rownames(mouse)
genotypes <- readRDS(url("https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/mousegenotypes_imputed.rds"))
genotypes <- genotypes[ids,]

Question 1: How many genetics variants are there in total in the dataset?

Answer:

ncol(genotypes)

## [1] 1813

Question 2: What is the genotype frequencies of the first SNP?

Answer:

table(genotypes[,1])/nrow(genotypes)

## 
##          0          1          2 
## 0.82497876 0.14358539 0.03143585

Question 3: What is the allele frequencies of the first SNP?

Answer:

p.A <- (2*sum(genotypes[,1]==0) + sum(genotypes[,1]==1) ) / (2*nrow(genotypes))
print(p.A)

## [1] 0.8967715

p.a <- (2*sum(genotypes[,1]==2) + sum(genotypes[,1]==1) ) / (2*nrow(genotypes))
print(p.a)

## [1] 0.1032285

We can check whether we have computed the frequencies correct as we know they should sum to 1

(p.A + p.a) == 1

## [1] TRUE

Question 4: Make a plot showing the distribution of allele frequencies across all SNPs.

Answer:

par(lwd=1.5)
freq <- apply(genotypes, 2, function(x){ (2*sum(x==0) + sum(x==1) ) / (2*nrow(genotypes)) } )
hist(freq, las=1, cex.axis=.8, xlab="Allele frequency of the first allele", main="", col="lightblue", breaks=20, xlim=c(0,1), ylab="No SNPs")

Question 5: What are the minimum and maxium allele frequencies?

Answer:

min(freq)

## [1] 0.0480034

colnames(genotypes)[which.min(freq)]

## [1] "1031"

max(freq)

## [1] 0.9961767

colnames(genotypes)[which.max(freq)]

## [1] "1757"

Question 6: Remove SNPs with frequency below 5%

Answer:

keep <- freq > 0.05 & freq < 0.95 
genotypes <- genotypes[,keep]
par(lwd=1.5)

freq <- apply(genotypes, 2, function(x){ (2*sum(x==0) + sum(x==1) ) / (2*nrow(genotypes)) } )
hist(freq, las=1, cex.axis=.8, xlab="Allele frequency of the first allele", main="", col="lightblue", breaks=20, xlim=c(0,1), ylab="No SNPs")
abline(v=0.05, lwd=2,lty=2,col="darkred")
abline(v=0.95, lwd=2,lty=2,col="darkred")

Question 7: Why is it important to remove SNPs with rare alleles (here rare defined by frequency of 5%) when doing genetic associations?

Answer:

Question 8: Associate body weight with the first SNP. What does the statistics tell about the association between BW and the first SNP?

Answer:

dat <- cbind(mouse, genotypes[,1])
colnames(dat)[ncol(dat)] <- "SNP"

fit <- lm(BW ~ SNP + sex + reps, data=dat)
summary(fit)

## 
## Call:
## lm(formula = BW ~ SNP + sex + reps, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3156  -2.7668  -0.0638   2.7032  21.6181 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.8068     0.2481 136.280   <2e-16 ***
## SNP           0.5333     0.2632   2.026    0.043 *  
## sexMale       8.3253     0.2505  33.230   <2e-16 ***
## reps2        -0.3964     0.2962  -1.338    0.181    
## reps3         2.7817     0.3158   8.809   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.296 on 1172 degrees of freedom
## Multiple R-squared:  0.514,  Adjusted R-squared:  0.5124 
## F-statistic: 309.9 on 4 and 1172 DF,  p-value: < 2.2e-16

plot(x=dat$SNP+runif(nrow(dat),-.3,.3), y=dat$BW, pch=21, bg="lightblue", ylab="Body weight", xlab="Genotype", xaxt="n",cex.axis=.8, las=1)
axis(side=1, at=c(0,1,2),cex.axis=.8)
abline(lm(BW ~ SNP, data=dat),col="darkorange", lwd=2,lty=2)
mtext(side=3, expression(paste(beta,"=0.53; ", italic(P), "=0.043")), line=.5)

Question 9: Associate body weight with all available SNPs. Investigate the results.

Answer:

ass.res <- vector(length=ncol(genotypes),mode="list")

for(i in 1:ncol(genotypes)){
  dat <- cbind(mouse, genotypes[,i])
  colnames(dat)[ncol(dat)] <- "SNP"
  fit <- lm(BW ~ SNP + sex + reps, data=dat)
  ass.res[[i]] <- summary(fit)$coefficients["SNP",]  
}
ass.res <- do.call(rbind,ass.res)
rownames(ass.res) <- colnames(genotypes)

head(ass.res)

##     Estimate Std. Error    t value   Pr(>|t|)
## 1  0.5333053  0.2632393  2.0259338 0.04299776
## 2 -0.1754671  0.2091542 -0.8389365 0.40167603
## 3 -0.1175216  0.2017812 -0.5824209 0.56039524
## 4 -0.1627946  0.2000687 -0.8136935 0.41598581
## 5 -0.3741368  0.2009255 -1.8620675 0.06284370
## 6  0.3741368  0.2009255  1.8620675 0.06284370

summary(ass.res)

##     Estimate           Std. Error        t value             Pr(>|t|)      
##  Min.   :-1.506333   Min.   :0.1574   Min.   :-8.581529   Min.   :0.00000  
##  1st Qu.:-0.274878   1st Qu.:0.1794   1st Qu.:-1.361647   1st Qu.:0.01457  
##  Median : 0.002049   Median :0.1874   Median : 0.009301   Median :0.14812  
##  Mean   : 0.007404   Mean   :0.2052   Mean   : 0.057254   Mean   :0.29004  
##  3rd Qu.: 0.301327   3rd Qu.:0.2114   3rd Qu.: 1.505972   3rd Qu.:0.54749  
##  Max.   : 1.751702   Max.   :0.4087   Max.   : 9.869937   Max.   :0.99720

We can summarise the results from all SNP associations by plotting the \(-log_{10}(P)\)-value from each linear model.

plot(x=1:nrow(ass.res), y=-log10(ass.res[,"Pr(>|t|)"]), xlab=",labels=Genetic variants", 
     las=1, ylab="-log10(P)", bty="n", pch=21, cex=.8, xaxt="n")
axis(side=1, at=c(1,nrow(ass.res)),cex.axis=.8, labels=NA)

Question 10: How many SNPs are associated with body weight?

Answer:

The \(P\)-values must be adjusted for multiple testing to control the risk of false positives (Type I errors). When conducting multiple statistical tests, the likelihood of incorrectly rejecting at least one true null hypothesis increases. Adjustment methods, such as Bonferroni or FDR correction, help maintain the overall error rate, ensuring that significant results are not due to random chance. The most simple approach to account for multiple testing is the Bonferroni correction. The Bonferroni corrected significance threshold is obtained by dividing the normal significance level (\(\alpha\)=0.05) by the number of statistical tests performed.

In this example we performed 1177 linear regressions, thus, the new significance level should be 0.05/nrow(genotypes)=4.2480884^{-5}. We can modify the plot from above to highlight SNPs that have a \(P\)-below the Bonferroni corrected significance level.

cols <- c("white","lightblue")
plot(x=1:nrow(ass.res), y=-log10(ass.res[,"Pr(>|t|)"]), xlab="Genetic variants", las=1, ylab="-log10(P)", bty="n", pch=21, cex=.8,
     bg=cols[1+(ass.res[,"Pr(>|t|)"]<0.05/nrow(genotypes))], xaxt="n")
axis(side=1, at=c(1,nrow(ass.res)),cex.axis=.8,labels=NA)
abline(h=-log10(0.05/nrow(genotypes)),lty=2,col="grey")

Question 11: Are the SNPs tested independent, or are they in LD?

Answer:

We can quantify the degree of LD by computing the correlation among the SNPs.

r <- cor(genotypes)
r[1:5,1:5]

##            1           2          3          4          5
## 1  1.0000000 -0.67191970 0.31821664 0.31547023  0.2242645
## 2 -0.6719197  1.00000000 0.02638161 0.02615392 -0.4912642
## 3  0.3182166  0.02638161 1.00000000 0.99136935  0.4725584
## 4  0.3154702  0.02615392 0.99136935 1.00000000  0.4793118
## 5  0.2242645 -0.49126424 0.47255845 0.47931179  1.0000000

corrplot(r, method="color", bg="white", col= colorRampPalette(rev(c("#B4362AFF", "#DA5A5AFF","white", "#62AFD7FF", "#233253FF")))(10), tl.pos="n", outline=FALSE, xlab=FALSE, ylab=FALSE, is.corr=T,type = 'lower', diag = FALSE)

Question 12: Is there LD between the SNP with strongest association signal and the remaining SNPs?

Answer:

plot(abs(r[,which.min(ass.res[,"Pr(>|t|)"])]),type="l", las=1, ylab="Absolute correlation", cex.axis=.8, xaxt="n", xlab="")
axis(side=1, at=c(1,which.min(ass.res[,"Pr(>|t|)"]),ncol(r)), labels=c("-241","index","+1513"))

Question 13: Does LD affect our interpretation of the genetic associations?

Answer:

locus <- which.min(ass.res[,"Pr(>|t|)"])
locus.range <- c((locus-100):locus:(locus+100))

locus.zoom <- as.data.frame(ass.res[locus.range,])
locus.zoom$ld <- abs(r[locus,locus.range])

rbPal <- colorRampPalette(c("#FCDD23FF", "#F8B100FF", "#CA697CFF", "#AB74CFFF", "#48439BFF"))
locus.zoom$cols <- rbPal(10)[as.numeric(cut(locus.zoom$ld,breaks = 10))]

plot(x=locus.range, y=-log10(locus.zoom[,"Pr(>|t|)"]),xlab="Genetic variants", las=1, ylab="-log10(P)", bty="n", pch=21,
     bg=locus.zoom$cols, xaxt="n")
legend("topright",title="LD (abs correlation)",legend=unique(cut(locus.zoom$ld,breaks = 10)),pt.bg =rbPal(10),pch=21,cex=.7)
axis(side=1, at=c((locus-100),locus,(locus+100)),cex.axis=.8,labels=c("","index", ""))
abline(h=-log10(0.05/nrow(genotypes)),lty=2,col="grey")

Human Genomics

Session 6: GWAS

Palle Duun Rohde

Exercise 1

Question 1: How many genetics variants are there in total in the dataset?

Question 2: What is the genotype frequencies of the first SNP?

Question 3: What is the allele frequencies of the first SNP?

Question 4: Make a plot showing the distribution of allele frequencies across all SNPs.

Question 5: What are the minimum and maxium allele frequencies?

Question 6: Remove SNPs with frequency below 5%

Question 7: Why is it important to remove SNPs with rare alleles (here rare defined by frequency of 5%) when doing genetic associations?

Question 8: Associate body weight with the first SNP. What does the statistics tell about the association between BW and the first SNP?

Question 9: Associate body weight with all available SNPs. Investigate the results.

Question 10: How many SNPs are associated with body weight?

Question 11: Are the SNPs tested independent, or are they in LD?

Question 12: Is there LD between the SNP with strongest association signal and the remaining SNPs?

Question 13: Does LD affect our interpretation of the genetic associations?

Exercise 2