The aim of this practical is to provide a simple introduction to computing polygenic scores (PGS) of complex traits and diseases. The practical is a mix of theoretical and practical exercises in R that illustrate and apply the theory presented in the corresponding lecture on PGS.
This practical provides a step-by-step guide to performing basic PGS analyses, including preparation of phenotype and genotype data, computation of GWAS summary statistics, clumping and thresholding, and evaluation of prediction accuracy.
Polygenic scores combine information from large numbers of markers across the genome (hundreds to millions) to give a single numerical score for an individual’s relative genetic predisposition for a specific disease or trait on the basis of the DNA variants they have inherited.
For a particular disease or trait a PGS is calculated as: \[PGS=\sum_{i=1}^mX_i \hat{b}_i\] where \(X_i\) is the genotype (allele count) for the i’th genetic marker, \(\hat{b}_i\) its estimated weight, and \(m\) the number of markers.
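As a small numeric illustration of this formula (a minimal sketch with made-up genotypes and marker weights, not taken from the data used later in this practical), each individual's PGS is simply the weighted sum of its allele counts:
# Toy example of the PGS formula (all values below are assumed)
set.seed(1)
X <- matrix(sample(0:2, 5*4, replace=TRUE), nrow=5,
            dimnames=list(paste0("IND",1:5), paste0("SNP",1:4)))  # allele counts (0/1/2)
b <- c(0.2, -0.1, 0.05, 0.3)                                      # assumed marker weights
pgs_toy <- as.vector(X %*% b)                                     # PGS = sum over markers of X*b
names(pgs_toy) <- rownames(X)
pgs_toy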
Genomic prediction has been used for many years in animal and plant breeding (e.g., Meuwissen et al. 2001), and it has gained popularity during the last decade.
Polygenic risk score, polygenic score, genomic risk score, genetic score, genetic predisposition, genetic value, and genomic breeding value are (more or less) the same thing.
For many complex traits and diseases there will be thousands of genetic variants that each contribute a small effect to the disease risk or quantitative trait. Rare variants with large effects will only explain a small proportion of \(h^2\) (low predictive potential), whereas common variants with small effects can explain a larger proportion of \(h^2\) (high predictive potential). The majority of complex traits and common diseases in humans are heritable. The heritability determines the value of using genetics for risk prediction. In general, large data sets are required to obtain accurate marker estimates of small to moderate effects, which also improves the prediction accuracy.
The heritability (\(h^2\)) quantifies
the degree of variation in a phenotypic trait in a population that is
due to genetic variation between individuals in that population. It
measures how much of the variation of a trait can be attributed to
variation of genetic factors, as opposed to variation of environmental
factors. The narrow sense heritability is the ratio of additive genetic
variance (\(\sigma^2_{a}\)) to the
overall phenotypic variance (\(\sigma^2_{y}=\sigma^2_{a}+\sigma^2_{e}\)):
\[\begin{align}
h^2 &= \sigma^2_{a}/(\sigma^2_a+\sigma^2_e)
\end{align}\] A heritability of 0 implies that no genetic effects
influence the observed variation in the trait, while a heritability of 1
implies that all of the variation in the trait is explained by the
genetic effects. In general, the amount of information provided by the
phenotype about the genetic risk is determined by the heritability. Note
that heritability is population-specific and a heritability of 0 does
not necessarily imply that there is no genetic determinism for the
trait.
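As a simple numeric illustration (the variance components below are made-up values, not estimates from the data used in this practical):
# Toy example: narrow sense heritability from assumed variance components
sigma2_a <- 0.4                        # additive genetic variance (assumed)
sigma2_e <- 0.6                        # environmental variance (assumed)
h2 <- sigma2_a / (sigma2_a + sigma2_e) # h2 = 0.4
h2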
qgg R package
The practical is based on the R package qgg (Rohde et al. 2021, 2022). This package provides an infrastructure for efficient processing of large-scale genetic and phenotypic data, including core functions for estimating genetic parameters, single-marker association analyses, genomic prediction, and polygenic scoring. qgg handles large-scale data by taking advantage of multi-core processing and fast, memory-efficient batch processing of genotype data stored in binary (PLINK) files.
You can install qgg from CRAN with:
install.packages("qgg")
Input data used by the qgg package
All functions in qgg used for analysis of complex traits rely on a simple data infrastructure that takes the following main input:
y: vector, matrix or list of phenotypes
X: design matrix for non-genetic factors
W: matrix of centered and scaled genotypes (in memory)
Glist: list structure providing information on genotypes, sparse LD, and LD scores (on disk)
stat: data frame with marker summary statistics
sets: list of sets with marker ids
ids: vector of ids of individuals
rsids: vector of marker ids
In this practical we will compute PGS based on simulated data. The data consist of a disease phenotype, covariates, and SNP data. The data used in this practical are intended for demonstration purposes only.
library(data.table)
dir.create("./PGS")
setwd("./PGS")
Genetic data are commonly stored in a binary format (as used by the software PLINK), named .bed files. These files must be accompanied by .bim (information about the genetic variants) and .fam (information about the individuals) files. Read more about these file formats in the PLINK documentation.
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.bed"
download.file( url=url, mode = "wb", destfile="human.bed")
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.bim"
download.file( url=url, destfile="human.bim")
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.fam"
download.file( url=url, destfile="human.fam")
Note the mode="wb" argument when downloading the human.bed file. This is needed because otherwise the bed-file will be corrupted, and a corrupted data file can cause errors in the analyses.
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.pheno"
download.file( url=url, destfile="human.pheno")
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.covar"
download.file( url=url, destfile="human.covar")
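Before proceeding, it can be useful to verify that the downloaded files exist and are not empty (a minimal sanity check using base R; the expected file sizes are not stated here):
files <- c("human.bed", "human.bim", "human.fam", "human.pheno", "human.covar")
file.exists(files)   # should all be TRUE
file.size(files)     # should all be greater than zero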
One of the first things to do is to prepare the phenotypic data used in the analysis. The goal is to understand the variables: how many records the data set contains, how many values are missing, what the variable structure is, what the variable relationships are, and more.
Several functions can be used for this (e.g., str(), head(), dim(), table(), is.na()).
library(data.table)
pheno <- fread(input="human.pheno",
data.table=FALSE)
covar <- fread(input="human.covar",
data.table=FALSE)
To get an overview of the data set you are working with, you can use the str() or head() functions:
str(pheno)
## 'data.frame': 5000 obs. of 3 variables:
## $ V1: chr "IND1" "IND2" "IND3" "IND4" ...
## $ V2: chr "IND1" "IND2" "IND3" "IND4" ...
## $ V3: int 0 0 1 0 0 1 0 0 1 1 ...
str(covar)
## 'data.frame': 5000 obs. of 14 variables:
## $ V1 : chr "IND1000" "IND1001" "IND1002" "IND1003" ...
## $ V2 : chr "IND1000" "IND1001" "IND1002" "IND1003" ...
## $ V3 : int 1 0 0 0 0 1 1 1 1 0 ...
## $ V4 : num 61.2 62.4 65.6 55.4 64.1 ...
## $ V5 : num 0.00488 -0.0063 -0.00522 0.02028 -0.00411 ...
## $ V6 : num -0.002097 0.000148 -0.000606 0.006778 0.004349 ...
## $ V7 : num 0.01091 0.00469 0.01709 -0.00306 -0.00839 ...
## $ V8 : num 0.0056 0.01611 -0.00442 -0.01284 0.01022 ...
## $ V9 : num -0.008546 -0.00262 0.034218 -0.003078 -0.000912 ...
## $ V10: num 0.019736 0.000308 -0.005157 0.012835 0.004788 ...
## $ V11: num -0.00408 0.00961 -0.00289 -0.01387 0.00977 ...
## $ V12: num 0.01435 -0.01479 -0.00901 0.01499 0.00899 ...
## $ V13: num -0.00152 -0.00708 0.00706 0.00474 -0.00381 ...
## $ V14: num 0.003998 0.037763 -0.000359 -0.022958 -0.010249 ...
head(pheno)
## V1 V2 V3
## 1 IND1 IND1 0
## 2 IND2 IND2 0
## 3 IND3 IND3 1
## 4 IND4 IND4 0
## 5 IND5 IND5 0
## 6 IND6 IND6 1
head(covar)
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 IND1000 IND1000 1 61.24445 0.004881 -0.002097 0.010908 0.005597 -0.008546
## 2 IND1001 IND1001 0 62.42271 -0.006301 0.000148 0.004692 0.016107 -0.002620
## 3 IND1002 IND1002 0 65.64274 -0.005215 -0.000606 0.017091 -0.004416 0.034218
## 4 IND1003 IND1003 0 55.40703 0.020284 0.006778 -0.003061 -0.012837 -0.003078
## 5 IND1004 IND1004 0 64.10666 -0.004105 0.004349 -0.008388 0.010221 -0.000912
## 6 IND1005 IND1005 1 65.77823 -0.010636 -0.021853 -0.017450 0.010058 0.000502
## V10 V11 V12 V13 V14
## 1 0.019736 -0.004078 0.014350 -0.001518 0.003998
## 2 0.000308 0.009612 -0.014793 -0.007077 0.037763
## 3 -0.005157 -0.002894 -0.009013 0.007064 -0.000359
## 4 0.012835 -0.013872 0.014987 0.004744 -0.022958
## 5 0.004788 0.009773 0.008988 -0.003814 -0.010249
## 6 0.003807 -0.003007 0.005646 -0.007199 -0.009874
Define the response variable
y <- pheno[,3]
names(y) <- pheno[,1]
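A quick check of the response variable, using some of the functions mentioned above, counts cases and controls and looks for missing values:
table(y)          # number of controls (0) and cases (1)
sum(is.na(y))     # number of missing phenotype records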
Use the histogram and boxplot functions to visualize the distribution of the trait/covariables:
layout(matrix(c(1,2),ncol=2),widths=c(4,4),heights=4)
hist(y,las=1,cex.axis=.8,col="lightblue", ylab="Number of observations")
# align covariate rows to the phenotype order before plotting
boxplot(covar[match(names(y), covar$V1),4]~y, las=1,cex.axis=.8,col="lightblue",ylab="Values for covariate no 4")
Exploratory data analysis is the process of analyzing and visualizing the data to get a better understanding of them; it is not a formal statistical test. Which factors should we include in the statistical model? To answer this question we can fit a logistic regression model that includes these factors. This can be done using the glm() function:
fit <- glm( y ~ V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14,
data=covar, family=binomial(link="logit"))
summary(fit)
##
## Call:
## glm(formula = y ~ V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 +
## V12 + V13 + V14, family = binomial(link = "logit"), data = covar)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.181957 0.214978 -5.498 3.84e-08 ***
## V3 -0.011066 0.065316 -0.169 0.8655
## V4 0.001740 0.003818 0.456 0.6487
## V5 1.309941 2.306100 0.568 0.5700
## V6 -0.280830 2.307468 -0.122 0.9031
## V7 2.354268 2.306546 1.021 0.3074
## V8 -1.349941 2.306494 -0.585 0.5584
## V9 5.520940 2.311308 2.389 0.0169 *
## V10 -2.724333 2.307482 -1.181 0.2377
## V11 -1.929366 2.307348 -0.836 0.4031
## V12 0.578919 2.308375 0.251 0.8020
## V13 4.382604 2.307287 1.899 0.0575 .
## V14 -0.504532 2.306778 -0.219 0.8269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5645.2 on 4999 degrees of freedom
## Residual deviance: 5631.6 on 4987 degrees of freedom
## AIC: 5657.6
##
## Number of Fisher Scoring iterations: 4
The exploration (including quality control) of phenotypes and
covariables is a key step in quantitative genetic analyses. It is,
however, beyond the scope of this practical.
The preparation (including quality control) of genotype data is a key step in quantitative genetic analyses.
library(qgg)
The function gprep()
reads genotype information from
binary PLINK files, and creates the Glist
object that
contains general information about the genotypes:
bedfiles <- "human.bed"
bimfiles <- "human.bim"
famfiles <- "human.fam"
Glist <- gprep(study="Example",
bedfiles=bedfiles,
bimfiles=bimfiles,
famfiles=famfiles)
saveRDS(Glist, file="Glist.RDS", compress=FALSE)
Glist <- readRDS("Glist.RDS")
The output from gprep()
(Glist
) has a list
structure that contains information about the genotypes in the binary
file. Glist
is required for downstream analyses provided in
the qgg package. Typically, the Glist
is prepared once, and
saved as an *.RDS-file. To explore the content of the Glist
object:
names(Glist)
## [1] "study" "bedfiles" "bimfiles" "famfiles" "ids" "n"
## [7] "rsids" "mchr" "a1" "a2" "pos" "chr"
## [13] "cpra" "map" "nmiss" "af" "af1" "af2"
## [19] "maf" "hom" "het" "n0" "n1" "n2"
## [25] "nchr"
str(Glist)
## List of 25
## $ study : chr "Example"
## $ bedfiles: chr "human.bed"
## $ bimfiles: chr "human.bim"
## $ famfiles: chr "human.fam"
## $ ids : chr [1:5000] "IND1000" "IND1001" "IND1002" "IND1003" ...
## $ n : int 5000
## $ rsids :List of 1
## ..$ : chr [1:50000] "1:568670" "1:761732" "1:830791" "1:867663" ...
## $ mchr : int 50000
## $ a1 :List of 1
## ..$ : chr [1:50000] "C" "C" "T" "T" ...
## $ a2 :List of 1
## ..$ : chr [1:50000] "T" "T" "C" "C" ...
## $ pos :List of 1
## ..$ : num [1:50000] 568670 761732 830791 867663 916549 ...
## $ chr :List of 1
## ..$ : chr [1:50000] "1" "1" "1" "1" ...
## $ cpra :List of 1
## ..$ : chr [1:50000] "1_568670_C_T" "1_761732_C_T" "1_830791_T_C" "1_867663_T_C" ...
## $ map :List of 1
## ..$ : num [1:50000] 0.000384 0.492847 0.620827 0.620827 0.69188 ...
## $ nmiss :List of 1
## ..$ : int [1:50000] 0 0 0 0 0 0 0 0 0 0 ...
## $ af :List of 1
## ..$ : num [1:50000] 0.103 0.355 0.264 0.495 0.475 ...
## $ af1 :List of 1
## ..$ : num [1:50000] 0.103 0.355 0.264 0.495 0.475 ...
## $ af2 :List of 1
## ..$ : num [1:50000] 0.897 0.645 0.736 0.505 0.525 ...
## $ maf :List of 1
## ..$ : num [1:50000] 0.103 0.355 0.264 0.495 0.475 ...
## $ hom :List of 1
## ..$ : num [1:50000] 0.812 0.539 0.61 0.496 0.498 ...
## $ het :List of 1
## ..$ : num [1:50000] 0.188 0.461 0.39 0.504 0.502 ...
## $ n0 :List of 1
## ..$ : int [1:50000] 4015 2071 2704 1266 1371 2305 4983 4229 4816 2667 ...
## $ n1 :List of 1
## ..$ : int [1:50000] 942 2303 1948 2518 2510 2169 17 739 180 1988 ...
## $ n2 :List of 1
## ..$ : int [1:50000] 43 626 348 1216 1119 526 0 32 4 345 ...
## $ nchr : int 1
In general it is advisable to perform quality control of the genotype data. The quality control includes removing markers with a low genotyping rate, low minor allele frequency, or deviation from Hardy-Weinberg equilibrium. The function gfilter() can be used for filtering markers:
rsids <- gfilter( Glist = Glist,
excludeMAF=0.05,
excludeMISS=0.05,
excludeCGAT=TRUE,
excludeINDEL=TRUE,
excludeDUPS=TRUE,
excludeHWE=1e-12,
excludeMHC=FALSE)
The gfilter() function outputs the number of variants removed in the different quality control steps.
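The markers that passed the quality control are returned as a vector of marker ids (assigned to rsids above). A quick check of how many markers are retained:
length(rsids)     # number of markers retained after filtering
head(rsids)       # first few retained marker ids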
One of the first steps in PGS analyses is to generate or obtain GWAS summary statistics. Ideally these will correspond to the most powerful GWAS results available for the phenotype under study. In this example, we will use a GWAS on the simulated disease phenotype. We will use only a subset of the data (training data) in the GWAS and the remaining subset (validation data) to assess the accuracy of the polygenic scores. In the example below we only compute summary statistics for the markers that fulfil the quality control criteria.
y <- pheno[,3]
names(y) <- pheno[,1]
X <- model.matrix(~V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14, data=covar)
rownames(X) <- covar$V1
X <- X[names(y),]
sum(names(y)%in%rownames(X))
## [1] 5000
train <- sample(names(y),4000)
valid <- names(y)[!names(y)%in%train]
The function glma can be used for computing GWAS summary statistics. Currently this function only fits a simple linear regression model, but we plan to add further modeling approaches in a future release.
stat <- glma(y=y[train], X=X[train,], Glist=Glist)
Explore the output (stat) from the glma function:
dim(stat)
## [1] 50000 13
head(stat)
## rsids chr pos ea nea eaf b seb
## 1:568670 1:568670 1 568670 C T 0.1028 4.451442e-03 0.007045907
## 1:761732 1:761732 1 761732 C T 0.3555 2.485590e-03 0.006863325
## 1:830791 1:830791 1 830791 T C 0.2644 6.439566e-05 0.006886717
## 1:867663 1:867663 1 867663 T C 0.4950 8.794431e-04 0.006886269
## 1:916549 1:916549 1 916549 A G 0.4748 6.634862e-03 0.006894612
## 1:956852 1:956852 1 956852 C T 0.3221 -7.525315e-03 0.006904714
## stat p n ww wy
## 1:568670 0.631777029 0.5275687 3998 3801.177 16.920717
## 1:761732 0.362155351 0.7172551 3998 4006.377 9.958209
## 1:830791 0.009350705 0.9925398 3998 3979.337 0.256252
## 1:867663 0.127709653 0.8983852 3998 3979.838 3.500041
## 1:916549 0.962325705 0.3359442 3998 3969.310 26.335821
## 1:956852 -1.089880816 0.2758313 3998 3957.444 -29.781015
Polygenic scoring based on summary statistics require the construction of a reference linkage disequilibrium (LD) correlation matrix. The LD matrix corresponds to the correlation between the genotypes of genetic variants across the genome. Here we use a sparse LD matrix approach using a fixed window approach (e.g. number of markers, 1 cM or 1000kb), which sets LD correlation values outside this window to zero.
The function gprep can be used to compute sparse LD matrices, which are stored on disk. The \(r^2\) metric used is the pairwise correlation between markers (based on the allele count of the alternative allele) in a specified region of the genome. Although this step can be slow unless R is linked to a fast BLAS library, it is typically only done once (or a few times).
ldfiles <- "human.ld"
Glist <- gprep( Glist,
task="sparseld",
msize=1000,
rsids=rsids,
ldfiles=ldfiles,
overwrite=TRUE)
saveRDS(Glist, file="Glist_sparseLD_1k.RDS", compress=FALSE)
Note, it may take a long time to compute the LD matrices; therefore, you can download the ld-file and the updated Glist below. Remember to read the new Glist into R.
url <- "https://www.dropbox.com/scl/fi/5g9i631gq67wunk6njptr/human.ld?rlkey=00yk4oev7drk33syp6cjjke8j&dl=0"
download.file( url=url, destfile="human.ld")
url <- "https://www.dropbox.com/scl/fi/r23vhgt5okw37gohz71vu/Glist_sparseLD_1k.RDS?rlkey=ktv1p427kct9j13sydnn5k2a3&dl=0"
download.file( url=url, destfile="Glist_sparseLD_1k.RDS")
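After downloading, read the updated Glist into R so that it replaces the object created above:
Glist <- readRDS("Glist_sparseLD_1k.RDS")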
Polygenic scoring using clumping and thresholding (C+T) is a relatively simple and robust method. Linkage disequilibrium makes it challenging to identify the contribution from causal independent genetic variants. One way of approximately capturing the right level of causal signal is to perform clumping, which removes markers so that only weakly correlated SNPs are retained, preferentially keeping the SNPs most strongly associated with the phenotype under study. The clumping procedure uses a statistic (usually the \(P\)-value) to sort the markers by importance (e.g. keeping the most significant ones). It takes the first one (e.g. the most significant marker) and removes markers (i.e. sets their effect to zero) if they are too correlated (e.g. \(r^2>0.9\)) with this one in a window around it. As opposed to pruning, this procedure makes sure that this marker is never removed, keeping at least one representative marker per region of the genome. Then it goes on with the next most significant marker that has not yet been removed.
Clumping can be performed using the adjStat() function in qgg. The input to the function is the summary statistic (stat), information about the sparse LD matrices stored in the Glist, a threshold of linkage disequilibrium (e.g. \(r^2=0.9\)), and thresholds for the \(P\)-values (threshold = c(0.001, 0.05, ...)):
threshold <- c(0.00001, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5,1)
statAdj <- adjStat(Glist=Glist, stat=stat, r2=0.9, threshold=threshold)
Explore the output (statAdj) using the head() function:
head(statAdj)
## rsids chr pos ea nea eaf b b_1e.05 b_1e.04
## 1:568670 1:568670 1 568670 C T 0.1028 4.451442e-03 0 0
## 1:761732 1:761732 1 761732 C T 0.3555 2.485590e-03 0 0
## 1:830791 1:830791 1 830791 T C 0.2644 6.439566e-05 0 0
## 1:867663 1:867663 1 867663 T C 0.4950 8.794431e-04 0 0
## 1:916549 1:916549 1 916549 A G 0.4748 6.634862e-03 0 0
## 1:956852 1:956852 1 956852 C T 0.3221 -7.525315e-03 0 0
## b_0.001 b_0.005 b_0.01 b_0.05 b_0.1 b_0.2 b_0.5 b_1
## 1:568670 0 0 0 0 0 0 0.000000000 4.451442e-03
## 1:761732 0 0 0 0 0 0 0.000000000 2.485590e-03
## 1:830791 0 0 0 0 0 0 0.000000000 6.439566e-05
## 1:867663 0 0 0 0 0 0 0.000000000 8.794431e-04
## 1:916549 0 0 0 0 0 0 0.006634862 6.634862e-03
## 1:956852 0 0 0 0 0 0 -0.007525315 -7.525315e-03
A plot of the unadjusted marker effects (from the stat data frame) against the adjusted marker effects (from the statAdj data frame) illustrates that the C+T procedure keeps only the most significant marker effects and sets a large number of marker effects to zero (i.e. removes their effect).
plot( y=statAdj[rownames(stat),"b_0.001"], bg="lightblue",pch=21,las=1, cex.axis=.8,
x=stat$b,
xlab="Marginal Effect",
ylab="Adjusted Effect",
frame.plot=FALSE, ylim=c(-0.05,0.05), xlim=c(-0.05,0.05),
main="Shrinkage using C+T \n (p=0.001, r2=0.9)")
For each of the \(P\)-value thresholds chosen in the C+T procedure a PGS is computed as: \[PGS=\sum_{i=1}^mX_i \hat{b}_i\] where \(X_i\) is the genotype (allele count) for the i’th genetic marker and \(\hat{b}_i\) its adjusted weight. The PGS are computed using the gscore() function. The input to the function is the adjusted summary statistics (statAdj) and the information about the genotypes stored in the Glist:
pgs <- gscore(Glist=Glist,stat=statAdj)
It is always important to explore the PGS computed.
head(pgs)
## b b_1e.05 b_1e.04 b_0.001 b_0.005 b_0.01
## IND1000 -1.0412243 -0.3288854 -0.875368300 -1.95246475 -1.3821925 -1.1443251
## IND1001 -0.6144810 -0.4486602 -1.015548505 0.05018968 -0.1643173 0.3441776
## IND1002 -1.0048697 -1.0522848 -1.075989028 -1.55637597 -1.7488470 -1.6067176
## IND1003 2.1195070 -0.4486602 1.927681199 1.23181559 1.1651841 1.0668963
## IND1004 2.1771254 0.5329041 -0.389215989 1.95246583 2.0914713 1.7115670
## IND1005 -0.6129487 -1.0522848 -0.005249977 1.00026982 0.4433743 0.2504299
## b_0.05 b_0.1 b_0.2 b_0.5 b_1
## IND1000 -1.44642634 -1.3367114 -0.9957876 -1.0338984 -1.0412243
## IND1001 -0.42121050 -0.3408688 -0.5442877 -0.5608317 -0.6144810
## IND1002 -1.43573292 -1.4554880 -1.2180945 -0.9258928 -1.0048697
## IND1003 1.60900936 1.8571193 1.9642979 2.1414632 2.1195070
## IND1004 2.30023211 2.7352337 2.3905093 2.2214144 2.1771254
## IND1005 0.05146917 -0.3235978 -0.2582344 -0.6457216 -0.6129487
cor(pgs)
## b b_1e.05 b_1e.04 b_0.001 b_0.005 b_0.01 b_0.05
## b 1.0000000 0.1493624 0.2588749 0.4203741 0.6168561 0.6998325 0.8864956
## b_1e.05 0.1493624 1.0000000 0.6153340 0.3666984 0.2395178 0.2233771 0.1747886
## b_1e.04 0.2588749 0.6153340 1.0000000 0.6199209 0.4103268 0.3668911 0.2924481
## b_0.001 0.4203741 0.3666984 0.6199209 1.0000000 0.6772010 0.5925342 0.4758374
## b_0.005 0.6168561 0.2395178 0.4103268 0.6772010 1.0000000 0.8721669 0.6964916
## b_0.01 0.6998325 0.2233771 0.3668911 0.5925342 0.8721669 1.0000000 0.7920971
## b_0.05 0.8864956 0.1747886 0.2924481 0.4758374 0.6964916 0.7920971 1.0000000
## b_0.1 0.9344395 0.1648044 0.2760368 0.4477530 0.6591185 0.7503000 0.9462316
## b_0.2 0.9711219 0.1516116 0.2646875 0.4314890 0.6356136 0.7226863 0.9124163
## b_0.5 0.9956874 0.1498501 0.2593980 0.4217727 0.6188635 0.7028113 0.8905765
## b_1 1.0000000 0.1493624 0.2588749 0.4203741 0.6168561 0.6998325 0.8864956
## b_0.1 b_0.2 b_0.5 b_1
## b 0.9344395 0.9711219 0.9956874 1.0000000
## b_1e.05 0.1648044 0.1516116 0.1498501 0.1493624
## b_1e.04 0.2760368 0.2646875 0.2593980 0.2588749
## b_0.001 0.4477530 0.4314890 0.4217727 0.4203741
## b_0.005 0.6591185 0.6356136 0.6188635 0.6168561
## b_0.01 0.7503000 0.7226863 0.7028113 0.6998325
## b_0.05 0.9462316 0.9124163 0.8905765 0.8864956
## b_0.1 1.0000000 0.9625530 0.9381768 0.9344395
## b_0.2 0.9625530 1.0000000 0.9751100 0.9711219
## b_0.5 0.9381768 0.9751100 1.0000000 0.9956874
## b_1 0.9344395 0.9711219 0.9956874 1.0000000
layout(matrix(1:4,ncol=2, byrow=TRUE))
hist(pgs[,"b"], main="PGS raw", las=1, col="lightblue", xlab="PGS - all individuals")
hist(pgs[,"b_0.001"], main="PGS P: 0.001", las=1, col="lightblue", xlab="PGS - all individuals")
hist(pgs[valid,"b"], main="PGS raw", las=1, col="lightblue", xlab="PGS - validation set")
hist(pgs[valid,"b_0.001"], main="PGS P: 0.001", las=1, col="lightblue", xlab="PGS - validation set")
The \(P\)-value threshold that provides the “best-fit” PGS under the C+T method is usually unknown. To approximate the “best-fit” PGS, we can regress the phenotype on the PGS calculated at a range of \(P\)-value thresholds and then select the PGS that explains the highest proportion of phenotypic variance (e.g. \(R^2\)) or has the highest AUC. This can be achieved using the acc() function as follows:
paCT <- acc(yobs=y[valid], ypred=pgs[valid,], typeoftrait="binary")
paCT
## Corr R2 Nagel R2 AUC intercept slope MSPE
## b 0.109 0.012 0.018 0.573 0.249 0.139 0.333
## b_1e.05 0.067 0.004 0.007 0.545 0.248 0.029 1.189
## b_1e.04 0.097 0.009 0.014 0.570 0.249 0.044 1.100
## b_0.001 0.158 0.025 0.037 0.603 0.247 0.074 0.956
## b_0.005 0.096 0.009 0.014 0.562 0.248 0.051 0.840
## b_0.01 0.122 0.015 0.022 0.578 0.248 0.070 0.738
## b_0.05 0.098 0.010 0.014 0.566 0.248 0.078 0.495
## b_0.1 0.101 0.010 0.015 0.567 0.248 0.093 0.428
## b_0.2 0.105 0.011 0.017 0.572 0.248 0.113 0.374
## b_0.5 0.119 0.014 0.021 0.577 0.249 0.147 0.337
## b_1 0.109 0.012 0.018 0.573 0.249 0.139 0.333
par(xpd=T)
bp <- barplot(paCT[-1,"Corr"], ylab="Correlation (y,PGS)", xlab="P-value cutoff", las=1, cex.axis=.8,col=rev(colorRampPalette(c("#FFFFCCFF", "#FF2A00FF"))(nrow(paCT)-1)),names=NA)
text(x=bp,srt=45,adj=1,y=-.005,labels=gsub("b_","",rownames(paCT)[-1]),cex=.8)
In comparison, we can see how strongly correlated the PGS is with \(y\) for those individuals we utilised within the GWAS.
paCT <- acc(yobs=y[train], ypred=pgs[train,], typeoftrait="binary")
par(xpd=T)
bp <- barplot(paCT[-1,"Corr"], ylab="Correlation (y,PGS)", xlab="P-value cutoff", las=1, cex.axis=.8,col=rev(colorRampPalette(c("#FFFFCCFF", "#FF2A00FF"))(nrow(paCT)-1)),names=NA)
text(x=bp,srt=45,adj=1,y=-.05,labels=gsub("b_","",rownames(paCT)[-1]),cex=.8)
For visualization, the PGS can be divided into groups (e.g., deciles), and the disease prevalence within each group computed.
yobs <- y[valid]
ypred <- pgs[names(y[valid]),which.max(paCT[,"AUC"])]
nbin <- 10
qsets <- qgg:::splitWithOverlap( names(ypred)[order(ypred)],length(ypred)/nbin,0)
qy <- sapply(qsets,function(x){mean(yobs[x])})
qg <- sapply(qsets,function(x){mean(ypred[x])})
colfunc <- colorRampPalette(c("lightblue", "darkblue"))
layout(matrix(c(1,2),ncol=2),heights=c(4),widths=c(4,4))
plot(y=qy,x=qg,pch=19,ylab="Proportion of cases",xlab="Mean PGS", col=colfunc(nbin), frame.plot=FALSE, las=1,cex.axis=.8)
plot(y=qy,x=(1:nbin)/nbin,pch=19,ylab="Proportion of cases",xlab="Percentile of PGS", col=colfunc(nbin), frame.plot=FALSE, las=1,cex.axis=.8)
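As a simple summary of the stratification (a minimal sketch based on the quantities computed above), the prevalence in the top PGS decile can be compared with the prevalence in the bottom decile:
qy[nbin]/qy[1]    # ratio of disease prevalence: top versus bottom PGS decile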