The aim of this practical is to provide a simple introduction to computing polygenic scores (PGS) of complex traits and diseases. The practical is a mix of theoretical and practical exercises in R that illustrate and apply the theory presented in the corresponding lecture on PGS.
This practical provides a step-by-step guide to performing basic PGS analyses, including preparation of phenotype and genotype data, computation of GWAS summary statistics, clumping and thresholding, and evaluation of prediction accuracy.
Polygenic scores combine information from large numbers of markers across the genome (hundreds to millions) to give a single numerical score for an individual’s relative genetic predisposition for a specific disease or trait on the basis of the DNA variants they have inherited.
For a particular disease or trait a PGS is calculated as: \[PGS=\sum_{i=1}^mX_i \hat{b}_i\] where \(X_i\) is the genotype (allele count) for the i’th genetic marker, \(\hat{b}_i\) its estimated weight, and \(m\) the number of markers.
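As a small numeric illustration of this formula (a minimal sketch with made-up genotypes and marker weights, not taken from the data used later in this practical), each individual's PGS is simply the weighted sum of its allele counts:
# Toy example of the PGS formula (all values below are assumed)
set.seed(1)
X <- matrix(sample(0:2, 5*4, replace=TRUE), nrow=5,
            dimnames=list(paste0("IND",1:5), paste0("SNP",1:4)))  # allele counts (0/1/2)
b <- c(0.2, -0.1, 0.05, 0.3)                                      # assumed marker weights
pgs_toy <- as.vector(X %*% b)                                     # PGS = sum over markers of X*b
names(pgs_toy) <- rownames(X)
pgs_toy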
Genomic prediction has been used for many years in animal and plant breeding (e.g., Meuwissen et al. 2001), and it has gained popularity during the last decade.
Polygenic risk score, polygenic score, genomic risk score, genetic score, genetic predisposition, genetic value, and genomic breeding value are (more or less) the same thing.
For many complex traits and diseases there will be thousands of genetic variants that each contribute a small effect to the disease risk or quantitative trait. Rare variants with large effects will only explain a small proportion of \(h^2\) (low predictive potential), whereas common variants with small effects can explain a larger proportion of \(h^2\) (high predictive potential). The majority of complex traits and common diseases in humans are heritable. The heritability determines the value of using genetics for risk prediction. In general, large data sets are required to obtain accurate marker estimates of small to moderate effects, which also improves the prediction accuracy.
The heritability (\(h^2\)) quantifies
the degree of variation in a phenotypic trait in a population that is
due to genetic variation between individuals in that population. It
measures how much of the variation of a trait can be attributed to
variation of genetic factors, as opposed to variation of environmental
factors. The narrow sense heritability is the ratio of additive genetic
variance (\(\sigma^2_{a}\)) to the
overall phenotypic variance (\(\sigma^2_{y}=\sigma^2_{a}+\sigma^2_{e}\)):
\[\begin{align}
h^2 &= \sigma^2_{a}/(\sigma^2_a+\sigma^2_e)
\end{align}\] A heritability of 0 implies that no genetic effects
influence the observed variation in the trait, while a heritability of 1
implies that all of the variation in the trait is explained by the
genetic effects. In general, the amount of information provided by the
phenotype about the genetic risk is determined by the heritability. Note
that heritability is population-specific and a heritability of 0 does
not necessarily imply that there is no genetic determinism for the
trait.
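As a simple numeric illustration (the variance components below are made-up values, not estimates from the data used in this practical):
# Toy example: narrow sense heritability from assumed variance components
sigma2_a <- 0.4                        # additive genetic variance (assumed)
sigma2_e <- 0.6                        # environmental variance (assumed)
h2 <- sigma2_a / (sigma2_a + sigma2_e) # h2 = 0.4
h2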
qgg R package
The practical is based on the R package qgg (Rohde et al. 2021, 2022). This package provides an infrastructure for efficient processing of large-scale genetic and phenotypic data, including core functions for estimating genetic parameters, single-marker association analyses, genomic prediction, and polygenic scoring. qgg handles large-scale data by taking advantage of multi-core processing and fast, memory-efficient batch processing of genotype data stored in binary (PLINK) files.
You can install qgg from CRAN with:
install.packages("qgg")
Input data used by the qgg package
All functions in qgg used for analysis of complex traits rely on a simple data infrastructure that takes the following main input:
y: vector, matrix or list of phenotypes
X: design matrix for non-genetic factors
W: matrix of centered and scaled genotypes (in memory)
Glist: list structure providing information on genotypes, sparse LD, and LD scores (on disk)
stat: data frame with marker summary statistics
sets: list of sets with marker ids
ids: vector of ids of individuals
rsids: vector of marker ids
In this practical we will compute PGS based on simulated data. The data consist of a disease phenotype, covariates, and SNP data. The data used in this practical are intended for demonstration purposes only.
library(data.table)
dir.create("./PGS")
setwd("./PGS")
Genetic data are commonly stored in a binary format (as used by the software PLINK), named .bed files. These files must be accompanied by .bim (information about the genetic variants) and .fam (information about the individuals) files. Read more about these file formats in the PLINK documentation.
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.bed"
download.file( url=url, mode = "wb", destfile="human.bed")
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.bim"
download.file( url=url, destfile="human.bim")
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.fam"
download.file( url=url, destfile="human.fam")
Note the mode="wb" argument when downloading the human.bed file. This is needed because otherwise the bed-file will be corrupted, and a corrupted data file can cause errors in the analyses.
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.pheno"
download.file( url=url, destfile="human.pheno")
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.covar"
download.file( url=url, destfile="human.covar")
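Before proceeding, it can be useful to verify that the downloaded files exist and are not empty (a minimal sanity check using base R; the expected file sizes are not stated here):
files <- c("human.bed", "human.bim", "human.fam", "human.pheno", "human.covar")
file.exists(files)   # should all be TRUE
file.size(files)     # should all be greater than zero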
One of the first things to do is to prepare the phenotypic data used in the analysis. The goal is to understand the variables: how many records the data set contains, how many values are missing, what the variable structure is, what the variable relationships are, and more.
Several functions can be used for this (e.g., str(), head(), dim(), table(), is.na()).
library(data.table)
pheno <- fread(input="human.pheno",
data.table=FALSE)
covar <- fread(input="human.covar",
data.table=FALSE)
To get an overview of the data set you are working with, you can use the str() or head() functions:
str(pheno)
## 'data.frame': 5000 obs. of 3 variables:
## $ V1: chr "IND1" "IND2" "IND3" "IND4" ...
## $ V2: chr "IND1" "IND2" "IND3" "IND4" ...
## $ V3: int 0 0 1 0 0 1 0 0 1 1 ...
str(covar)
## 'data.frame': 5000 obs. of 14 variables:
## $ V1 : chr "IND1000" "IND1001" "IND1002" "IND1003" ...
## $ V2 : chr "IND1000" "IND1001" "IND1002" "IND1003" ...
## $ V3 : int 1 0 0 0 0 1 1 1 1 0 ...
## $ V4 : num 61.2 62.4 65.6 55.4 64.1 ...
## $ V5 : num 0.00488 -0.0063 -0.00522 0.02028 -0.00411 ...
## $ V6 : num -0.002097 0.000148 -0.000606 0.006778 0.004349 ...
## $ V7 : num 0.01091 0.00469 0.01709 -0.00306 -0.00839 ...
## $ V8 : num 0.0056 0.01611 -0.00442 -0.01284 0.01022 ...
## $ V9 : num -0.008546 -0.00262 0.034218 -0.003078 -0.000912 ...
## $ V10: num 0.019736 0.000308 -0.005157 0.012835 0.004788 ...
## $ V11: num -0.00408 0.00961 -0.00289 -0.01387 0.00977 ...
## $ V12: num 0.01435 -0.01479 -0.00901 0.01499 0.00899 ...
## $ V13: num -0.00152 -0.00708 0.00706 0.00474 -0.00381 ...
## $ V14: num 0.003998 0.037763 -0.000359 -0.022958 -0.010249 ...
head(pheno)
## V1 V2 V3
## 1 IND1 IND1 0
## 2 IND2 IND2 0
## 3 IND3 IND3 1
## 4 IND4 IND4 0
## 5 IND5 IND5 0
## 6 IND6 IND6 1
head(covar)
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 IND1000 IND1000 1 61.24445 0.004881 -0.002097 0.010908 0.005597 -0.008546
## 2 IND1001 IND1001 0 62.42271 -0.006301 0.000148 0.004692 0.016107 -0.002620
## 3 IND1002 IND1002 0 65.64274 -0.005215 -0.000606 0.017091 -0.004416 0.034218
## 4 IND1003 IND1003 0 55.40703 0.020284 0.006778 -0.003061 -0.012837 -0.003078
## 5 IND1004 IND1004 0 64.10666 -0.004105 0.004349 -0.008388 0.010221 -0.000912
## 6 IND1005 IND1005 1 65.77823 -0.010636 -0.021853 -0.017450 0.010058 0.000502
## V10 V11 V12 V13 V14
## 1 0.019736 -0.004078 0.014350 -0.001518 0.003998
## 2 0.000308 0.009612 -0.014793 -0.007077 0.037763
## 3 -0.005157 -0.002894 -0.009013 0.007064 -0.000359
## 4 0.012835 -0.013872 0.014987 0.004744 -0.022958
## 5 0.004788 0.009773 0.008988 -0.003814 -0.010249
## 6 0.003807 -0.003007 0.005646 -0.007199 -0.009874
Define the response variable
y <- pheno[,3]
names(y) <- pheno[,1]
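A quick check of the response variable, using some of the functions mentioned above, counts cases and controls and looks for missing values:
table(y)          # number of controls (0) and cases (1)
sum(is.na(y))     # number of missing phenotype records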
Use the histogram and boxplot functions to visualize the distribution of the trait/covariables:
layout(matrix(c(1,2),ncol=2),widths=c(4,4),heights=4)
hist(y,las=1,cex.axis=.8,col="lightblue", ylab="Number of observations")
# align covariate rows to the phenotype order before plotting
boxplot(covar[match(names(y), covar$V1),4]~y, las=1,cex.axis=.8,col="lightblue",ylab="Values for covariate no 4")
Exploratory data analysis is the process of analyzing and visualizing the data to get a better understanding of them; it is not a formal statistical test. Which factors should we include in the statistical model? To answer this question we can fit a logistic regression model that includes these factors. This can be done using the glm() function:
fit <- glm( y ~ V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14,
data=covar, family=binomial(link="logit"))
summary(fit)
##
## Call:
## glm(formula = y ~ V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 +
## V12 + V13 + V14, family = binomial(link = "logit"), data = covar)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.181957 0.214978 -5.498 3.84e-08 ***
## V3 -0.011066 0.065316 -0.169 0.8655
## V4 0.001740 0.003818 0.456 0.6487
## V5 1.309941 2.306100 0.568 0.5700
## V6 -0.280830 2.307468 -0.122 0.9031
## V7 2.354268 2.306546 1.021 0.3074
## V8 -1.349941 2.306494 -0.585 0.5584
## V9 5.520940 2.311308 2.389 0.0169 *
## V10 -2.724333 2.307482 -1.181 0.2377
## V11 -1.929366 2.307348 -0.836 0.4031
## V12 0.578919 2.308375 0.251 0.8020
## V13 4.382604 2.307287 1.899 0.0575 .
## V14 -0.504532 2.306778 -0.219 0.8269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5645.2 on 4999 degrees of freedom
## Residual deviance: 5631.6 on 4987 degrees of freedom
## AIC: 5657.6
##
## Number of Fisher Scoring iterations: 4
The exploration (including quality control) of phenotypes and
covariables is a key step in quantitative genetic analyses. It is,
however, beyond the scope of this practical.
The preparation (including quality control) of genotype data is a key step in quantitative genetic analyses.
library(qgg)
The function gprep()
reads genotype information from
binary PLINK files, and creates the Glist
object that
contains general information about the genotypes:
bedfiles <- "human.bed"
bimfiles <- "human.bim"
famfiles <- "human.fam"
Glist <- gprep(study="Example",
bedfiles=bedfiles,
bimfiles=bimfiles,
famfiles=famfiles)
saveRDS(Glist, file="Glist.RDS", compress=FALSE)
Glist <- readRDS("Glist.RDS")
The output from gprep()
(Glist
) has a list
structure that contains information about the genotypes in the binary
file. Glist
is required for downstream analyses provided in
the qgg package. Typically, the Glist
is prepared once, and
saved as an *.RDS-file. To explore the content of the Glist
object:
names(Glist)
## [1] "study" "bedfiles" "bimfiles" "famfiles" "ids" "n"
## [7] "rsids" "mchr" "a1" "a2" "pos" "chr"
## [13] "cpra" "map" "nmiss" "af" "af1" "af2"
## [19] "maf" "hom" "het" "n0" "n1" "n2"
## [25] "nchr"
str(Glist)
## List of 25
## $ study : chr "Example"
## $ bedfiles: chr "human.bed"
## $ bimfiles: chr "human.bim"
## $ famfiles: chr "human.fam"
## $ ids : chr [1:5000] "IND1000" "IND1001" "IND1002" "IND1003" ...
## $ n : int 5000
## $ rsids :List of 1
## ..$ : chr [1:50000] "1:568670" "1:761732" "1:830791" "1:867663" ...
## $ mchr : int 50000
## $ a1 :List of 1
## ..$ : chr [1:50000] "C" "C" "T" "T" ...
## $ a2 :List of 1
## ..$ : chr [1:50000] "T" "T" "C" "C" ...
## $ pos :List of 1
## ..$ : num [1:50000] 568670 761732 830791 867663 916549 ...
## $ chr :List of 1
## ..$ : chr [1:50000] "1" "1" "1" "1" ...
## $ cpra :List of 1
## ..$ : chr [1:50000] "1_568670_C_T" "1_761732_C_T" "1_830791_T_C" "1_867663_T_C" ...
## $ map :List of 1
## ..$ : num [1:50000] 0.000384 0.492847 0.620827 0.620827 0.69188 ...
## $ nmiss :List of 1
## ..$ : int [1:50000] 0 0 0 0 0 0 0 0 0 0 ...
## $ af :List of 1
## ..$ : num [1:50000] 0.103 0.355 0.264 0.495 0.475 ...
## $ af1 :List of 1
## ..$ : num [1:50000] 0.103 0.355 0.264 0.495 0.475 ...
## $ af2 :List of 1
## ..$ : num [1:50000] 0.897 0.645 0.736 0.505 0.525 ...
## $ maf :List of 1
## ..$ : num [1:50000] 0.103 0.355 0.264 0.495 0.475 ...
## $ hom :List of 1
## ..$ : num [1:50000] 0.812 0.539 0.61 0.496 0.498 ...
## $ het :List of 1
## ..$ : num [1:50000] 0.188 0.461 0.39 0.504 0.502 ...
## $ n0 :List of 1
## ..$ : int [1:50000] 4015 2071 2704 1266 1371 2305 4983 4229 4816 2667 ...
## $ n1 :List of 1
## ..$ : int [1:50000] 942 2303 1948 2518 2510 2169 17 739 180 1988 ...
## $ n2 :List of 1
## ..$ : int [1:50000] 43 626 348 1216 1119 526 0 32 4 345 ...
## $ nchr : int 1
In general it is advisable to perform quality control of the genotype data. The quality control includes removing markers with a low genotyping rate, low minor allele frequency, or deviation from Hardy-Weinberg equilibrium. The function gfilter() can be used for filtering markers:
rsids <- gfilter( Glist = Glist,
excludeMAF=0.05,
excludeMISS=0.05,
excludeCGAT=TRUE,
excludeINDEL=TRUE,
excludeDUPS=TRUE,
excludeHWE=1e-12,
excludeMHC=FALSE)
The gfilter() function outputs the number of variants removed in the different quality control steps.
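The markers that passed the quality control are returned as a vector of marker ids (assigned to rsids above). A quick check of how many markers are retained:
length(rsids)     # number of markers retained after filtering
head(rsids)       # first few retained marker ids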
One of the first steps in PGS analyses is to generate or obtain GWAS summary statistics. Ideally these will correspond to the most powerful GWAS results available for the phenotype under study. In this example, we will use a GWAS on the simulated disease phenotype. We will use only a subset of the data (training data) in the GWAS and the remaining subset (validation data) to assess the accuracy of the polygenic scores. In the example below we only compute summary statistics for the markers that fulfil the quality control criteria.
y <- pheno[,3]
names(y) <- pheno[,1]
X <- model.matrix(~V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14, data=covar)
rownames(X) <- covar$V1
X <- X[names(y),]
sum(names(y)%in%rownames(X))
## [1] 5000
train <- sample(names(y),4000)
valid <- names(y)[!names(y)%in%train]
The function glma can be used for computing GWAS summary statistics. Currently this function only fits a simple linear regression model, but we plan to add further modeling approaches in a future release.
stat <- glma(y=y[train], X=X[train,], Glist=Glist)
Explore the output (stat) from the glma function:
dim(stat)
## [1] 50000 13
head(stat)
## rsids chr pos ea nea eaf b seb
## 1:568670 1:568670 1 568670 C T 0.1028 4.451442e-03 0.007045907
## 1:761732 1:761732 1 761732 C T 0.3555 2.485590e-03 0.006863325
## 1:830791 1:830791 1 830791 T C 0.2644 6.439566e-05 0.006886717
## 1:867663 1:867663 1 867663 T C 0.4950 8.794431e-04 0.006886269
## 1:916549 1:916549 1 916549 A G 0.4748 6.634862e-03 0.006894612
## 1:956852 1:956852 1 956852 C T 0.3221 -7.525315e-03 0.006904714
## stat p n ww wy
## 1:568670 0.631777029 0.5275687 3998 3801.177 16.920717
## 1:761732 0.362155351 0.7172551 3998 4006.377 9.958209
## 1:830791 0.009350705 0.9925398 3998 3979.337 0.256252
## 1:867663 0.127709653 0.8983852 3998 3979.838 3.500041
## 1:916549 0.962325705 0.3359442 3998 3969.310 26.335821
## 1:956852 -1.089880816 0.2758313 3998 3957.444 -29.781015
Polygenic scoring based on summary statistics require the construction of a reference linkage disequilibrium (LD) correlation matrix. The LD matrix corresponds to the correlation between the genotypes of genetic variants across the genome. Here we use a sparse LD matrix approach using a fixed window approach (e.g. number of markers, 1 cM or 1000kb), which sets LD correlation values outside this window to zero.
The function gprep can be used to compute sparse LD matrices, which are stored on disk. The \(r^2\) metric used is the pairwise correlation between markers (based on the allele count of the alternative allele) in a specified region of the genome. Although this step can be slow unless R is linked to a fast BLAS library, it is typically only done once (or a few times).
ldfiles <- "human.ld"
Glist <- gprep( Glist,
task="sparseld",
msize=1000,
rsids=rsids,
ldfiles=ldfiles,
overwrite=TRUE)
saveRDS(Glist, file="Glist_sparseLD_1k.RDS", compress=FALSE)
Note, it may take a long time to compute the LD matrices; therefore, you can download the ld-file and the updated Glist below. Remember to read the new Glist into R.
url <- "https://www.dropbox.com/scl/fi/5g9i631gq67wunk6njptr/human.ld?rlkey=00yk4oev7drk33syp6cjjke8j&dl=0"
download.file( url=url, destfile="human.ld")
url <- "https://www.dropbox.com/scl/fi/r23vhgt5okw37gohz71vu/Glist_sparseLD_1k.RDS?rlkey=ktv1p427kct9j13sydnn5k2a3&dl=0"
download.file( url=url, destfile="Glist_sparseLD_1k.RDS")
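After downloading, read the updated Glist into R so that it replaces the object created above:
Glist <- readRDS("Glist_sparseLD_1k.RDS")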
Polygenic scoring using clumping and thresholding (C+T) is a relatively simple and robust method. Linkage disequilibrium makes it challenging to identify the contribution from causal independent genetic variants. One way of approximately capturing the right level of causal signal is to perform clumping, which removes markers so that only weakly correlated SNPs are retained, preferentially keeping the SNPs most strongly associated with the phenotype under study. The clumping procedure uses a statistic (usually the \(P\)-value) to sort the markers by importance (e.g. keeping the most significant ones). It takes the first one (e.g. the most significant marker) and removes markers (i.e. sets their effect to zero) if they are too correlated (e.g. \(r^2>0.9\)) with this one in a window around it. As opposed to pruning, this procedure makes sure that this marker is never removed, keeping at least one representative marker per region of the genome. Then it goes on with the next most significant marker that has not yet been removed.
Clumping can be performed using the adjStat() function in qgg. The input to the function is the summary statistic (stat), information about the sparse LD matrices stored in the Glist, a threshold of linkage disequilibrium (e.g. \(r^2=0.9\)), and thresholds for the \(P\)-values (threshold = c(0.001, 0.05, ...)):
threshold <- c(0.00001, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5,1)
statAdj <- adjStat(Glist=Glist, stat=stat, r2=0.9, threshold=threshold)
Explore the output (statAdj) using the head() function:
head(statAdj)
## rsids chr pos ea nea eaf b b_1e.05 b_1e.04
## 1:568670 1:568670 1 568670 C T 0.1028 4.451442e-03 0 0
## 1:761732 1:761732 1 761732 C T 0.3555 2.485590e-03 0 0
## 1:830791 1:830791 1 830791 T C 0.2644 6.439566e-05 0 0
## 1:867663 1:867663 1 867663 T C 0.4950 8.794431e-04 0 0
## 1:916549 1:916549 1 916549 A G 0.4748 6.634862e-03 0 0
## 1:956852 1:956852 1 956852 C T 0.3221 -7.525315e-03 0 0
## b_0.001 b_0.005 b_0.01 b_0.05 b_0.1 b_0.2 b_0.5 b_1
## 1:568670 0 0 0 0 0 0 0.000000000 4.451442e-03
## 1:761732 0 0 0 0 0 0 0.000000000 2.485590e-03
## 1:830791 0 0 0 0 0 0 0.000000000 6.439566e-05
## 1:867663 0 0 0 0 0 0 0.000000000 8.794431e-04
## 1:916549 0 0 0 0 0 0 0.006634862 6.634862e-03
## 1:956852 0 0 0 0 0 0 -0.007525315 -7.525315e-03
A plot of the unadjusted marker effects (from the stat data frame) against the adjusted marker effects (from the statAdj data frame) illustrates that the C+T procedure keeps only the most significant marker effects and sets a large number of marker effects to zero (i.e. removes their effect).
plot( y=statAdj[rownames(stat),"b_0.001"], bg="lightblue",pch=21,las=1, cex.axis=.8,
x=stat$b,
xlab="Marginal Effect",
ylab="Adjusted Effect",
frame.plot=FALSE, ylim=c(-0.05,0.05), xlim=c(-0.05,0.05),
main="Shrinkage using C+T \n (p=0.001, r2=0.9)")
For each of the \(P\)-value thresholds chosen in the C+T procedure a PGS is computed as: \[PGS=\sum_{i=1}^mX_i \hat{b}_i\] where \(X_i\) is the genotype (allele count) for the i’th genetic marker and \(\hat{b}_i\) its adjusted weight. The PGS are computed using the gscore() function. The input to the function is the adjusted summary statistics (statAdj) and the information about the genotypes stored in the Glist:
pgs <- gscore(Glist=Glist,stat=statAdj)
It is always important to explore the PGS computed.
head(pgs)
## b b_1e.05 b_1e.04 b_0.001 b_0.005 b_0.01
## IND1000 -1.0412243 -0.3288854 -0.875368300 -1.95246475 -1.3821925 -1.1443251
## IND1001 -0.6144810 -0.4486602 -1.015548505 0.05018968 -0.1643173 0.3441776
## IND1002 -1.0048697 -1.0522848 -1.075989028 -1.55637597 -1.7488470 -1.6067176
## IND1003 2.1195070 -0.4486602 1.927681199 1.23181559 1.1651841 1.0668963
## IND1004 2.1771254 0.5329041 -0.389215989 1.95246583 2.0914713 1.7115670
## IND1005 -0.6129487 -1.0522848 -0.005249977 1.00026982 0.4433743 0.2504299
## b_0.05 b_0.1 b_0.2 b_0.5 b_1
## IND1000 -1.44642634 -1.3367114 -0.9957876 -1.0338984 -1.0412243
## IND1001 -0.42121050 -0.3408688 -0.5442877 -0.5608317 -0.6144810
## IND1002 -1.43573292 -1.4554880 -1.2180945 -0.9258928 -1.0048697
## IND1003 1.60900936 1.8571193 1.9642979 2.1414632 2.1195070
## IND1004 2.30023211 2.7352337 2.3905093 2.2214144 2.1771254
## IND1005 0.05146917 -0.3235978 -0.2582344 -0.6457216 -0.6129487
cor(pgs)
## b b_1e.05 b_1e.04 b_0.001 b_0.005 b_0.01 b_0.05
## b 1.0000000 0.1493624 0.2588749 0.4203741 0.6168561 0.6998325 0.8864956
## b_1e.05 0.1493624 1.0000000 0.6153340 0.3666984 0.2395178 0.2233771 0.1747886
## b_1e.04 0.2588749 0.6153340 1.0000000 0.6199209 0.4103268 0.3668911 0.2924481
## b_0.001 0.4203741 0.3666984 0.6199209 1.0000000 0.6772010 0.5925342 0.4758374
## b_0.005 0.6168561 0.2395178 0.4103268 0.6772010 1.0000000 0.8721669 0.6964916
## b_0.01 0.6998325 0.2233771 0.3668911 0.5925342 0.8721669 1.0000000 0.7920971
## b_0.05 0.8864956 0.1747886 0.2924481 0.4758374 0.6964916 0.7920971 1.0000000
## b_0.1 0.9344395 0.1648044 0.2760368 0.4477530 0.6591185 0.7503000 0.9462316
## b_0.2 0.9711219 0.1516116 0.2646875 0.4314890 0.6356136 0.7226863 0.9124163
## b_0.5 0.9956874 0.1498501 0.2593980 0.4217727 0.6188635 0.7028113 0.8905765
## b_1 1.0000000 0.1493624 0.2588749 0.4203741 0.6168561 0.6998325 0.8864956
## b_0.1 b_0.2 b_0.5 b_1
## b 0.9344395 0.9711219 0.9956874 1.0000000
## b_1e.05 0.1648044 0.1516116 0.1498501 0.1493624
## b_1e.04 0.2760368 0.2646875 0.2593980 0.2588749
## b_0.001 0.4477530 0.4314890 0.4217727 0.4203741
## b_0.005 0.6591185 0.6356136 0.6188635 0.6168561
## b_0.01 0.7503000 0.7226863 0.7028113 0.6998325
## b_0.05 0.9462316 0.9124163 0.8905765 0.8864956
## b_0.1 1.0000000 0.9625530 0.9381768 0.9344395
## b_0.2 0.9625530 1.0000000 0.9751100 0.9711219
## b_0.5 0.9381768 0.9751100 1.0000000 0.9956874
## b_1 0.9344395 0.9711219 0.9956874 1.0000000
layout(matrix(1:4,ncol=2, byrow=TRUE))
hist(pgs[,"b"], main="PGS raw", las=1, col="lightblue", xlab="PGS - all individuals")
hist(pgs[,"b_0.001"], main="PGS P: 0.001", las=1, col="lightblue", xlab="PGS - all individuals")
hist(pgs[valid,"b"], main="PGS raw", las=1, col="lightblue", xlab="PGS - validation set")
hist(pgs[valid,"b_0.001"], main="PGS P: 0.001", las=1, col="lightblue", xlab="PGS - validation set")
The \(P\)-value threshold that provides the “best-fit” PGS under the C+T method is usually unknown. To approximate the “best-fit” PGS, we can regress the phenotype on the PGS calculated at a range of \(P\)-value thresholds and then select the PGS that explains the highest proportion of phenotypic variance (e.g. \(R^2\)) or has the highest AUC. This can be achieved using the acc() function as follows:
paCT <- acc(yobs=y[valid], ypred=pgs[valid,], typeoftrait="binary")
paCT
## Corr R2 Nagel R2 AUC intercept slope MSPE
## b 0.109 0.012 0.018 0.573 0.249 0.139 0.333
## b_1e.05 0.067 0.004 0.007 0.545 0.248 0.029 1.189
## b_1e.04 0.097 0.009 0.014 0.570 0.249 0.044 1.100
## b_0.001 0.158 0.025 0.037 0.603 0.247 0.074 0.956
## b_0.005 0.096 0.009 0.014 0.562 0.248 0.051 0.840
## b_0.01 0.122 0.015 0.022 0.578 0.248 0.070 0.738
## b_0.05 0.098 0.010 0.014 0.566 0.248 0.078 0.495
## b_0.1 0.101 0.010 0.015 0.567 0.248 0.093 0.428
## b_0.2 0.105 0.011 0.017 0.572 0.248 0.113 0.374
## b_0.5 0.119 0.014 0.021 0.577 0.249 0.147 0.337
## b_1 0.109 0.012 0.018 0.573 0.249 0.139 0.333
par(xpd=T)
bp <- barplot(paCT[-1,"Corr"], ylab="Correlation (y,PGS)", xlab="P-value cutoff", las=1, cex.axis=.8,col=rev(colorRampPalette(c("#FFFFCCFF", "#FF2A00FF"))(nrow(paCT)-1)),names=NA)
text(x=bp,srt=45,adj=1,y=-.005,labels=gsub("b_","",rownames(paCT)[-1]),cex=.8)
In comparison, we can see how strongly correlated the PGS is with \(y\) for those individuals we utilised within the GWAS.
paCT <- acc(yobs=y[train], ypred=pgs[train,], typeoftrait="binary")
par(xpd=T)
bp <- barplot(paCT[-1,"Corr"], ylab="Correlation (y,PGS)", xlab="P-value cutoff", las=1, cex.axis=.8,col=rev(colorRampPalette(c("#FFFFCCFF", "#FF2A00FF"))(nrow(paCT)-1)),names=NA)
text(x=bp,srt=45,adj=1,y=-.05,labels=gsub("b_","",rownames(paCT)[-1]),cex=.8)
For visualization, the PGS can be divided into groups (e.g., deciles), and the disease prevalence within each group computed.
yobs <- y[valid]
ypred <- pgs[names(y[valid]),which.max(paCT[,"AUC"])]
nbin <- 10
qsets <- qgg:::splitWithOverlap( names(ypred)[order(ypred)],length(ypred)/nbin,0)
qy <- sapply(qsets,function(x){mean(yobs[x])})
qg <- sapply(qsets,function(x){mean(ypred[x])})
colfunc <- colorRampPalette(c("lightblue", "darkblue"))
layout(matrix(c(1,2),ncol=2),heights=c(4),widths=c(4,4))
plot(y=qy,x=qg,pch=19,ylab="Proportion of cases",xlab="Mean PGS", col=colfunc(nbin), frame.plot=FALSE, las=1,cex.axis=.8)
plot(y=qy,x=(1:nbin)/nbin,pch=19,ylab="Proportion of cases",xlab="Percentile of PGS", col=colfunc(nbin), frame.plot=FALSE, las=1,cex.axis=.8)
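As a simple summary of the stratification (a minimal sketch based on the quantities computed above), the prevalence in the top PGS decile can be compared with the prevalence in the bottom decile:
qy[nbin]/qy[1]    # ratio of disease prevalence: top versus bottom PGS decile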