The aim of this practical is to provide a simple introduction to computing polygenic scores (PGS) for complex traits and diseases. The practical mixes theoretical and practical exercises in R that illustrate and apply the theory presented in the corresponding lecture on PGS.

This practical provides a step-by-step guide to performing basic PGS analyses, organized into six sessions.

Polygenic scores

Polygenic scores combine information from large numbers of markers across the genome (hundreds to millions) to give a single numerical score for an individual’s relative genetic predisposition for a specific disease or trait on the basis of the DNA variants they have inherited.

For a particular disease or trait a PGS is calculated as: \[PGS=\sum_{i=1}^mX_i \hat{b}_i\] where \(X_i\) is the genotype of the i’th genetic marker, \(\hat{b}_i\) is its estimated effect (weight), and \(m\) is the number of markers.
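As a toy illustration of this formula (hypothetical genotypes and weights), the score for one individual is simply the weighted sum of its allele counts:

x    <- c(0, 1, 2, 1, 0)                   # allele counts (genotype vector), m = 5 markers
bhat <- c(0.02, -0.01, 0.03, 0.00, 0.01)   # estimated marker weights
sum(x * bhat)                              # PGS = sum_i x_i * bhat_i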

Genomic prediction has been used for many years in animal and plant breeding (e.g., Meuwissen et al. 2001) and has gained popularity in human genetics during the last decade.

Terminology

Polygenic risk score, polygenic score, genomic risk score, genetic score, genetic predisposition, genetic value, and genomic breeding value are (more or less) names for the same thing.

Complex traits and diseases

For many complex traits and diseases there will be thousands of genetic variants that each contribute a small effect on the disease risk or quantitative trait. Rare variants with large effects will explain only a small proportion of \(h^2\) (low predictive potential), whereas common variants with small effects can explain a larger proportion of \(h^2\) (high predictive potential). The majority of complex traits and common diseases in humans are heritable, and the heritability determines the value of using genetics for risk prediction. In general, large data sets are required to obtain accurate marker estimates of small to moderate effects, which also improves the prediction accuracy.

Heritability

The heritability (\(h^2\)) quantifies the degree of variation in a phenotypic trait in a population that is due to genetic variation between individuals in that population. It measures how much of the variation of a trait can be attributed to variation of genetic factors, as opposed to variation of environmental factors. The narrow-sense heritability is the ratio of additive genetic variance (\(\sigma^2_{a}\)) to the overall phenotypic variance (\(\sigma^2_{y}=\sigma^2_{a}+\sigma^2_{e}\)): \[\begin{align} h^2 &= \sigma^2_{a}/(\sigma^2_a+\sigma^2_e) \end{align}\] A heritability of 0 implies that no genetic effects influence the observed variation in the trait, while a heritability of 1 implies that all of the variation in the trait is explained by genetic effects. In general, the amount of information provided by the phenotype about the genetic risk is determined by the heritability. Note that heritability is population-specific and that a heritability of 0 does not necessarily imply the absence of genetic determination of the trait.
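A small simulation sketch can verify this definition numerically (the variance components 0.3 and 0.7 are arbitrary choices for illustration):

set.seed(1)
n <- 10000
a <- rnorm(n, mean=0, sd=sqrt(0.3))   # additive genetic values
e <- rnorm(n, mean=0, sd=sqrt(0.7))   # environmental deviations
ysim <- a + e                         # simulated phenotype
var(a)/var(ysim)                      # approximately h2 = 0.3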

Brief Introduction to the qgg R package

The practical is based on the R package qgg (Rohde et al. 2021, 2022). This package provides an infrastructure for efficient processing of large-scale genetic and phenotypic data, including core functions for estimating genetic parameters (e.g., heritability), single-marker association analyses, genomic prediction, and gene set enrichment analyses.

qgg handles large-scale data by taking advantage of multi-core processing, multithreaded matrix operations implemented in BLAS libraries (e.g., OpenBLAS or MKL), and memory-efficient batch processing of genotype data stored in binary files (e.g., PLINK bedfiles).

You can install qgg from CRAN with:

install.packages("qgg")


Input data/objects commonly used in the qgg package

All functions in qgg used for analysis of complex traits rely on a simple data infrastructure that takes the following main input:

y: vector, matrix or list of phenotypes
X: design matrix for non-genetic factors
W: matrix of centered and scaled genotypes (in memory)
Glist: list structure providing information on genotypes, sparse LD, and LD scores (on disk)
stat: data frame with marker summary statistics
sets: list of sets with marker ids
ids: vector of ids of individuals
rsids: vector of marker ids

Session 1: Downloading the data using R

In this practical we will compute PGS based on simulated data. The data consist of a disease phenotype, covariates, and SNP data, and are intended for demonstration purposes only.

Load required packages:

library(data.table)


Create (your own) directory for downloading files:

dir.create("./PGS")


Set (your own) working directory for the downloaded files:

setwd("./PGS")


Download the pheno and covar files from the GitHub repository:

url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.pheno"
download.file( url=url, destfile="human.pheno")
url <- "https://github.com/PDRohde/AAU-human-genomics/raw/main/exercises/human.covar"
download.file( url=url, destfile="human.covar")



Session 2: Preparing the phenotype and covariable data using R

One of the first things to do is to prepare the phenotypic data used in the analysis. The goal is to understand the variables: how many records the data set contains, how many values are missing, how the variables are structured, and how they relate to one another.

Several functions can be used for this (e.g., str(), head(), dim(), table(), is.na()).

library(data.table)


Read the phenotype and covariable data files:

pheno <- fread(input="human.pheno", 
               data.table=FALSE)


covar <- fread(input="human.covar", 
               data.table=FALSE)


How many observations and which variables do we have in the data set?

To get an overview of the data set you are working with, you can use the str() or head() functions:

str(pheno)
## 'data.frame':    5000 obs. of  3 variables:
##  $ V1: chr  "IND1" "IND2" "IND3" "IND4" ...
##  $ V2: chr  "IND1" "IND2" "IND3" "IND4" ...
##  $ V3: int  0 0 1 0 0 1 0 0 1 1 ...
str(covar)
## 'data.frame':    5000 obs. of  14 variables:
##  $ V1 : chr  "IND1000" "IND1001" "IND1002" "IND1003" ...
##  $ V2 : chr  "IND1000" "IND1001" "IND1002" "IND1003" ...
##  $ V3 : int  1 0 0 0 0 1 1 1 1 0 ...
##  $ V4 : num  61.2 62.4 65.6 55.4 64.1 ...
##  $ V5 : num  0.00488 -0.0063 -0.00522 0.02028 -0.00411 ...
##  $ V6 : num  -0.002097 0.000148 -0.000606 0.006778 0.004349 ...
##  $ V7 : num  0.01091 0.00469 0.01709 -0.00306 -0.00839 ...
##  $ V8 : num  0.0056 0.01611 -0.00442 -0.01284 0.01022 ...
##  $ V9 : num  -0.008546 -0.00262 0.034218 -0.003078 -0.000912 ...
##  $ V10: num  0.019736 0.000308 -0.005157 0.012835 0.004788 ...
##  $ V11: num  -0.00408 0.00961 -0.00289 -0.01387 0.00977 ...
##  $ V12: num  0.01435 -0.01479 -0.00901 0.01499 0.00899 ...
##  $ V13: num  -0.00152 -0.00708 0.00706 0.00474 -0.00381 ...
##  $ V14: num  0.003998 0.037763 -0.000359 -0.022958 -0.010249 ...


head(pheno)
##     V1   V2 V3
## 1 IND1 IND1  0
## 2 IND2 IND2  0
## 3 IND3 IND3  1
## 4 IND4 IND4  0
## 5 IND5 IND5  0
## 6 IND6 IND6  1
head(covar)
##        V1      V2 V3       V4        V5        V6        V7        V8        V9
## 1 IND1000 IND1000  1 61.24445  0.004881 -0.002097  0.010908  0.005597 -0.008546
## 2 IND1001 IND1001  0 62.42271 -0.006301  0.000148  0.004692  0.016107 -0.002620
## 3 IND1002 IND1002  0 65.64274 -0.005215 -0.000606  0.017091 -0.004416  0.034218
## 4 IND1003 IND1003  0 55.40703  0.020284  0.006778 -0.003061 -0.012837 -0.003078
## 5 IND1004 IND1004  0 64.10666 -0.004105  0.004349 -0.008388  0.010221 -0.000912
## 6 IND1005 IND1005  1 65.77823 -0.010636 -0.021853 -0.017450  0.010058  0.000502
##         V10       V11       V12       V13       V14
## 1  0.019736 -0.004078  0.014350 -0.001518  0.003998
## 2  0.000308  0.009612 -0.014793 -0.007077  0.037763
## 3 -0.005157 -0.002894 -0.009013  0.007064 -0.000359
## 4  0.012835 -0.013872  0.014987  0.004744 -0.022958
## 5  0.004788  0.009773  0.008988 -0.003814 -0.010249
## 6  0.003807 -0.003007  0.005646 -0.007199 -0.009874
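The other functions mentioned above can be used in the same way, for example:

dim(pheno)         # number of rows and columns
table(pheno[,3])   # distribution of the disease phenotype (0/1)
sum(is.na(pheno))  # number of missing values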


How is the phenotype distributed?

Define the response variable

y <- pheno[,3]
names(y) <- pheno[,1]


Use the hist() and boxplot() functions to visualize the distribution of the trait and the covariables:

layout(matrix(c(1,2),ncol=2),widths=c(4,4),heights=4)
hist(y,las=1,cex.axis=.8,col="lightblue", ylab="Number of observations")
boxplot(covar[,4]~y, las=1,cex.axis=.8,col="lightblue",ylab="Values for covariate no 4")


Which factors or covariates influence the phenotype?

Exploratory data analysis is the process of analyzing and visualizing the data to get a better understanding of them; it is not a formal statistical test. Which factors should we include in the statistical model? To answer this question we can fit a logistic regression model that includes these factors.

This can be done using the glm() function:

fit <- glm( y ~ V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14, 
            data=covar, family=binomial(link="logit"))
summary(fit)
## 
## Call:
## glm(formula = y ~ V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + 
##     V12 + V13 + V14, family = binomial(link = "logit"), data = covar)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.181957   0.214978  -5.498 3.84e-08 ***
## V3          -0.011066   0.065316  -0.169   0.8655    
## V4           0.001740   0.003818   0.456   0.6487    
## V5           1.309941   2.306100   0.568   0.5700    
## V6          -0.280830   2.307468  -0.122   0.9031    
## V7           2.354268   2.306546   1.021   0.3074    
## V8          -1.349941   2.306494  -0.585   0.5584    
## V9           5.520940   2.311308   2.389   0.0169 *  
## V10         -2.724333   2.307482  -1.181   0.2377    
## V11         -1.929366   2.307348  -0.836   0.4031    
## V12          0.578919   2.308375   0.251   0.8020    
## V13          4.382604   2.307287   1.899   0.0575 .  
## V14         -0.504532   2.306778  -0.219   0.8269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5645.2  on 4999  degrees of freedom
## Residual deviance: 5631.6  on 4987  degrees of freedom
## AIC: 5657.6
## 
## Number of Fisher Scoring iterations: 4
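A quick way to list the terms with nominally significant effects (\(P<0.05\)) from the fit above:

coefs <- summary(fit)$coefficients
rownames(coefs)[coefs[,"Pr(>|z|)"] < 0.05]   # here, the intercept and V9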


The exploration (including quality control) of phenotypes and covariables is a key step in quantitative genetic analyses. It is, however, beyond the scope of this practical.

Session 3: Preparing the genotype data using R

The preparation (including quality control) of genotype data is a key step in quantitative genetic analyses.

library(qgg)
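The quality control below operates on a Glist object, which must first be prepared from genotype files using gprep(). A minimal sketch, assuming PLINK files named human.bed, human.bim, and human.fam are available in the working directory (these file names are assumptions):

Glist <- gprep( study="PGS practical",      # file names below are assumptions
                task="prepare",
                bedfiles="human.bed",
                bimfiles="human.bim",
                famfiles="human.fam")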


Quality control of genotype data

In general it is advisable to perform quality control of the genotype data. The quality control includes removing markers with a low genotyping rate, a low minor allele frequency, or deviation from Hardy-Weinberg equilibrium. The function gfilter() can be used for filtering markers:

rsids <-  gfilter( Glist = Glist,
                   excludeMAF=0.05,
                   excludeMISS=0.05,
                   excludeCGAT=TRUE,
                   excludeINDEL=TRUE,
                   excludeDUPS=TRUE,
                   excludeHWE=1e-12,
                   excludeMHC=FALSE)


The gfilter() function outputs the number of variants removed in the different quality control steps.

Session 4: Compute GWAS summary statistics

One of the first steps in PGS analyses is to generate or obtain GWAS summary statistics. Ideally these will correspond to the most powerful GWAS results available for the phenotype under study. In this example, we perform a GWAS on the simulated disease phenotype. We use only a subset of the data (training data) in the GWAS and the remaining subset (validation data) to assess the accuracy of the polygenic scores. In the example below we only compute summary statistics for the markers that fulfil the quality control criteria.

Define the response variable

y <- pheno[,3]
names(y) <- pheno[,1]


Create design matrix for the explanatory variables

X <- model.matrix(~V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14, data=covar)
rownames(X) <- covar$V1
X <- X[names(y),]
sum(names(y)%in%rownames(X))
## [1] 5000


Define training and validation samples

# use set.seed() beforehand if a reproducible split is needed
train <- sample(names(y),4000)
valid <- names(y)[!names(y)%in%train]


Computation of GWAS summary statistics

The function glma() can be used for computing GWAS summary statistics. Currently this function only fits a simple linear regression model, but we plan to add further modeling approaches in a future release.

stat <- glma(y=y[train], X=X[train,], Glist=Glist)


Explore the output (stat) from the glma() function:

dim(stat)
## [1] 50000    13
head(stat)
##             rsids chr    pos ea nea    eaf             b         seb
## 1:568670 1:568670   1 568670  C   T 0.1028  4.451442e-03 0.007045907
## 1:761732 1:761732   1 761732  C   T 0.3555  2.485590e-03 0.006863325
## 1:830791 1:830791   1 830791  T   C 0.2644  6.439566e-05 0.006886717
## 1:867663 1:867663   1 867663  T   C 0.4950  8.794431e-04 0.006886269
## 1:916549 1:916549   1 916549  A   G 0.4748  6.634862e-03 0.006894612
## 1:956852 1:956852   1 956852  C   T 0.3221 -7.525315e-03 0.006904714
##                  stat         p    n       ww         wy
## 1:568670  0.631777029 0.5275687 3998 3801.177  16.920717
## 1:761732  0.362155351 0.7172551 3998 4006.377   9.958209
## 1:830791  0.009350705 0.9925398 3998 3979.337   0.256252
## 1:867663  0.127709653 0.8983852 3998 3979.838   3.500041
## 1:916549  0.962325705 0.3359442 3998 3969.310  26.335821
## 1:956852 -1.089880816 0.2758313 3998 3957.444 -29.781015
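A simple way to inspect the distribution of the association \(P\)-values (a sketch; an excess of small \(P\)-values suggests true associations):

hist(stat$p, breaks=50, las=1, col="lightblue",
     xlab="P-value", main="GWAS P-values")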



Session 5: Compute sparse LD matrices

Polygenic scoring based on summary statistics requires the construction of a reference linkage disequilibrium (LD) correlation matrix. The LD matrix contains the correlations between the genotypes of genetic variants across the genome. Here we use a sparse LD matrix computed with a fixed window (e.g., a fixed number of markers, 1 cM, or 1000 kb), which sets LD correlation values outside this window to zero.
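As a toy illustration of the sparsity idea (simulated data; not the qgg implementation), correlations outside a fixed marker window are simply set to zero:

set.seed(1)
W <- matrix(rnorm(100*10), nrow=100)       # 100 individuals, 10 markers
ld <- cor(W)                               # full LD (correlation) matrix
msize <- 2                                 # window size in number of markers
ld[abs(row(ld)-col(ld)) > msize] <- 0      # zero correlations outside the window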

The function gprep() can be used to compute sparse LD matrices, which are stored on disk. The \(r^2\) metric used is the squared pairwise correlation between markers (based on the allele counts of the alternative allele) in a specified region of the genome. Although this step can be slow unless R is linked to a fast BLAS library, it is typically only done once (or a few times).

Define filenames for the sparse LD matrices.

ldfiles <- "human.ld"


Compute the sparse LD matrices using only the filtered rsids (please read the note below first!):

Glist <- gprep( Glist,
                task="sparseld",
                msize=1000,
                rsids=rsids,
                ldfiles=ldfiles,
                overwrite=TRUE)
saveRDS(Glist, file="Glist_sparseLD_1k.RDS", compress=FALSE)


Note that computing the LD matrices may take a long time; instead, you can download the precomputed LD file and the updated Glist below. Remember to read the new Glist into R.

url <- "https://www.dropbox.com/scl/fi/5g9i631gq67wunk6njptr/human.ld?rlkey=00yk4oev7drk33syp6cjjke8j&dl=0"
download.file( url=url, destfile="human.ld")
url <- "https://www.dropbox.com/scl/fi/r23vhgt5okw37gohz71vu/Glist_sparseLD_1k.RDS?rlkey=ktv1p427kct9j13sydnn5k2a3&dl=0"
download.file( url=url, destfile="Glist_sparseLD_1k.RDS")




Session 6: Compute PGS using clumping and thresholding (C+T)

Polygenic scoring using clumping and thresholding is a relatively simple and robust method. Linkage disequilibrium makes identifying the contribution from causal independent genetic variants extremely challenging. One way of approximately capturing the right level of causal signal is to perform clumping, which removes markers such that only weakly correlated SNPs are retained, preferentially keeping the SNPs most strongly associated with the phenotype under study. The clumping procedure uses a statistic (usually the \(P\)-value) to sort the markers by importance (e.g., keeping the most significant ones). It takes the first marker (e.g., the most significant one) and removes markers (i.e., sets their effects to zero) that are too correlated with it (e.g., \(r^2>0.9\)) within a window around it. As opposed to pruning, this procedure ensures that the selected marker is never removed, keeping at least one representative marker per region of the genome. It then proceeds with the next most significant marker that has not yet been removed. A schematic sketch of this selection step is shown below.
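The sketch below is illustrative only (not the qgg implementation); r2mat is assumed to be a marker-by-marker \(r^2\) matrix:

clumpEffects <- function(b, p, r2mat, r2max=0.9) {
  keep <- rep(TRUE, length(b))
  for (i in order(p)) {                            # visit markers by increasing P-value
    if (!keep[i]) next                             # already removed by a stronger marker
    drop <- setdiff(which(r2mat[i,] > r2max), i)
    keep[drop] <- FALSE                            # remove correlated neighbours
  }
  b[!keep] <- 0                                    # set removed marker effects to zero
  b
}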

Clumping and thresholding

Clumping can be performed using the adjStat() function in qgg. The inputs to the function are the summary statistics (stat), information about the sparse LD matrices stored in the Glist, a linkage disequilibrium threshold (e.g., \(r^2=0.9\)), and thresholds for the \(P\)-values (threshold = c(0.001, 0.05, ...)):

threshold <- c(0.00001, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5,1)
statAdj <- adjStat(Glist=Glist, stat=stat, r2=0.9, threshold=threshold)


Explore the output (statAdj) using the head() function:

head(statAdj)
##             rsids chr    pos ea nea    eaf             b b_1e.05 b_1e.04
## 1:568670 1:568670   1 568670  C   T 0.1028  4.451442e-03       0       0
## 1:761732 1:761732   1 761732  C   T 0.3555  2.485590e-03       0       0
## 1:830791 1:830791   1 830791  T   C 0.2644  6.439566e-05       0       0
## 1:867663 1:867663   1 867663  T   C 0.4950  8.794431e-04       0       0
## 1:916549 1:916549   1 916549  A   G 0.4748  6.634862e-03       0       0
## 1:956852 1:956852   1 956852  C   T 0.3221 -7.525315e-03       0       0
##          b_0.001 b_0.005 b_0.01 b_0.05 b_0.1 b_0.2        b_0.5           b_1
## 1:568670       0       0      0      0     0     0  0.000000000  4.451442e-03
## 1:761732       0       0      0      0     0     0  0.000000000  2.485590e-03
## 1:830791       0       0      0      0     0     0  0.000000000  6.439566e-05
## 1:867663       0       0      0      0     0     0  0.000000000  8.794431e-04
## 1:916549       0       0      0      0     0     0  0.006634862  6.634862e-03
## 1:956852       0       0      0      0     0     0 -0.007525315 -7.525315e-03


A plot of the unadjusted marker effects (from the stat data frame) against the adjusted marker effects (from the statAdj data frame) illustrates that the C+T procedure keeps only the most significant marker effects and sets a large number of marker effects to zero (i.e., removes their effect).

plot( y=statAdj[rownames(stat),"b_0.001"], bg="lightblue",pch=21,las=1, cex.axis=.8,
     x=stat$b,
     xlab="Marginal Effect",
     ylab="Adjusted Effect",
     frame.plot=FALSE, ylim=c(-0.05,0.05), xlim=c(-0.05,0.05),
     main="Shrinkage using C+T \n (p=0.001, r2=0.9)")


Compute polygenic scores

For each of the \(P\)-value thresholds chosen in the C+T procedure a PGS is computed as: \[PGS=\sum_{i=1}^mX_i \hat{b}_i\] where \(X_i\) is the genotype of the i’th genetic marker and \(\hat{b}_i\) its adjusted weight. The PGS are computed using the gscore() function. The inputs to the function are the adjusted summary statistics (statAdj) and the genotype information stored in the Glist:

pgs <- gscore(Glist=Glist,stat=statAdj)


Explore polygenic scores

It is always important to explore the computed PGS.

head(pgs)
##                  b    b_1e.05      b_1e.04     b_0.001    b_0.005     b_0.01
## IND1000 -1.0412243 -0.3288854 -0.875368300 -1.95246475 -1.3821925 -1.1443251
## IND1001 -0.6144810 -0.4486602 -1.015548505  0.05018968 -0.1643173  0.3441776
## IND1002 -1.0048697 -1.0522848 -1.075989028 -1.55637597 -1.7488470 -1.6067176
## IND1003  2.1195070 -0.4486602  1.927681199  1.23181559  1.1651841  1.0668963
## IND1004  2.1771254  0.5329041 -0.389215989  1.95246583  2.0914713  1.7115670
## IND1005 -0.6129487 -1.0522848 -0.005249977  1.00026982  0.4433743  0.2504299
##              b_0.05      b_0.1      b_0.2      b_0.5        b_1
## IND1000 -1.44642634 -1.3367114 -0.9957876 -1.0338984 -1.0412243
## IND1001 -0.42121050 -0.3408688 -0.5442877 -0.5608317 -0.6144810
## IND1002 -1.43573292 -1.4554880 -1.2180945 -0.9258928 -1.0048697
## IND1003  1.60900936  1.8571193  1.9642979  2.1414632  2.1195070
## IND1004  2.30023211  2.7352337  2.3905093  2.2214144  2.1771254
## IND1005  0.05146917 -0.3235978 -0.2582344 -0.6457216 -0.6129487
cor(pgs)
##                 b   b_1e.05   b_1e.04   b_0.001   b_0.005    b_0.01    b_0.05
## b       1.0000000 0.1493624 0.2588749 0.4203741 0.6168561 0.6998325 0.8864956
## b_1e.05 0.1493624 1.0000000 0.6153340 0.3666984 0.2395178 0.2233771 0.1747886
## b_1e.04 0.2588749 0.6153340 1.0000000 0.6199209 0.4103268 0.3668911 0.2924481
## b_0.001 0.4203741 0.3666984 0.6199209 1.0000000 0.6772010 0.5925342 0.4758374
## b_0.005 0.6168561 0.2395178 0.4103268 0.6772010 1.0000000 0.8721669 0.6964916
## b_0.01  0.6998325 0.2233771 0.3668911 0.5925342 0.8721669 1.0000000 0.7920971
## b_0.05  0.8864956 0.1747886 0.2924481 0.4758374 0.6964916 0.7920971 1.0000000
## b_0.1   0.9344395 0.1648044 0.2760368 0.4477530 0.6591185 0.7503000 0.9462316
## b_0.2   0.9711219 0.1516116 0.2646875 0.4314890 0.6356136 0.7226863 0.9124163
## b_0.5   0.9956874 0.1498501 0.2593980 0.4217727 0.6188635 0.7028113 0.8905765
## b_1     1.0000000 0.1493624 0.2588749 0.4203741 0.6168561 0.6998325 0.8864956
##             b_0.1     b_0.2     b_0.5       b_1
## b       0.9344395 0.9711219 0.9956874 1.0000000
## b_1e.05 0.1648044 0.1516116 0.1498501 0.1493624
## b_1e.04 0.2760368 0.2646875 0.2593980 0.2588749
## b_0.001 0.4477530 0.4314890 0.4217727 0.4203741
## b_0.005 0.6591185 0.6356136 0.6188635 0.6168561
## b_0.01  0.7503000 0.7226863 0.7028113 0.6998325
## b_0.05  0.9462316 0.9124163 0.8905765 0.8864956
## b_0.1   1.0000000 0.9625530 0.9381768 0.9344395
## b_0.2   0.9625530 1.0000000 0.9751100 0.9711219
## b_0.5   0.9381768 0.9751100 1.0000000 0.9956874
## b_1     0.9344395 0.9711219 0.9956874 1.0000000
layout(matrix(1:4,ncol=2, byrow=TRUE))
hist(pgs[,"b"], main="PGS raw", las=1, col="lightblue", xlab="PGS - all individuals")
hist(pgs[,"b_0.001"], main="PGS P: 0.0001", las=1, col="lightblue", xlab="PGS - all individuals")
hist(pgs[valid,"b"], main="PGS raw", las=1, col="lightblue", xlab="PGS - validation set")
hist(pgs[valid,"b_0.001"], main="PGS P: 0.0001", las=1, col="lightblue", xlab="PGS - validatoin set")


Evaluate polygenic scores

The \(P\)-value threshold that provides the “best-fit” PGS under the C+T method is usually unknown. To approximate the “best-fit” PGS, we can regress the phenotype on the PGS calculated at a range of \(P\)-value thresholds and then select the PGS that explains the highest proportion of phenotypic variance (e.g., \(R^2\)) or has the highest AUC. This can be achieved using the acc() function as follows:

paCT <- acc(yobs=y[valid], ypred=pgs[valid,], typeoftrait="binary")
paCT
##          Corr    R2 Nagel R2   AUC intercept slope  MSPE
## b       0.109 0.012    0.018 0.573     0.249 0.139 0.333
## b_1e.05 0.067 0.004    0.007 0.545     0.248 0.029 1.189
## b_1e.04 0.097 0.009    0.014 0.570     0.249 0.044 1.100
## b_0.001 0.158 0.025    0.037 0.603     0.247 0.074 0.956
## b_0.005 0.096 0.009    0.014 0.562     0.248 0.051 0.840
## b_0.01  0.122 0.015    0.022 0.578     0.248 0.070 0.738
## b_0.05  0.098 0.010    0.014 0.566     0.248 0.078 0.495
## b_0.1   0.101 0.010    0.015 0.567     0.248 0.093 0.428
## b_0.2   0.105 0.011    0.017 0.572     0.248 0.113 0.374
## b_0.5   0.119 0.014    0.021 0.577     0.249 0.147 0.337
## b_1     0.109 0.012    0.018 0.573     0.249 0.139 0.333
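The best-performing threshold can then be extracted directly, for example by maximizing the AUC:

rownames(paCT)[which.max(paCT[,"AUC"])]   # name of the best PGS column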


par(xpd=T)
bp <- barplot(paCT[-1,"Corr"], ylab="Correlation (y,PGS)", xlab="P-value cutoff", las=1, cex.axis=.8,col=rev(colorRampPalette(c("#FFFFCCFF", "#FF2A00FF"))(nrow(paCT)-1)),names=NA)
text(x=bp,srt=45,adj=1,y=-.005,labels=gsub("b_","",rownames(paCT)[-1]),cex=.8)


For comparison, we can see how strongly the PGS correlates with \(y\) for the individuals used in the GWAS (the training data).

paCT <- acc(yobs=y[train], ypred=pgs[train,], typeoftrait="binary")


par(xpd=T)
bp <- barplot(paCT[-1,"Corr"], ylab="Correlation (y,PGS)", xlab="P-value cutoff", las=1, cex.axis=.8,col=rev(colorRampPalette(c("#FFFFCCFF", "#FF2A00FF"))(nrow(paCT)-1)),names=NA)
text(x=bp,srt=45,adj=1,y=-.05,labels=gsub("b_","",rownames(paCT)[-1]),cex=.8)


Plot polygenic scores

For visualization, the PGS can be divided into groups (e.g., deciles) and the disease prevalence computed within each group.

yobs <- y[valid]
# pick the threshold with the highest AUC (note: paCT was last computed on the training data)
ypred <- pgs[names(y[valid]),which.max(paCT[,"AUC"])]

nbin <- 10
# split individuals into nbin equally sized groups ordered by their PGS
qsets <- qgg:::splitWithOverlap( names(ypred)[order(ypred)],length(ypred)/nbin,0)
qy <- sapply(qsets,function(x){mean(yobs[x])})  # disease prevalence per group
qg <- sapply(qsets,function(x){mean(ypred[x])}) # mean PGS per group

colfunc <- colorRampPalette(c("lightblue", "darkblue"))

layout(matrix(c(1,2),ncol=2),heights=c(4),widths=c(4,4))
plot(y=qy,x=qg,pch=19,ylab="Proportion of cases",xlab="Mean PGS", col=colfunc(nbin), frame.plot=FALSE, las=1,cex.axis=.8)
plot(y=qy,x=(1:nbin)/nbin,pch=19,ylab="Proportion of cases",xlab="Percentile of PGS", col=colfunc(nbin), frame.plot=FALSE, las=1,cex.axis=.8)