Population geneticists analyze the extent and patterns of genetic variation within populations, the distribution of genotypes, and the influence of evolutionary forces like natural selection and genetic drift. These factors drive evolutionary change and speciation. Based on your current understanding, how would you address the following questions?
Which term refers to the complete set of genetic information shared by all individuals within a population?
Various mechanisms help preserve genetic diversity within a population. Why is maintaining this diversity beneficial?
Human serum contains a protein called haptoglobin, which binds to hemoglobin following the lysis of erythrocytes. In some populations, three electrophoretic haptoglobin variants are observed, determined by three autosomal co-dominant alleles: A, B, and C. Among 500 individuals, the following genotype counts were recorded:
No individuals | Total | |||||
---|---|---|---|---|---|---|
AA | AB | AC | BB | BC | CC | |
109 | 123 | 128 | 33 | 70 | 37 | 500 |
Cystic fibrosis (CF) is an autosomal recessive disease. In individuals who are homozygous for the recessive allele, a malfunction in salt transport occurs in certain cell types, such as those in the lungs. This defect leads to thick mucus buildup, increasing the risk of bacterial lung infections.
Historically, only a small number of individuals with CF survived into adulthood. In Europe, the birth prevalence of CF has been approximately 1 in 2,500 newborns. This frequency is maintained by a balance between mutation and natural selection.
In a European sample, cystic fibrosis (CF) carrier status was analyzed based on three genotypes: 16,890 individuals were homozygous for the normal allele (RR), 855 individuals were heterozygous carriers (Rr), and 18 individuals were homozygous for the recessive CF allele (rr).
A geneticist studying a human population discovers that a recessive allele (s) associated with sickle cell disease is present in a community of 1,024 individuals. Individuals carrying two copies of the recessive allele (ss) develop sickle cell disease, while those with at least one dominant allele (S) have normal red blood cells. Within the population, 1 in 256 individuals has sickle cell disease, indicating they are homozygous recessive (ss).
This exercise should be solved using R.
Run every step in the code below, and explain what each step does.
# define alleles
A1 <- "A"
A2 <- "a"
# define frequency of A1
p <- 0.2
allele <- c(rep(A1,p*10),rep(A2,(1-p)*10))
# define population size
N <- 100
# define population
pop <- matrix(nrow=N, ncol=2)
for(i in 1:nrow(pop)){
pop[i, 1] <- sample(allele, 1)
pop[i, 2] <- sample(allele, 1)
}
# combine alleles into genotype
pop <- cbind(pop, paste(pop[,1],pop[,2],sep=""))
We can count how many we have of each genotype using the
table(snp)
function.
# genotype count
obs.count <- table(pop[,3])
# combine the two heterozygote genotypes (Aa and aA is the same)
pop[,3][pop[,3]=="aA"] <- "Aa"
obs.count <- table(pop[,3])
# total number of observations
Nobs <- sum(obs.count)
# genotype frequencies
geno.freq <- obs.count / Nobs
# allele frequencies
p.A <- (2*obs.count["AA"] + obs.count["Aa"]) / (2*Nobs)
p.a <- (2*obs.count["aa"] + obs.count["Aa"]) / (2*Nobs)
# set global parameters
N <- seq(10,1000,10)
rep <- 10
A1 <- "A"
A2 <- "a"
p <- 0.2
allele <- c(rep(A1,p*10),rep(A2,(1-p)*10))
p.A <- matrix(nrow=rep, ncol=length(N))
colnames(p.A) <- N
rownames(p.A) <- paste("rep",1:rep,sep="")
# run simulations
for(i in 1:length(N)){
for(j in 1:rep){
pop <- matrix(nrow=N[i], ncol=2)
for(k in 1:nrow(pop)){
pop[k, 1] <- sample(allele, 1)
pop[k, 2] <- sample(allele, 1)
}
pop <- cbind(pop, paste(pop[,1],pop[,2],sep=""))
pop[,3][pop[,3]=="aA"] <- "Aa"
obs.count <- table(pop[,3])
Nobs <- sum(obs.count)
geno.freq <- obs.count / Nobs
p.A[j,i] <- (2*obs.count["AA"] + obs.count["Aa"]) / (2*Nobs)
}
}
p.A
contains many values that are
NA
?library(ggplot2)
library(reshape2)
p.A <- melt(p.A)
ggplot(p.A[!is.na(p.A$value),], aes(x = Var2, y = value, fill = factor(Var2))) +
geom_hline(yintercept = p, linetype = "dashed", color = "darkorange", linewidth = 0.75) +
geom_boxplot(alpha = 0.6) +
coord_cartesian(ylim = c(0.1, 0.35)) +
labs(x = "N", y = "p(A)", title = "") +
theme_classic() + theme(legend.position = "none")
In the second part of these exercises, we will examine deviations from expected genotype frequencies. According to the Hardy-Weinberg principle, genotype frequencies can be predicted from allele frequencies under certain assumptions. Specifically, if allele A has a frequency of p and allele a has a frequency of q, the expected genotype frequencies are:
Next, we will create a figure that visualizes the predicted genotype frequencies for different values of p. Take the time to review and understand the code used to generate it.
curve(2*x*(1-x),col="darkblue", ylim=c(0,1), lwd=2, xlab="Allele frequency", ylab="Genotype frequency", las=1)
text(x=0.5, y=0.7, labels="Aa", col="darkblue", font=2)
curve(x**2,col="darkgreen", add=T, ylab="NULL", ylim=c(0,1), lwd=2)
text(x=0.9, y=0.7, labels="AA", col="darkgreen", font=2)
curve((1-x)**2,col="darkred",add=T, ylab="NULL", ylim=c(0,1),lwd=2)
text(x=0.1, y=0.7, labels="aa", col="darkred", font=2)
In a study of 10,000 individuals of Danish ancestry, researchers analyzed two genetic loci, including the CFTR gene (associated with cystic fibrosis) and the APOE gene (linked to Alzheimer’s disease risk). These loci were chosen due to their well-established roles in disease susceptibility and health outcomes, providing insights into the genetic factors influencing these conditions in the population.
CFTR | APOE | |||||
---|---|---|---|---|---|---|
WT | WT |
WT | F508del |
F508del | F508del | ε3 | ε3 | ε3 | ε4 | ε4 | ε4 | |
3156 | 4997 | 1847 | 6837 | 3065 | 98 |
Cystic fibrosis (CF) is an autosomal, recessive disease. Homozygotes individuals (F508del|F508del) have reduced function of the salt transporters in some cell types, including the lungs. This entails mucus in the airways which increases the risk of bacterial infections.
Until recently, only few individuals survived to adulthood. In Europe, the frequency of newborns with CF has historical been 1:2500. This rate is maintained by a balance between selection (\(s\)) and mutation (μ).
In a study blood samples were collected to determine the genotype frequency of a certain CF mutation. The disease-causing allele is called F508del, while the non-disease-causing allele is called WT. The genotypes of the enrolled participants are shown in the table below.
CFTR | ||
---|---|---|
WT | WT |
WT | F508del |
F508del | F508del |
16890 | 855 | 18 |
Out of 2,400 births at a UK hospital, 6 newborn deaths were recorded, all of which were attributed to a colon defect caused by a recessive lethal allele (co) that follows an autosomal inheritance pattern. Assuming Hardy-Weinberg equilibrium, answer the following:
What is the frequency of the co allele in the newborn population?
What is the frequency of heterozygous carriers in both newborns
and adults?
Red-green colour blindness is caused by a recessive, X-linked gene (f). In a large population, the frequency of the f allele is denoted as \(q\) in both males and females. Assuming random mating and no selection, answer the following:
Now we will use R to test for Hardy-Weinberg proportion. Run every step in the code below, and explain what each step does.
First we define variables.
# define genotype counts
AA <- 375
Aa <- 165
aa <- 32
# find the total number of individuals
n <- AA+Aa+aa
We can calculate the allele frequencies from the observed genotype counts.
p <- (2*AA + Aa) / (2*n)
q <- c(2*aa + Aa) / (2*n)
# check that we have counted all alleles
p+q == 1
Using the allele frequencies we can compute the expected genotype frequencies if the population is in Hardy-Weinberg equlibrium.
# calculating expected values
E.AA <- n*p**2
E.Aa <- n*2*p*q
E.aa <- n*q**2
# check that we have counted all individuals
E.AA + E.Aa + E.aa == n
We can do a formal statistical test for whether the number of genotypes we observe matches our expectations. This is done using a \(\chi^2\)-test which is defined as: \[\chi^2 = \sum_{i=1}^{geno} = \frac{(E - O)^2}{E}\] where \(E\) is the expected number of counts, and O is the observed number of counts, which is summed across all genotype categories (\(geno\)).
The \(\chi^2\)-statistican can be calculated as:
# compute chi-square test-statistic
chi2 <- (E.AA-AA)^2/E.AA + (E.Aa-Aa)^2/E.Aa + (E.aa-aa)^2/E.aa
Next we have to determine whether the \(\chi^2\)-statistic is larger than expected.
To compute the probability value (i.e., P-value) from a \(\chi^2\)-distribution we can use the
function pchisq()
with the argument
lower.tail=FALSE
because we to see the probability that the
\(\chi^2\) value being higher than it
is. To have an appropriate distribution, we need to add how many degrees
of freedom (df) to consider, \(df=\frac{n(n1)}{n}\), where \(n=\textrm{no. alleles}\). Thus, in our
example, \(df=\frac{2(2-1)}{2}=1\).
# obtain P-value
pval <-pchisq(chi2, df=1, lower.tail=FALSE)
Therefore, the P-value is 0.018, indicating that at a 5% significance level, our sample does not support the Hardy-Weinberg equilibrium. The figure below shows the \(\chi^2\)-distribution with \(df=1\) where the vertical line shows the \(\chi^2\)-value from our analysis.
We can visualise the observed and expected genotype counts to get an idea of why the sample is not in Hardy-Weinberg.
geno <- c(AA, Aa, aa)
expe <- c(E.AA, E.Aa, E.aa)
dat <- matrix(c(geno,expe),nrow=2, byrow=T)
barplot(dat, beside=T, col=c("#6D8325FF", "#BD5630FF"), names.arg=c("AA","Aa","aa"), las=1, cex.axis=.8, xlab="Genotype", ylab="Genotype counts")
legend("topright",legend=c("Observed", "Expected"),pch=15, col=c("#6D8325FF", "#BD5630FF"), cex=.8)
The ability to taste the bitter compound phenylthiocarbamide (PTC) is controlled by a dominant allele (T), meaning individuals with at least one T allele can perceive its bitterness, while t homozygotes cannot. PTC tasting is linked to variation in the TAS2R38 gene, which encodes a bitter taste receptor on the tongue.
PTC itself is not found in foods but is chemically similar to naturally occurring bitter compounds in certain vegetables, such as broccoli, Brussels sprouts, kale, and cabbage. The ability to taste these compounds may have evolutionary significance, as it could help individuals avoid toxic plants that contain bitter-tasting alkaloids.
In this R-exercise, we will investigate, how population size influence the fixation and extinction probabilities of individual alleles.
The functionsim
allow you to simulate
num_populations
with a population size given by
pop_size
(which can vary between the different populations)
for generations
. The starting allele frequency can be
specified for each simulated population with
allele_freqs
.
# Function to simulate allele frequency across generations for multiple populations
sim <- function(pop_size, generations, num_populations = 1, allele_freqs = 0.5) {
# Create a list to store allele frequencies for each population
freq_matrix_list <- vector("list", num_populations)
# Run the simulations
for (p in 1:num_populations) {
allele_freq <- allele_freqs[p]
freq_matrix <- matrix(NA, nrow = generations, ncol = 1)
current_freq <- allele_freq
freq_matrix[1, 1] <- current_freq # Store the initial allele frequency
for (gen in 2:generations) {
# Simulate the allele frequency for the next generation using binomial sampling
a <- rbinom(1, 2*pop_size, current_freq)
current_freq <- a/(2*pop_size)
freq_matrix[gen, 1] <- current_freq
}
freq_matrix_list[[p]] <- freq_matrix
}
return(freq_matrix_list)
}
The function below allow us to run several simulations and collect the results.
run_sim <- function(population_sizes=NULL, generations=NULL, num_populations=NULL, initial_allele_freqs=NULL){
# Store the results in a data frame
results <- data.frame()
# Run the simulation for different population sizes
for (pop_size in population_sizes) {
freq_matrix_list <- sim(pop_size,generations,num_populations,allele_freqs = initial_allele_freqs)
# Prepare the data for plotting
for (p in 1:num_populations) {
freq_matrix <- freq_matrix_list[[p]]
temp_df <- data.frame(
generation = rep(1:generations),
freq = as.vector(freq_matrix),
population_size = rep(pop_size, generations),
population_num = rep(p, generations))
results <- rbind(results, temp_df)
}
}
return(results)
}
In the first example, we simulate 10 different populations, each consisting of 100 individuals, with an initial allele frequency of 0.5. We will run the simulation for 100 generations.
# Parameters
population_sizes <- 100
generations <- 100
num_populations <- 3
initial_allele_freqs <- rep(0.5,num_populations)
sim.res <- run_sim(population_sizes=population_sizes,
generations=generations,
num_populations=num_populations,
initial_allele_freqs=initial_allele_freqs)
# Count how many populations that has been fixed
sim.res$fixed <- paste(sim.res$population_size,":",sim.res$population_num,sep="")
is.fixed <- NULL
for(pop in unique(sim.res$fixed)){
is.fixed <- c(is.fixed, any(sim.res[sim.res$fixed==pop,]$freq==0 | sim.res[sim.res$fixed==pop,]$freq==1))
}
With the code below we can plot the results.
ggplot(sim.res, aes(x = generation, y = freq,
group = interaction(population_size, population_num),
color = as.factor(fixed))) +
geom_line(alpha = 0.7) +
labs(title = paste("Number of populations fixed:", sum(is.fixed)),
x = "Generation",
y = "Allele Frequency",
color = "Population Number") +
scale_y_continuous(limits = c(0, 1)) +
theme_minimal() +
theme(legend.position = "bottom")
population_sizes <- c(10,50,100)
. How does that affect
the results?initial_allele_freqs
affect the
probability of fixation? Allow migration
A fraction
(migration_rate
) of individuals in the population is
replaced by migrants from an external source population. The new allele
frequency after migration is:
\[p{\prime} = (1 - m) p + m P_m\]
Allow mutations
The allele frequency update
follows:
\[p{\prime} = (1 - \mu) p + \mu (1 - p)\]
# Function to simulate allele frequency changes with drift, migration, and mutation
sim <- function(pop_size, generations, num_populations = 1, allele_freqs = 0.5,
migration_rate = 0, migrant_freq = 0.5, mu = 0) {
# Create a list to store allele frequencies for each population
freq_matrix_list <- vector("list", num_populations)
# Run the simulations
for (p in 1:num_populations) {
allele_freq <- allele_freqs[p]
freq_matrix <- matrix(NA, nrow = generations, ncol = 1)
current_freq <- allele_freq
freq_matrix[1, 1] <- current_freq # Store the initial allele frequency
for (gen in 2:generations) {
# Step 1: Genetic Drift (Binomial Sampling)
a <- rbinom(1, 2 * pop_size, current_freq)
current_freq <- a / (2 * pop_size)
# Step 2: Migration Effect
if (migration_rate > 0) {
current_freq <- (1 - migration_rate) * current_freq + migration_rate * migrant_freq
}
# Step 3: Mutation Effect
if (mu > 0) {
current_freq <- (1 - mu) * current_freq + mu * (1 - current_freq)
}
freq_matrix[gen, 1] <- current_freq
}
freq_matrix_list[[p]] <- freq_matrix
}
return(freq_matrix_list)
}
Below we simulate three independent populations
(num_populations <- 3
) of the same size
(pop_size <- 100
), three diferent starting values of
allele frequencies (allele_freqs <- c(0.4, 0.6, 0.8)
),
and a mutation-rate of 0.01.
# Setup parameters for simulations allowing mutations
pop_size <- 100
generations <- 50
num_populations <- 3
allele_freqs <- c(0.4, 0.6, 0.8)
migration_rate <- 0
mu <- 0.01
sim.res <- sim(pop_size = pop_size,
generations = generations,
num_populations = num_populations,
allele_freqs = allele_freqs,
migration_rate = migration_rate,
mu = mu)
sim.res <- do.call(rbind,sim.res)
sim.res <- data.frame(freq=sim.res[,1],
generation=rep(1:generations, num_populations),
af=sort(rep(allele_freqs, generations)))
ggplot(sim.res, aes(x = generation, y = freq,
color = as.factor(af))) +
geom_line(alpha = 0.7) +
labs(title ="",
x = "Generation",
y = "Allele Frequency",
color = "Initial allele frequency") +
scale_y_continuous(limits = c(0, 1)) +
theme_minimal() +
theme(legend.position = "bottom")
mu=0.005
,mu=0.01
.Next, we look into the effect of migration. We run three simulations
with varying degree of migration
(m <- c(0,0.01,0.1)
).
# Setup parameters for simulations allowing migration
pop_size <- 100
generations <- 50
num_populations <- 10
allele_freqs <- rep(0.5,10)
mu <- 0
m <- c(0,0.01,0.1)
sim.res <- vector(length(m),mode="list")
for(i in 1:length(m)){
tmp.res <- sim(pop_size = pop_size,
generations = generations,
num_populations = num_populations,
allele_freqs = allele_freqs,
migration_rate = m[i],
mu = mu)
tmp.res <- do.call(rbind,tmp.res)
tmp.res <- data.frame(freq=tmp.res[,1],
generation=rep(1:generations, num_populations),
pop=sort(rep(1:num_populations,generations)),
migration=rep(m[i],nrow(tmp.res)))
sim.res[[i]] <- tmp.res
}
sim.res <- do.call(rbind,sim.res)
ggplot(sim.res, aes(x = generation, y = freq,
color = as.factor(pop))) +
facet_wrap(~ migration) +
geom_line(alpha = 0.7) +
labs(title ="",
x = "Generation",
y = "Allele Frequency",
color = "Population") +
scale_y_continuous(limits = c(0, 1)) +
theme_minimal() +
theme(legend.position = "bottom")