Human Genomics

Part I

Exercise 1

Population geneticists analyze the extent and patterns of genetic variation within populations, the distribution of genotypes, and the influence of evolutionary forces like natural selection and genetic drift. These factors drive evolutionary change and speciation. Based on your current understanding, how would you address the following questions?

How can we measure genetic variation within a population?
How can we determine whether a population’s genetic structure is changing over time?

Exercise 2

Which term refers to the complete set of genetic information shared by all individuals within a population?

Gene pool
Genome
Chromosome complement
Breeding unit
Race

Various mechanisms help preserve genetic diversity within a population. Why is maintaining this diversity beneficial?

Homozygosity provides an evolutionary advantage.
Increased diversity promotes inbreeding benefits.
Genetic diversity enhances a population’s ability to adapt to environmental changes.
Greater genetic diversity increases the likelihood of haploidy.
Genetic diversity prevents diploidy.

Exercise 3

Human serum contains a protein called haptoglobin, which binds to hemoglobin following the lysis of erythrocytes. In some populations, three electrophoretic haptoglobin variants are observed, determined by three autosomal co-dominant alleles: A, B, and C. Among 500 individuals, the following genotype counts were recorded:

No individuals						Total
AA	AB	AC	BB	BC	CC
109	123	128	33	70	37	500

Compute the genotype frequencies within the sample.
Compute the allele frequency within the sample.

Exercise 4

Cystic fibrosis (CF) is an autosomal recessive disease. In individuals who are homozygous for the recessive allele, a malfunction in salt transport occurs in certain cell types, such as those in the lungs. This defect leads to thick mucus buildup, increasing the risk of bacterial lung infections.

Historically, only a small number of individuals with CF survived into adulthood. In Europe, the birth prevalence of CF has been approximately 1 in 2,500 newborns. This frequency is maintained by a balance between mutation and natural selection.

In a European sample, cystic fibrosis (CF) carrier status was analyzed based on three genotypes: 16,890 individuals were homozygous for the normal allele (RR), 855 individuals were heterozygous carriers (Rr), and 18 individuals were homozygous for the recessive CF allele (rr).

Calculate the frequency of each allele in the sample.

Exercise 5

A geneticist studying a human population discovers that a recessive allele (s) associated with sickle cell disease is present in a community of 1,024 individuals. Individuals carrying two copies of the recessive allele (ss) develop sickle cell disease, while those with at least one dominant allele (S) have normal red blood cells. Within the population, 1 in 256 individuals has sickle cell disease, indicating they are homozygous recessive (ss).

Assuming random mating and no selection for this gene, how many individuals with normal red blood cells are carriers of the recessive allele?

Exercise 6

This exercise should be solved using R.

Run every step in the code below, and explain what each step does.

# define alleles
  A1 <- "A"
  A2 <- "a"

# define frequency of A1
  p <- 0.2
  allele <- c(rep(A1,p*10),rep(A2,(1-p)*10))

# define population size
  N <- 100

# define population
  pop <- matrix(nrow=N, ncol=2)
  
  for(i in 1:nrow(pop)){
    pop[i, 1] <- sample(allele, 1)
    pop[i, 2] <- sample(allele, 1)
  }

# combine alleles into genotype
  pop <- cbind(pop, paste(pop[,1],pop[,2],sep=""))

We can count how many we have of each genotype using the table(snp) function.

Determine the genotype and allele frequency.
Make sure you understand every step.

# genotype count
  obs.count <- table(pop[,3])

# combine the two heterozygote genotypes (Aa and aA is the same)
  pop[,3][pop[,3]=="aA"] <- "Aa"
  obs.count <- table(pop[,3])

# total number of observations
  Nobs <- sum(obs.count)

# genotype frequencies
  geno.freq <- obs.count / Nobs

# allele frequencies
  p.A <- (2*obs.count["AA"] + obs.count["Aa"]) / (2*Nobs)
  p.a <- (2*obs.count["aa"] + obs.count["Aa"]) / (2*Nobs)

Does it match the expectation based on how you simulated the genotypes?
We can redo the simulations a number of times, and with different population sizes.

# set global parameters
  N <- seq(10,1000,10)
  rep <- 10
  A1 <- "A"
  A2 <- "a"
  p <- 0.2
  allele <- c(rep(A1,p*10),rep(A2,(1-p)*10))

  p.A <- matrix(nrow=rep, ncol=length(N))  
  colnames(p.A) <- N
  rownames(p.A) <- paste("rep",1:rep,sep="")
  
# run simulations
  for(i in 1:length(N)){
    for(j in 1:rep){
      pop <- matrix(nrow=N[i], ncol=2)  
      
      for(k in 1:nrow(pop)){
        pop[k, 1] <- sample(allele, 1)
        pop[k, 2] <- sample(allele, 1)
      }
    pop <- cbind(pop, paste(pop[,1],pop[,2],sep=""))
    pop[,3][pop[,3]=="aA"] <- "Aa"
    
    obs.count <- table(pop[,3])
    Nobs <- sum(obs.count)
    geno.freq <- obs.count / Nobs
    p.A[j,i] <- (2*obs.count["AA"] + obs.count["Aa"]) / (2*Nobs)
    }
  }

Why does the matrix p.A contains many values that are NA ?
Visualise the results from these simulations. What do you see?

library(ggplot2)
library(reshape2)
p.A <- melt(p.A)
ggplot(p.A[!is.na(p.A$value),], aes(x = Var2, y = value, fill = factor(Var2))) +
  geom_hline(yintercept = p, linetype = "dashed", color = "darkorange", linewidth = 0.75) +
  geom_boxplot(alpha = 0.6) + 
   coord_cartesian(ylim = c(0.1, 0.35)) +
  labs(x = "N", y = "p(A)", title = "") +
  theme_classic() + theme(legend.position = "none")

Exercise 7

In the second part of these exercises, we will examine deviations from expected genotype frequencies. According to the Hardy-Weinberg principle, genotype frequencies can be predicted from allele frequencies under certain assumptions. Specifically, if allele A has a frequency of p and allele a has a frequency of q, the expected genotype frequencies are:

AA \(p^2\)
Aa \(2pq\)
aa \(q^2\)

Next, we will create a figure that visualizes the predicted genotype frequencies for different values of p. Take the time to review and understand the code used to generate it.

curve(2*x*(1-x),col="darkblue", ylim=c(0,1), lwd=2, xlab="Allele frequency", ylab="Genotype frequency", las=1)
  text(x=0.5, y=0.7, labels="Aa", col="darkblue", font=2)
curve(x**2,col="darkgreen", add=T, ylab="NULL", ylim=c(0,1), lwd=2)
  text(x=0.9, y=0.7, labels="AA", col="darkgreen", font=2)
curve((1-x)**2,col="darkred",add=T, ylab="NULL", ylim=c(0,1),lwd=2)
  text(x=0.1, y=0.7, labels="aa", col="darkred", font=2)

Part II

Exercise 8

In a study of 10,000 individuals of Danish ancestry, researchers analyzed two genetic loci, including the CFTR gene (associated with cystic fibrosis) and the APOE gene (linked to Alzheimer’s disease risk). These loci were chosen due to their well-established roles in disease susceptibility and health outcomes, providing insights into the genetic factors influencing these conditions in the population.

CFTR				APOE
WT \| WT	WT \| F508del	F508del \| F508del		ε3 \| ε3	ε3 \| ε4	ε4 \| ε4
3156	4997	1847		6837	3065	98

Compute the allele- and genotype frequencies for the two alleles in both the CFTR- and APOE-locus.
Perform the necessary calculations to determine whether the allele frequencies at the two loci follow Hardy-Weinberg equilibrium proportions.

Exercise 9

Cystic fibrosis (CF) is an autosomal, recessive disease. Homozygotes individuals (F508del|F508del) have reduced function of the salt transporters in some cell types, including the lungs. This entails mucus in the airways which increases the risk of bacterial infections.

Until recently, only few individuals survived to adulthood. In Europe, the frequency of newborns with CF has historical been 1:2500. This rate is maintained by a balance between selection (\(s\)) and mutation (μ).

In a study blood samples were collected to determine the genotype frequency of a certain CF mutation. The disease-causing allele is called F508del, while the non-disease-causing allele is called WT. The genotypes of the enrolled participants are shown in the table below.

CFTR
WT \| WT	WT \| F508del	F508del \| F508del
16890	855	18

Compute the allele frequency in the sample.
How many individuals do you expect to be born with each genotype if the sample is in Hardy-Weinberg proportions?
Is the sample in Hardy-Weinberg proportions?
Compute what the mutation rate (μ) must be, when there in the population through many generations has been maintained a frequency of 1:2500 among newborns and when the fitness (\(w=1-s\)) of CF-individuals is 0%.
Assume that a new medical treatment for CF has been developed such that patients with CF live longer and many of the CF-carriers also get children. This entails that the fitness of individuals with CF is changed to 40% compared to persons without CF. At equilibrium, what will the new frequency be of the CF-allele F508del, and what will the frequency be of newborns with CF?

Exercise 10

Out of 2,400 births at a UK hospital, 6 newborn deaths were recorded, all of which were attributed to a colon defect caused by a recessive lethal allele (co) that follows an autosomal inheritance pattern. Assuming Hardy-Weinberg equilibrium, answer the following:

What is the frequency of the co allele in the newborn population?
What is the frequency of heterozygous carriers in both newborns and adults?

Exercise 11

Red-green colour blindness is caused by a recessive, X-linked gene (f). In a large population, the frequency of the f allele is denoted as \(q\) in both males and females. Assuming random mating and no selection, answer the following:

What is the expected frequency of colour-blind females in the population?
What is the expected frequency of marriages between a colour-blind male and a non-colour-blind female?
What is the expected frequency of colour-blind boys born from a colour-blind father and a non-colour-blind mother?

Exercise 12

Now we will use R to test for Hardy-Weinberg proportion. Run every step in the code below, and explain what each step does.

First we define variables.

# define genotype counts
  AA <- 375
  Aa <- 165
  aa <- 32
  
# find the total number of individuals
  n <- AA+Aa+aa

We can calculate the allele frequencies from the observed genotype counts.

p <- (2*AA + Aa) / (2*n)
q <- c(2*aa + Aa) / (2*n)

# check that we have counted all alleles
p+q == 1

Using the allele frequencies we can compute the expected genotype frequencies if the population is in Hardy-Weinberg equlibrium.

# calculating expected values  
  E.AA <- n*p**2
  E.Aa <- n*2*p*q
  E.aa <- n*q**2

# check that we have counted all individuals
  E.AA + E.Aa + E.aa == n

We can do a formal statistical test for whether the number of genotypes we observe matches our expectations. This is done using a \(\chi^2\)-test which is defined as: \[\chi^2 = \sum_{i=1}^{geno} = \frac{(E - O)^2}{E}\] where \(E\) is the expected number of counts, and O is the observed number of counts, which is summed across all genotype categories (\(geno\)).

The \(\chi^2\)-statistican can be calculated as:

# compute chi-square test-statistic
  chi2 <- (E.AA-AA)^2/E.AA + (E.Aa-Aa)^2/E.Aa + (E.aa-aa)^2/E.aa

Next we have to determine whether the \(\chi^2\)-statistic is larger than expected. To compute the probability value (i.e., P-value) from a \(\chi^2\)-distribution we can use the function pchisq() with the argument lower.tail=FALSE because we to see the probability that the \(\chi^2\) value being higher than it is. To have an appropriate distribution, we need to add how many degrees of freedom (df) to consider, \(df=\frac{n(n1)}{n}\), where \(n=\textrm{no. alleles}\). Thus, in our example, \(df=\frac{2(2-1)}{2}=1\).

# obtain P-value
  pval <-pchisq(chi2, df=1, lower.tail=FALSE)

Therefore, the P-value is 0.018, indicating that at a 5% significance level, our sample does not support the Hardy-Weinberg equilibrium. The figure below shows the \(\chi^2\)-distribution with \(df=1\) where the vertical line shows the \(\chi^2\)-value from our analysis.

We can visualise the observed and expected genotype counts to get an idea of why the sample is not in Hardy-Weinberg.

geno <- c(AA, Aa, aa)
expe <- c(E.AA, E.Aa, E.aa)
dat <- matrix(c(geno,expe),nrow=2, byrow=T)

barplot(dat, beside=T, col=c("#6D8325FF", "#BD5630FF"), names.arg=c("AA","Aa","aa"), las=1, cex.axis=.8, xlab="Genotype", ylab="Genotype counts")
legend("topright",legend=c("Observed", "Expected"),pch=15, col=c("#6D8325FF", "#BD5630FF"), cex=.8)

What possible explanations can you provide for why this sample does not follow Hardy-Weinberg equilibrium?

Part III

Exercise 13

Given a population with an average inbreeding coefficient (\(F\)) and allele frequencies \(p\) (for allele A) and \(q\) (for allele a), what are the expected genotype frequencies at this locus? Additionally, how does inbreeding influence both allele frequencies and genotype frequencies?
Compute the inbreeding coefficient for individual K in the pedigree below.
It turns out that individual C in the pedigree above is the offspring of a cousin marriage. Redraw the pedigree focusing on individual C, and use this updated information to calculate the inbreeding coefficient (\(F\)) for individual K.

Exercise 14

The ability to taste the bitter compound phenylthiocarbamide (PTC) is controlled by a dominant allele (T), meaning individuals with at least one T allele can perceive its bitterness, while t homozygotes cannot. PTC tasting is linked to variation in the TAS2R38 gene, which encodes a bitter taste receptor on the tongue.

PTC itself is not found in foods but is chemically similar to naturally occurring bitter compounds in certain vegetables, such as broccoli, Brussels sprouts, kale, and cabbage. The ability to taste these compounds may have evolutionary significance, as it could help individuals avoid toxic plants that contain bitter-tasting alkaloids.

In a study of 100 students, 62 individuals had the TT genotype (strong tasters), 32 had the Tt genotype (moderate tasters), and 6 had the tt genotype (non-tasters). Calculate the allele frequencies of T and t in this population.
Using the observed genotype frequencies, assess whether the population is in Hardy-Weinberg equilibrium by comparing expected and observed proportions.

Exercise 15

In this R-exercise, we will investigate, how population size influence the fixation and extinction probabilities of individual alleles.

The functionsim allow you to simulate num_populations with a population size given by pop_size (which can vary between the different populations) for generations. The starting allele frequency can be specified for each simulated population with allele_freqs.

# Function to simulate allele frequency across generations for multiple populations
sim <- function(pop_size, generations, num_populations = 1, allele_freqs = 0.5) {
  
  # Create a list to store allele frequencies for each population
  freq_matrix_list <- vector("list", num_populations)
  
  # Run the simulations
  for (p in 1:num_populations) {
    allele_freq <- allele_freqs[p]
    freq_matrix <- matrix(NA, nrow = generations, ncol = 1)
    current_freq <- allele_freq
    freq_matrix[1, 1] <- current_freq  # Store the initial allele frequency
      
    for (gen in 2:generations) {
      # Simulate the allele frequency for the next generation using binomial sampling
      a <- rbinom(1, 2*pop_size, current_freq) 
      current_freq <- a/(2*pop_size)
      freq_matrix[gen, 1] <- current_freq
    }
    
    freq_matrix_list[[p]] <- freq_matrix
  }
  
  return(freq_matrix_list)
}

The function below allow us to run several simulations and collect the results.

  run_sim <- function(population_sizes=NULL, generations=NULL, num_populations=NULL, initial_allele_freqs=NULL){
    # Store the results in a data frame
    results <- data.frame()

    # Run the simulation for different population sizes
    for (pop_size in population_sizes) {
      freq_matrix_list <- sim(pop_size,generations,num_populations,allele_freqs = initial_allele_freqs)
  
      # Prepare the data for plotting
      for (p in 1:num_populations) {
        freq_matrix <- freq_matrix_list[[p]]
        temp_df <- data.frame(
          generation = rep(1:generations),
          freq = as.vector(freq_matrix),
          population_size = rep(pop_size, generations),
          population_num = rep(p, generations))
      results <- rbind(results, temp_df)
      }
    }
    
  return(results)
  }

In the first example, we simulate 10 different populations, each consisting of 100 individuals, with an initial allele frequency of 0.5. We will run the simulation for 100 generations.

# Parameters
  population_sizes <- 100
  generations <- 100
  num_populations <- 3
  initial_allele_freqs <- rep(0.5,num_populations)

  sim.res <- run_sim(population_sizes=population_sizes, 
                     generations=generations, 
                     num_populations=num_populations, 
                     initial_allele_freqs=initial_allele_freqs)

# Count how many populations that has been fixed  
  sim.res$fixed <- paste(sim.res$population_size,":",sim.res$population_num,sep="")
  is.fixed <- NULL
  for(pop in unique(sim.res$fixed)){
    is.fixed <- c(is.fixed, any(sim.res[sim.res$fixed==pop,]$freq==0 | sim.res[sim.res$fixed==pop,]$freq==1))
  }

With the code below we can plot the results.

  ggplot(sim.res, aes(x = generation, y = freq, 
                      group = interaction(population_size, population_num), 
                      color = as.factor(fixed))) +
    geom_line(alpha = 0.7) +
    labs(title = paste("Number of populations fixed:", sum(is.fixed)),
       x = "Generation",
       y = "Allele Frequency",
       color = "Population Number") +
    scale_y_continuous(limits = c(0, 1)) +
    theme_minimal() +
    theme(legend.position = "bottom")

With you own words, describe what is happening in the figure above.
In the next simulation, change population_sizes <- c(10,50,100). How does that affect the results?
How does varying the initial_allele_freqs affect the probability of fixation?

Migration and mutation

Allow migration
A fraction (migration_rate) of individuals in the population is replaced by migrants from an external source population. The new allele frequency after migration is:

\[p{\prime} = (1 - m) p + m P_m\]

\(m\) = migration_rate (proportion of migrants each generation),
\(p\) = current allele frequency in the population,
\(P_m\) = migrant_freq (allele frequency of migrants).

Allow mutations
The allele frequency update follows:

\[p{\prime} = (1 - \mu) p + \mu (1 - p)\]

\(p\) = current allele frequency,
\(\mu\) = mutation rate per generation

# Function to simulate allele frequency changes with drift, migration, and mutation
sim <- function(pop_size, generations, num_populations = 1, allele_freqs = 0.5, 
                migration_rate = 0, migrant_freq = 0.5, mu = 0) {
  
  # Create a list to store allele frequencies for each population
  freq_matrix_list <- vector("list", num_populations)
  
  # Run the simulations
  for (p in 1:num_populations) {
    allele_freq <- allele_freqs[p]
    freq_matrix <- matrix(NA, nrow = generations, ncol = 1)
    current_freq <- allele_freq
    freq_matrix[1, 1] <- current_freq  # Store the initial allele frequency
    
    for (gen in 2:generations) {
      # Step 1: Genetic Drift (Binomial Sampling)
      a <- rbinom(1, 2 * pop_size, current_freq) 
      current_freq <- a / (2 * pop_size)
      
      # Step 2: Migration Effect
      if (migration_rate > 0) {
        current_freq <- (1 - migration_rate) * current_freq + migration_rate * migrant_freq
      }
      
      # Step 3: Mutation Effect
      if (mu > 0) {
        current_freq <- (1 - mu) * current_freq + mu * (1 - current_freq)
      }
      
      freq_matrix[gen, 1] <- current_freq
    }
    
    freq_matrix_list[[p]] <- freq_matrix
  }
  
  return(freq_matrix_list)
}

Below we simulate three independent populations (num_populations <- 3) of the same size (pop_size <- 100), three diferent starting values of allele frequencies (allele_freqs <- c(0.4, 0.6, 0.8)), and a mutation-rate of 0.01.

# Setup parameters for simulations allowing mutations
  pop_size <- 100
  generations <- 50
  num_populations <- 3
  allele_freqs <- c(0.4, 0.6, 0.8)
  migration_rate <- 0
  mu <- 0.01
  
  sim.res <- sim(pop_size = pop_size, 
                 generations = generations, 
                 num_populations = num_populations, 
                 allele_freqs = allele_freqs, 
                 migration_rate = migration_rate, 
                 mu = mu)

  sim.res <- do.call(rbind,sim.res)
  sim.res <- data.frame(freq=sim.res[,1], 
                        generation=rep(1:generations, num_populations),
                        af=sort(rep(allele_freqs, generations)))

  ggplot(sim.res, aes(x = generation, y = freq, 
                      color = as.factor(af))) +
    geom_line(alpha = 0.7) +
    labs(title ="",
       x = "Generation",
       y = "Allele Frequency",
       color = "Initial allele frequency") +
    scale_y_continuous(limits = c(0, 1)) +
    theme_minimal() +
    theme(legend.position = "bottom")

How does the results from the previous simulations compare to when you allow for small - moderate levels of mutation? Fx mu=0.005,mu=0.01.
What happens when you change the starting allele frequencies?

Next, we look into the effect of migration. We run three simulations with varying degree of migration (m <- c(0,0.01,0.1)).

# Setup parameters for simulations allowing migration
  pop_size <- 100
  generations <- 50
  num_populations <- 10
  allele_freqs <- rep(0.5,10)
  mu <- 0
  
  m <- c(0,0.01,0.1)
  sim.res <- vector(length(m),mode="list")
  for(i in 1:length(m)){
  
    tmp.res <- sim(pop_size = pop_size, 
                 generations = generations, 
                 num_populations = num_populations, 
                 allele_freqs = allele_freqs, 
                 migration_rate = m[i], 
                 mu = mu)

    tmp.res <- do.call(rbind,tmp.res)
    tmp.res <- data.frame(freq=tmp.res[,1], 
                        generation=rep(1:generations, num_populations),
                        pop=sort(rep(1:num_populations,generations)),
                        migration=rep(m[i],nrow(tmp.res)))

    sim.res[[i]] <- tmp.res  
  }
sim.res <- do.call(rbind,sim.res)

 ggplot(sim.res, aes(x = generation, y = freq, 
                      color = as.factor(pop))) +
      facet_wrap(~ migration) +
      geom_line(alpha = 0.7) +
      labs(title ="",
         x = "Generation",
         y = "Allele Frequency",
         color = "Population") +
      scale_y_continuous(limits = c(0, 1)) +
      theme_minimal() +
      theme(legend.position = "bottom")

What conclusion can you draw from these simulations?
Finally, try to set up a scenario where you combine both migration and mutation. Which forces have the strongest effect?