1117 - Kelly Burkette

Abstract


The variation observed in genes in the human genome is a result of stochastic evolutionary processes
such as mutation and recombination acting over time. The gene genealogy for a sample of genes
from unrelated individuals is a tree describing these ancestral events and relationships. Individuals
who are more closely related would be expected to share copies of genes that are similar to each
other. Knowledge of the tree is useful in population genetics, where it can be used to infer
parameters such as the mutation or recombination rate. The genealogical tree may also be useful in
assessing association between a trait or outcome and a genomic location, since individuals with similar
trait values will tend to be more closely related genetically if they share a mutation that influences
the trait. However, the time scale of genealogical trees is on the order of tens of
thousands of years, so the true underlying tree for a random sample of genes from a population
cannot be known.
    To incorporate genealogical trees into genetic applications, it is therefore necessary to
model the distribution of the tree conditional on the genetic data observed at present. A model for gene
genealogies unconditional on observed data, called the coalescent model, has been well studied and
can be used to simulate sequence data. However, it is less straightforward to model the genealogical
trees that could have given rise to a particular sample. Markov chain Monte Carlo (MCMC) is one
technique for concentrating sampling on the trees that are likely given the observed data.
    In this thesis, we describe our MCMC-based genealogy sampler and present examples of how
it can be used to estimate means of tree statistics of interest. First, we describe the sampler that
assumes that haplotype data are available. Our implementation is based on the sampler described
in Zöllner and Pritchard (2005). However, during implementation, we made several changes to
increase the efficiency of sampling. We illustrate the use of our sampler on haplotype data from
a publicly available dataset, examining statistics that summarize the degree to which case
haplotypes are more closely related to each other than control haplotypes are.