Curating and Combining Big Data from Genetic Studies.
Big data curation is often underappreciated by users of processed data. With the development of high-throughput genotyping technology, large-scale genome-wide data are available for analysis of genetic association with disease. In this project, we describe a data-curation protocol to deal with genotyping errors and missing values in genetic data. We obtain publicly available genetic data from three studies in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), and with the aid of the freely available HapMap3 reference panel, we improve the quality and size of the ADNI genetic data. We use the software PLINK to manage data formats, SHAPEIT to check DNA strand alignment and to phase the genetic markers (that is, to infer which alleles were inherited together from the same parent), IMPUTE2 to impute missing SNP genotypes, and GTOOL to merge files and convert file formats. After merging the genetic data across these studies, we also use the reference panel to investigate the population structure of the processed data. ADNI participants are recruited in the U.S., where the majority of the population are descendants of relatively recent immigrants. We use principal component analysis to understand the population structure of the participants, and model-based clustering to investigate the genetic composition of each participant and compare it with self-reported ethnicity information. This project is intended to serve as a guide to future users of the processed data.
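To illustrate the principal component step, the following is a minimal sketch rather than the protocol used in this project. It assumes genotypes are coded as minor-allele counts (0, 1, 2) in a samples-by-SNPs matrix and uses simulated data in place of the ADNI and HapMap3 genotypes. The function name genotype_pca, the allele-frequency scaling, and all simulation settings are hypothetical choices made for the example.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project samples onto the top principal components of a genotype matrix.

    genotypes: (n_samples, n_snps) array of minor-allele counts in {0, 1, 2}.
    Returns an (n_samples, n_components) array of principal component scores.
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0              # estimated allele frequency per SNP
    keep = (freq > 0.0) & (freq < 1.0)       # drop SNPs with no variation
    g, freq = g[:, keep], freq[keep]
    # Center each SNP and scale by its expected binomial standard deviation.
    z = (g - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Simulated stand-in data (an assumption): two groups whose allele
# frequencies differ slightly at every SNP.
rng = np.random.default_rng(0)
n_per_group, n_snps = 50, 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
group_a = rng.binomial(2, freq_a, size=(n_per_group, n_snps))
group_b = rng.binomial(2, freq_b, size=(n_per_group, n_snps))

scores = genotype_pca(np.vstack([group_a, group_b]))
pc1_a = scores[:n_per_group, 0]
pc1_b = scores[n_per_group:, 0]
print("mean PC1, group A:", pc1_a.mean())
print("mean PC1, group B:", pc1_b.mean())
```

In this simulation the two groups fall on opposite sides of the first component. With real data, the scores would typically be plotted together with reference-panel samples so that study participants can be compared against populations of known ancestry.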