Master's Thesis Defense

Heart Disease Prediction: A Data Mining Approach

Soma Das

2:00pm Monday, 23 April 2012, ITE 201B

Data mining is a field of computer science that combines statistical analysis and machine learning to detect hard-to-discern patterns in large amounts of data. It employs a range of algorithms that learn patterns from training data or experience and apply them to classify, predict, or identify new cases. The healthcare environment is very information rich: a wealth of clinical data is available within healthcare systems and, owing to recent advances in genomic research, vast amounts of genetic data are available as well. Effective analysis tools are needed to discover hidden relationships and trends in these data and to use the derived knowledge to correctly identify people at risk of disease.

We used data mining techniques to evaluate the interaction between traditional risk factors and gene variants, specifically Single Nucleotide Polymorphisms (SNPs), in susceptibility to Coronary Heart Disease (CHD) in a prospective study of adults aged 65 and older. The thesis asks two questions: whether CHD can be predicted at birth, that is, from genetic information alone, and whether adding genetic information to traditional risk factors predicts CHD better than traditional risk factors alone. We also analyzed two popular machine learning algorithms to determine which is the more effective for mining medical datasets, evaluating them on a set of performance metrics. Finally, we applied a clustering method to identify subgroups present in the selected datasets.

We chose eight traditional risk factors for CHD and 23 SNPs previously reported to be associated with CHD, and tested the association of these SNPs with CHD in the Cardiovascular Health Study (CHS). Based on previous studies, we pre-specified a risk allele for each of the 23 SNPs. We assigned coding values to the risk-allele homozygote, heterozygote, and non-risk homozygote genotypes of each SNP and then combined these with the traditional risk factors for each individual before feeding them to the machine learning algorithms. We evaluated the classification algorithms using 10-fold cross-validation.
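The coding and evaluation steps described above can be sketched roughly as follows. This is only an illustration, not the thesis code: the abstract does not state the actual coding values, so the sketch assumes an additive coding (the number of copies of the risk allele: 0, 1, or 2), and the function names `encode_genotype`, `build_feature_vector`, and `k_fold_indices` are hypothetical.

```python
import random


def encode_genotype(genotype, risk_allele):
    """Return 0, 1, or 2: the number of risk alleles in a two-letter genotype.

    Assumed additive coding; the thesis may have used different values.
    """
    if len(genotype) != 2:
        raise ValueError(f"expected a two-allele genotype, got {genotype!r}")
    return sum(1 for allele in genotype if allele == risk_allele)


def build_feature_vector(risk_factors, genotypes, risk_alleles):
    """Append the coded SNP values to one individual's traditional risk factors."""
    coded = [encode_genotype(g, a) for g, a in zip(genotypes, risk_alleles)]
    return list(risk_factors) + coded


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

Each individual is held out in exactly one of the ten test folds, so every prediction used for evaluation comes from a model that never saw that individual during training.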

Receiver Operating Characteristic (ROC) curves were plotted separately for traditional risk factors alone and for traditional risk factors plus SNPs. The increase in the Area Under the Curve (AUC) was statistically significant for Whites and suggestive of improved CHD prediction for African Americans. We also found that SNPs alone predict CHD slightly better than random guessing, and only for Whites. The results of the analysis suggest that Naïve Bayes is the best classifier for the given domain.
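For readers unfamiliar with the AUC, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen CHD case receives a higher predicted score than a randomly chosen non-case, with ties counted as one half. The function name `roc_auc` is hypothetical and this is not the tooling used in the thesis; it is included only to make concrete why a value of 0.5 corresponds to random guessing.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.

    labels: 1 for a case (CHD), 0 for a non-case.
    scores: predicted risk, higher meaning more likely to be a case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # case ranked above non-case
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))
```

Comparing two models then amounts to computing `roc_auc` on the same held-out individuals with each model's scores, for example one model trained on traditional risk factors alone and one trained on traditional risk factors plus SNPs. Whether a difference between the two values is statistically significant requires a separate test that this sketch does not cover.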

This study demonstrates the concept of using multiple SNPs as independent risk factors and indicates that doing so can improve prediction of incident CHD. Adding SNPs to traditional risk factors did not improve the prediction model as dramatically as we expected, but the improvement was statistically significant.

Committee:

  • Dr. Michael Grasso (co-chair)
  • Dr. Anupam Joshi
  • Dr. Yelena Yesha