MS Thesis Defense

Integrating Domain Knowledge in Supervised Machine Learning
to Assess the Risk of Breast Cancer Using Genomic Data

Aniket Bochare

9:00am Friday 29 June 2012, ITE 325b

Breast cancer is the most common form of cancer in women, comprising 22.9% of invasive cancers in women and 16% of all female cancers. Currently, treatment decisions are based primarily on clinical parameters, with little use of genomic data. Our study uses single nucleotide polymorphism (SNP) data from postmenopausal women of European descent to assess the risk of developing breast cancer. We used various supervised machine learning and data mining techniques to generate a model that predicts breast cancer risk using only genomic data.

In this research we propose an approach that selects the nine best SNPs using several feature selection algorithms to improve binary classification accuracy, and we validate our results against the existing literature. A machine learning model generated without domain knowledge yielded poor prediction results. After adding domain knowledge about 11 SNPs to the original training set, we performed classification using the best features obtained by the feature selection techniques. The model generated using both the domain knowledge and the feature selection techniques performed much better than the naive classification approach.
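
As a rough illustration of the feature selection step described above, the sketch below ranks SNPs by a Pearson chi-square score against the case/control label and keeps the top nine. The abstract does not say which feature selection algorithms were used, so the chi-square filter is a stand-in, not the thesis method. The data is synthetic, and the function names (chi_square_score, select_top_snps), the 0/1/2 genotype coding, and the planted signal at SNP 0 are all assumptions made for the example.

```python
import random
from collections import Counter


def chi_square_score(genotypes, labels):
    """Pearson chi-square statistic for a genotype-by-class contingency table."""
    n = len(labels)
    observed = Counter(zip(genotypes, labels))
    genotype_totals = Counter(genotypes)
    label_totals = Counter(labels)
    score = 0.0
    for g, g_count in genotype_totals.items():
        for c, c_count in label_totals.items():
            expected = g_count * c_count / n
            score += (observed[(g, c)] - expected) ** 2 / expected
    return score


def select_top_snps(X, y, k):
    """Return the sorted column indices of the k highest-scoring SNPs."""
    n_snps = len(X[0])
    scores = [chi_square_score([row[j] for row in X], y) for j in range(n_snps)]
    ranked = sorted(range(n_snps), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Synthetic stand-in data (not the study data): 600 samples, 30 SNPs,
# genotypes coded as 0, 1, or 2 copies of the minor allele.
random.seed(0)
n_samples, n_snps = 600, 30
y = [i % 2 for i in range(n_samples)]  # 0 = control, 1 = case
X = []
for label in y:
    row = [random.choice([0, 1, 2]) for _ in range(n_snps)]
    # Hypothetical planted signal: SNP 0 is fully determined by the label.
    row[0] = 2 * label
    X.append(row)

selected = select_top_snps(X, y, k=9)
print(selected)
```

In a workflow like the one the abstract describes, the selected columns would then feed a binary classifier, with any SNPs known from the literature added to the feature set before training.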

Committee: Drs. Yelena Yesha (chair), Anupam Joshi, Aryya Gangopadhyay and Michael Grasso