MS Thesis Defense

A Specialist Approach for Classification of Column Data

Nikhil Puranik

1:30pm Friday 24 August, 2012, 325b ITE, UMBC

Much information is encoded in spreadsheets, databases, and tables on the Web and in documents. Interpreting this content and making its meaning explicit in a representation language like RDF enables many applications. This thesis addresses the problem of identifying the semantic type of the information represented in a table column containing conventionally encoded data such as phone numbers or stock ticker symbols. We describe a ‘specialist’ approach to classification in which different specialists work together to produce a ranked list of candidate types for a given input column. We use three types of specialists: those based on regular expressions, dictionaries, and classifiers. We discuss both a serial and a parallel framework for the specialists. We evaluate our system in two ways: by testing individual specialists for accuracy and by testing the performance of the overall system in generating ranked lists. We also discuss the scalability of the system, both in terms of adding new specialists and in terms of the performance impact for systems with hundreds of specialists.
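
To make the idea of specialists producing a ranked list concrete, here is a minimal Python sketch. It is an illustration only, not the system described in the thesis: the function names, the scoring rule (fraction of cells a specialist accepts), the phone-number pattern, and the tiny ticker vocabulary are all assumptions, and classifier-based specialists are omitted. The specialists are run one after another, roughly in the spirit of a serial arrangement.

```python
import re

def regex_specialist(pattern):
    """Build a hypothetical specialist that scores a column by the
    fraction of cells fully matching a regular expression."""
    compiled = re.compile(pattern)

    def score(cells):
        if not cells:
            return 0.0
        hits = sum(1 for c in cells if compiled.fullmatch(c.strip()))
        return hits / len(cells)

    return score

def dictionary_specialist(vocabulary):
    """Build a hypothetical specialist that scores a column by the
    fraction of cells found in a known vocabulary (case-insensitive)."""
    vocab = {v.upper() for v in vocabulary}

    def score(cells):
        if not cells:
            return 0.0
        hits = sum(1 for c in cells if c.strip().upper() in vocab)
        return hits / len(cells)

    return score

# Illustrative registry; the type names, pattern, and vocabulary are assumptions.
SPECIALISTS = {
    "phone_number": regex_specialist(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
    "stock_ticker": dictionary_specialist({"IBM", "AAPL", "GOOG", "MSFT"}),
}

def rank_column(cells, specialists=SPECIALISTS):
    """Run every specialist in turn and return (type, score) pairs,
    highest score first."""
    scores = [(name, spec(cells)) for name, spec in specialists.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    print(rank_column(["410-455-1000", "(410) 455-3000", "4104552000"]))
    print(rank_column(["IBM", "aapl", "XYZ1"]))
```

In this sketch, adding a new semantic type means adding one entry to the registry, which is one simple way to think about the scalability question raised in the abstract.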

Committee: Drs. Tim Finin (chair), Anupam Joshi, Tim Oates and Yelena Yesha