Ph.D. Dissertation Defense
Computer Science and Electrical Engineering
University of Maryland, Baltimore County

Finding Story Chains and Creating Story Maps in Newswire Articles

Xianshu Zhu

10:00am-12:00pm, Monday, 25 November 2013, ITE 325B

Huge numbers of news articles about events are published on the Internet every day. This flood of information can easily swamp readers and seems to produce more pain than gain. While excellent search engines such as Google, Yahoo, and Bing help us retrieve information from a few keywords, information overload still makes it hard to understand how a news story evolves. Conventional search engines display unstructured search results, ranked by relevance using keyword-based ranking methods and other, more complicated ranking algorithms. However, when it comes to searching for a story (a sequence of events), none of these ranking algorithms can organize the search results by the evolution of the story. Limitations of unstructured search results include: (1) Lack of the big picture on complex stories. News articles tend to describe a news story from different perspectives, so for complex stories users can spend significant time looking through unstructured search results without being able to see the big picture. For instance, Hurricane Katrina struck New Orleans on August 29, 2005. By typing “Hurricane Katrina” into Google, people can get a great deal of information about the event and its impact on the economy, health, government policies, and so on. However, they may struggle to sort that information into a story chain that tells how, for example, Hurricane Katrina has impacted government policies. (2) Difficulty finding hidden relationships between two events. The connections between news events are sometimes extremely complicated and implicit, and it is hard for users to discover them without a thorough investigation of the search results.

In this dissertation, we seek to extend the capability of existing search engines to output coherent story chains and story maps (maps that show the various perspectives on a news event), rather than loosely connected pieces of information. In this way, people can obtain a better understanding of a news story, capture its big picture quickly, and discover hidden relationships between news events. First, algorithms for finding story chains have two advantages: (1) they can show how two events are correlated by finding a chain of events that coherently connects them, which helps people discover hidden relationships between the two events; (2) they allow users to search with complex queries such as “how is event A related to event B”, which do not work well on conventional keyword-based search engines. Second, creating story maps by finding different perspectives on a news story and grouping news articles by those perspectives can help users better capture the big picture of the story and suggest directions they can pursue further. From a functionality point of view, a story map is similar to the table of contents of a book: it gives users a high-level overview of the story and guides them as they read the news.

The specific contributions of this dissertation are: (1) Various algorithms for finding story chains, including: (a) a random walk based story chain algorithm; (b) a co-clustering based story chain algorithm, which further improves the story chains by grouping semantically close words together and propagating the relevance of word nodes to document nodes; (c) an algorithm that finds story chains by extracting multi-dimensional event profiles from unstructured news articles, which aims to better capture relationships among news events. This algorithm significantly improves the quality of the story chains. (2) An algorithm for creating story maps that uses Wikipedia as the knowledge base. News articles are represented as a bag-of-aspects instead of a bag-of-words. The bag-of-aspects representation allows users to search news articles through different aspects of a news event rather than through simple keyword matching.
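As a rough illustration of the idea behind contribution (1a), the sketch below runs a random walk with restart over a bipartite graph of article nodes and word nodes, then ranks articles by how strongly they are connected to a starting article. It is not the dissertation's algorithm: the graph construction, the restart probability, the iteration count, the function names, and the four toy articles are all assumptions made for this example. It also assumes every article contains at least one word.

```python
def build_bipartite_graph(docs):
    """Link each article node ("d", id) to a word node ("w", word) for every word it contains."""
    graph = {}
    for doc_id, words in docs.items():
        doc_node = ("d", doc_id)
        for word in set(words):
            word_node = ("w", word)
            graph.setdefault(doc_node, set()).add(word_node)
            graph.setdefault(word_node, set()).add(doc_node)
    return graph


def random_walk_with_restart(graph, start, restart=0.15, iterations=50):
    """Approximate the visiting probability of each node for a walker that
    jumps back to `start` with probability `restart` at every step."""
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            # Spread the non-restarting share of this node's mass evenly over its neighbors.
            share = (1.0 - restart) * mass / len(graph[node])
            for neighbor in graph[node]:
                nxt[neighbor] += share
        nxt[start] += restart  # the restarting share returns to the start article
        scores = nxt
    return scores


# Hypothetical toy articles, reduced to a few keywords each.
docs = {
    "a1": ["hurricane", "katrina", "landfall"],
    "a2": ["katrina", "levee", "flood"],
    "a3": ["levee", "policy", "congress"],
    "a4": ["football", "season"],
}

graph = build_bipartite_graph(docs)
start = ("d", "a1")
scores = random_walk_with_restart(graph, start)

# Rank the other articles by their score relative to the start article.
ranked = sorted(
    (node for node in scores if node[0] == "d" and node != start),
    key=lambda node: scores[node],
    reverse=True,
)
for node in ranked:
    print(node[1], round(scores[node], 4))
```

In this toy setup, an article that shares a word with the start article scores higher than one that is only reachable through an intermediate article, and an unrelated article receives no score at all. A story chain could then be assembled by repeatedly stepping to a highly scored neighbor, although how the dissertation actually selects and orders chain links is not specified here.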

Committee: Drs. Tim Oates (chair), Tim Finin, Charles Nicholas, Sergei Nirenburg and Doug Oard