Suffix trees for very large strings

Dr. Marina Barsky
University of Illinois at Urbana-Champaign

1:00pm Friday, 30 March 2012, ITE 325b, UMBC

The seminar is dedicated to the construction of suffix trees in external memory. A suffix tree is a compact index of all substrings of a given text. While being asymptotically linear in the size of the input, in practice, suffix trees can easily be 50 times larger than the input. As such, suffix trees often exceed typical main memory sizes, even when the input does not. As most existing algorithms are designed for RAM, their performance severely degrades when the tree and/or input do not fit in main memory. So far, this has prevented the wide application of suffix trees for the analysis of massive string collections.

We will look at new advanced methods of suffix tree construction which circumvent memory concerns and allow us to construct suffix trees for inputs of any size using secondary storage (magnetic disks). We will also discuss how this disk-based index can be used for facilitating the pattern discovery in sequential data.

Dr. Marina Barsky is currently a Post-Doctoral fellow in the Department of Computer Science at the University of Illinois at Urbana-Champaign, IL. She received her PhD in Computer Science from the University of Victoria, British Columbia, Canada in 2010. A large part of her post-graduate research was dedicated to pattern discovery in string data, primarily in massive DNA databases. She is currently expanding her expertise in database systems to new areas such as index management and data mining. Her research interests include data mining of sequential data, information networks, and teaching of computer science through interactive interfaces.

Host: Richard Chang