Correlation Aware Optimizations for Analytic Databases

Hideaki Kimura, Brown University

1:00pm Friday 9 March 2012, ITE 325b, UMBC

Recent years have seen that the analysis of large data-sets is crucially important in a wide range of business, governmental, and scientific applications. For example, research projects in astronomy need to analyze petabytes of image data taken from telescopes. Providing a fast and scalable analytical data management system for such users has become increasingly important.

The major bottlenecks for analytics on such big data are disk- and network-I/O. Because the data is too large to fit in RAM, each query causes substantial disk I/O. Traditional database systems provide indexes to speed up disk reads, but many analytic queries do not benefit from indexes because data is scattered over a large number of disk blocks and disk seeks are prohibitively expensive. Furthermore, such huge data sets need to be partitioned and distributed over hundreds or many thousands of nodes. When a query requires more than one data at once, such as a query involving a JOIN operation, the data management system must transmit a large amount of data over the network. For example, the Shuffle phase in Map-Reduce systems copies file blocks over the network and causes a significant bottleneck in many cases.

Our approach to tackling these challenges in big data analytics is to exploit correlations. I will describe our correlation-aware indexing, replication, and data placement which make big data analytics faster and more scalable.

Finally, if time allows, I will also introduce another on-going project to develop a scalable transactional processing system on modern hardware in collaboration with Hewlett-Packard Laboratories.

Hideaki Kimura is a doctoral candidate in the Computer Science Department at Brown University. His main research interests are in data management systems. His dissertation research with Prof. Stan Zdonik is on correlation-based optimizations for large analytic databases. He also worked on transaction processing systems exploiting modern hardware at HP Labs.

Host: Anupam Joshi
See http://www.csee.umbc.edu/talks for more information