Ph.D. Dissertation Proposal

MLVFS: A Scalable Storage System For Managing Big Scientific Data

Navid Golpayegani

3:00-5:00pm Tuesday 24 February 2015, ITE 346

Managing peta or exabytes of data with hundreds of millions to billions of files is a necessary first step towards an effective big data computing and collaboration environment for distributed systems. Current file system designs have focused on providing better and faster data distribution. Managing the directory structure for data discovery becomes an essential element of the scalability problems for big data systems. Recent designs are addressing the challenge of exponential growth of files. Still largely unexplored is the research for dealing with the organizational aspect of managing big data systems with hundreds of millions of files. Most file systems organize data into static directory structures making data discovery, when dealing with large data sets, hard and slow.

This thesis will propose a unique Multiview Lightweight Virtual File System (MLVFS) design to primarily deal with the data organizational management problem in big data file systems. MLVFS is capable of the dynamic generation of directory structures to create multiple views of the same data set. With multiple views, the storage system is capable of organizing available data sets by differing criteria such as location or date without the need to replicate data or use symbolic links. In ad- dition, MLVFS addresses scalability issues associated with the growth of the stored files by removing the internal metadata system and replacing it with generally avail- able external metadata information (i.e. data base servers, project compute servers, remote repositories, etc.). This thesis, moreover, proposes to add, plug in capabilities not normally found in file systems that make this system highly flexible, in terms of specifying sources of meta data information, dynamic file format streaming and other file handling features.

The performance of MLVFS will be tested in both simulated environments as well as real world environments. MLVFS will be installed on the BlueWave cluster at UMBC for simulated load testing to measure the performance for various loads. Simultaneously, stable version of MLVFS will run in real world production environ- ments such as those of the NASA MODIS instrument processing system (MODAPS). The MODAPS system will be used to show examples of real world use cases for MLVFS. Additionally, there will be other systems explored for the real world use of MLVFS, such as at NIST for research into Biomedical Image Stitching.

Committee: Drs. Milton Halem (Chair, Advisor), Yelena Yesha, Charles Nicholas, John Dorband, Daniel Duffy