Ph.D. Thesis Defense Announcement

Data Intensive Scientific Compute Model for Multicore Clusters

Phuong Nguyen

10:00am 21 November 2012, ITE 325B

Data-intensive computing holds the promise of major scientific breakthroughs and discoveries through the exploration and mining of the massive data sets becoming available to the science community. This expectation has led to tremendous growth in data-intensive scientific applications. However, such applications still face severe challenges in accessing, managing, and analyzing petabytes of data. In particular, the workflow systems that support them are not efficient when dealing with thousands or more complex tasks per job running across large, high-performance multicore clusters with very large amounts of streaming data. Scheduling, it turns out, is an integral workflow component, both in executing the often thousands or more tasks within a data-intensive scientific application and in managing the access and flow of many jobs to the available resources. Recently, MapReduce systems such as Hadoop have proven successful for many data-intensive business problems. However, MapReduce systems still have many limitations for data-intensive scientific problems, mainly because they do not support the characteristics of scientific applications, such as scientific data formats, specialized data analysis tools (e.g., math libraries), accuracy requirements, and interfaces with non-MapReduce components.

This thesis addresses some of these limitations by proposing a MapReduce workflow model and its runtime system, built on Hadoop, for orchestrating MapReduce jobs in data-intensive scientific workflows. A novel heuristic-based scheduling algorithm is proposed within the workflow system to manage the parallel execution of data-intensive scientific applications. The thesis develops a hybrid MapReduce scheduling algorithm based on dynamic priorities and proportional resource sharing, which reduces delays for variable-length concurrent tasks and takes advantage of data locality. The result is a new scheduling policy, Balanced Closer to Finish First (BCFF), proposed as a solution to several scheduling problems in the MapReduce environment. The scheduling algorithm is implemented in the Hadoop 1.0.1 framework and is available as a new Hadoop plug-in scheduler. Evaluations of the workflow system on a climate data processing and analysis application (a multi-terabyte dataset) show that the approach is feasible and significantly outperforms a traditional parallel processing method. The scientific results of the application provide a new source for monitoring global climate change over the decade 2002-2011.
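To make the scheduling idea concrete, the Java sketch below illustrates how a closer-to-finish-first heuristic with proportional resource sharing and a data-locality preference might rank candidate tasks. It is a minimal illustration only, not the thesis's BCFF implementation; all class names, field names, and weighting constants here are hypothetical.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    /*
     * Illustrative sketch of a "closer-to-finish-first" scheduling
     * heuristic with proportional sharing and data-locality preference,
     * in the spirit of the BCFF policy described above. Hypothetical
     * names and constants; not the actual plug-in scheduler.
     */
    public class BcffSketch {

        /* A candidate task with the information the heuristic needs. */
        static class Task {
            final String id;
            final double remainingWork;  // estimated seconds left to finish
            final boolean dataLocal;     // input block resides on this node
            final double jobShare;       // job's proportional share (0..1)

            Task(String id, double remainingWork,
                 boolean dataLocal, double jobShare) {
                this.id = id;
                this.remainingWork = remainingWork;
                this.dataLocal = dataLocal;
                this.jobShare = jobShare;
            }

            /*
             * Dynamic score: tasks closer to finishing score lower (better),
             * data-local tasks get a bonus, and dividing by the job's share
             * favors jobs entitled to more of the cluster, so no job starves.
             */
            double score() {
                double localityBonus = dataLocal ? 0.5 : 1.0;
                return (remainingWork * localityBonus)
                        / Math.max(jobShare, 1e-9);
            }
        }

        /* Pick the next task to launch on a free slot: lowest score wins. */
        static Task pickNext(List<Task> candidates) {
            return candidates.stream()
                    .min(Comparator.comparingDouble(Task::score))
                    .orElse(null);
        }

        public static void main(String[] args) {
            List<Task> tasks = new ArrayList<>();
            tasks.add(new Task("map-07", 120.0, false, 0.25));
            tasks.add(new Task("map-12",  30.0, true,  0.25)); // nearly done, local
            tasks.add(new Task("map-03",  45.0, false, 0.50));
            System.out.println("Next task: " + pickNext(tasks).id); // map-12
        }
    }

In Hadoop 1.x, a policy like this would be packaged as a pluggable scheduler extending the framework's TaskScheduler class, which is how the stock FairScheduler and CapacityScheduler are shipped; the thesis's BCFF scheduler takes the same plug-in form.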

Thesis Committee:

  • Prof. Milton Halem (Chair)
  • Prof. Yelena Yesha (Co-Chair)
  • Prof. Tim Finin
  • Prof. Yaacov Yesha
  • Prof. Tarek El-Ghazawi (George Washington University)