A Scalable, Fault Tolerant Programming Model for
Developing Data Intensive Parallel Applications

Tyler A. Simon
Faculty Research Scientist
UMBC Center for Hybrid Multicore Productivity Research

1:00pm Friday 17 February 2012, ITE 325b

Future exascale computing systems will have to execute a single program on the order of 10^8-10^9 individual, low-power processing elements. These processors need to be fed data efficiently and reliably throughout a parallel computation. Current methods for explicit message passing between processors provide little fault tolerance support, and system-level and application-level checkpoint/restart incur unreasonable overheads on exascale-class computing systems.

We propose the development of a novel autonomic execution model and an Adaptive Runtime Resource for Intensive Applications (ARRIA), which together improve application reliability, scalability and performance while freeing the programmer from explicit message passing. Experiments were conducted to evaluate ARRIA's capabilities on data intensive applications, those where the majority of execution time is spent reading from and writing to local or remote memory locations. In our approach, we focus on managing data movement for the application at runtime, both within a compute node and across a cluster of nodes. We use a hybrid "threaded data parallel" model in which message passing is hidden entirely from the programmer and parallel tasks are bundled and farmed out to a dynamic resource pool for execution.

Tyler Simon is a Faculty Research Assistant and PhD student working for the Center for Hybrid Multicore Productivity Research in the Computer Science and Electrical Engineering Department at the University of Maryland Baltimore County. He is also a computational scientist at the NASA Center for Climate Simulation at Goddard Space Flight Center. Tyler is interested in the theoretical and practical aspects of concurrency and parallel computation in general. His research is focused on what can be done with the effective application of distributed, parallel algorithms in high performance computing environments, particularly in parallel numerical methods and data movement.

Host: Yelena Yesha