Sustaining Large-Scale Scientific Applications in the Presence of Failure


Large computational science problems, such as weather forecasting and molecular modeling, exceed the capabilities of conventional computing systems. High Performance Computing (HPC) systems (often referred to as supercomputers) consist of thousands of computers working in concert to solve these computationally intensive scientific problems. As HPC systems grow larger, computer scientists are asked to help solve increasingly complex issues, such as managing high levels of concurrency, recovering from failures, and debugging and tuning thousands of concurrent processes. In this talk, I will introduce some of the computational science domains pushing the boundaries of HPC systems, and the structure of those systems. From the many complex issues facing modern HPC systems, I will focus on those related to fault tolerance and software assurance. I will discuss some established and emerging fault tolerance techniques used to sustain scientific applications in the presence of failure, and present some recent results. I will conclude by discussing some future research directions.
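One of the most established fault tolerance techniques alluded to above is checkpoint/restart: an application periodically saves its state to stable storage so that, after a failure, it can resume from the last checkpoint instead of starting over. The sketch below is a minimal single-process illustration of that idea; the file name, state layout, and atomic-rename strategy are illustrative assumptions, not details from the talk.

```python
import os
import pickle

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file name

def save_checkpoint(step, state):
    # Write atomically: dump to a temp file, then rename over the
    # old checkpoint so a crash mid-write never leaves a corrupt file.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump((step, state), f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    # Resume from the saved state if a checkpoint exists,
    # otherwise begin from the initial state.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return 0, 0.0  # fresh start: step 0, empty accumulator

# Simulated iterative computation: if the process is killed and
# relaunched, it picks up at the last checkpointed step.
start, total = load_checkpoint()
for step in range(start, 10):
    total += step  # one unit of work per iteration
    save_checkpoint(step + 1, total)

print(total)  # → 45.0
```

Real HPC checkpointing must coordinate this across thousands of processes (e.g. via MPI) and balance checkpoint frequency against I/O cost, but the core resume-from-saved-state pattern is the same.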

Supplementary Materials

Below are links to the works cited during the seminar, organized chronologically in the order they were presented.