Title:
Sustaining Large-Scale Scientific Applications in the Presence of Failure
Abstract
Large computational science problems, like weather forecasting and molecular modeling, exceed the boundaries of conventional computing systems. High Performance Computing (HPC) systems (often referred to as supercomputers) consist of thousands of computers all working in concert to solve these computationally intensive scientific problems. As HPC systems grow larger, computer scientists are asked to help solve increasingly complex issues such as managing high levels of concurrency, recovering from failures, and debugging and tuning thousands of concurrent processes. In this talk, I will introduce some of the computational science domains pushing the boundaries of HPC systems, and the structure of those systems. Of the many complex issues facing modern HPC systems, I will focus on those related to fault tolerance and software assurance. I will discuss some established and emerging fault tolerance techniques used to sustain scientific applications in the presence of failure, and present some recent results. I will conclude by discussing some future research directions.
Supplementary Materials
Below are links to the works cited during the seminar, listed in the order in which they were presented.
- Overview
- President's Information Technology Advisory Committee, "Computational Science: Ensuring America's Competitiveness," 2005. [Link]
- Astrophysics
- Patrik Jonsson, Greg Novak, Joel Primack, UC Santa Cruz, 2008. [Link]
- Jennifer Lotz, Joel R. Primack, "Astronomers Pin Down Galaxy Collision Rates by comparing Hubble Space Telescope Photographs to Supercomputer Simulations," 2011. [Link]
- FLASH Center for Computational Sciences. [Link]
- "Catch the wave," 2012. [Link]
- Weather Forecasting and Climate Simulation
- Prabhat (LBL), Michael Wehner (LBL), Wes Bethel (LBL), "Hurricane Season," SciDAC 2010. [Link]
- National Center for Atmospheric Research (NCAR), "New Computer Model Advances Climate Change Research," 2010. [Link]
- Oak Ridge National Laboratory Everest PowerWall display. [Link] [Link]
- Molecular Dynamics
- Folding@Home Project. [Link]
- "Researchers Show How Proteins Help DNA Replicate Past a Damaged Site," 2011. [Link]
- SciDAC Review, "Modeling the Molecular Basis of Parkinson's Disease," 2007. [Link]
- "Supercomputers Simulate the Molecular Machines that Replicate and Repair DNA," 2010. [Link]
- Other Computational Science Application Domains
- Pixar, "RenderMan: The Technology Behind the Art," 2012. [Link]
- NOAA Center for Tsunami Research, "Japan (East Coast of Honshu) Tsunami, March 11, 2011," 2011. [Link]
- Southern California Earthquake Center, "SCEC's M8 earthquake simulation breaks computational records, promises faster and more detailed models of future earthquake," 2010. [Link]
- SciDAC Visualization Night 2011. [Link]
- HPC Systems
- Top 500 Supercomputer Sites. [Link]
- A. Petitet, R. C. Whaley, J. Dongarra, A. Cleary, "HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers," 2008. [Link]
- Oak Ridge Leadership Computing Facility (OLCF) Jaguar supercomputer. [Link]
- HPC Software
- The Trilinos Project. [Link]
- ScaLAPACK - Scalable Linear Algebra PACKage. [Link]
- Message Passing Interface (MPI) Forum. [Link]
- HPC Problems
- Advanced Scientific Computing Research (ASCR) Scientific Discovery through Advanced Computing (SciDAC), "The Challenges of Exascale," 2011. [Link]
- International Exascale Software Project. [Link]
- Fault Tolerance: Replication
- K. Ferreira, R. Riesen, R. Oldfield, J. Stearley, J. Laros, K. Pedretti, R. Brightwell, "rMPI: Increasing Fault Resiliency in a Message-Passing Environment," Sandia Report SAND2011-2488, 2011. [Link]
- Fault Tolerance: Checkpoint/Restart
- J. Hursey, "Coordinated Checkpoint/Restart Process Fault Tolerance for MPI Applications on HPC Systems," Ph.D. Thesis, Indiana University, 2010. [Link]
- Fault Tolerance: Algorithm-Based Fault Tolerance (ABFT)
- K.-H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, 1984. [Link]
- HPC Software Ecosystem
- Open MPI. [Link]
- J. Hursey, R. Graham, "Preserving Collective Performance Across Process Failure for a Fault Tolerant MPI," 16th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) held in conjunction with the 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2011. [Link]
- J. Hursey, T. Naughton, G. Vallee, R. Graham, "A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI," EuroMPI 2011: Proceedings of the 18th EuroMPI Conference, 2011. [Link]
- HFODD: J. Dobaczewski et al., "Solution of the Skyrme-Hartree-Fock-Bogolyubov equations in the Cartesian deformed harmonic-oscillator basis. (VI) HFODD (v2.40h): A new version of the program," Computer Physics Communications, 2009. [Link]
- MPI Testing Tool. [Link]
- Conclusions
- Oak Ridge Institute for Science and Education. [Link]
- Oak Ridge Leadership Computing Facility (OLCF). [Link]
- DOE Innovative and Novel Computational Impact on Theory and Experiment (INCITE) Program. [Link]
- Extreme Science and Engineering Discovery Environment (XSEDE). [Link]
- LittleFe: Computational Science Education on the Move. [Link]
- Shodor Foundation. [Link]
- Open MPI. [Link]
- Message Passing Interface (MPI) Forum. [Link]