In this video from PASC18, Leonardo Bautista Gomez from the Barcelona Supercomputing Center presents "Easy and Efficient Multilevel Checkpointing for Extreme Scale Systems."
“Extreme scale supercomputers offer thousands of computing nodes to their users to satisfy their computing needs. As the need for massively parallel computing increases in industry, computing centers are being forced to increase in size and to transition to new computing technologies. While the advantage for the users is clear, such evolution imposes significant challenges, such as energy consumption and reliability. In this talk, we will discuss how to guarantee high reliability to high performance applications running in extreme scale supercomputers. In particular, we cover the tools necessary to implement scalable multilevel checkpointing for tightly coupled applications. This includes an overview of failure types and frequency in current HPC systems. The talk will also cover the theoretical analysis necessary to achieve optimal utilization of the computing resources. Moreover, we will discuss the internals of the FTI library tool, to study how multilevel checkpointing is implemented today.”
Thanks to Rich Brueckner from insideHPC Media Publications for recording the video.