In this video from PASC18, Yves Robert of École normale supérieure de Lyon in France presents: Recent Results and Open Problems for Resilience at Scale.
“The talk will address the following three questions: (i) fail-stop errors: checkpointing or replication or both? (ii) silent errors: application-specific detectors or plain old trustworthy replication? (iii) workflows: how to avoid checkpointing every task?”
Thanks to Rich Brueckner from insideHPC Media Publications for recording the video.