Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

Reitz, Lukas; Fohry, Claudia

Article

Abstract

Exascale supercomputers consist of millions of processing units, and this number is still growing. Therefore, hardware failures, such as permanent node failures, become increasingly frequent. They can be tolerated with system-level Checkpoint/Restart, which saves the whole application state transparently and, if needed, restarts the application from the saved state; or with application-level checkpointing, which saves only relevant data via explicit calls in the program. The former approach requires no additional programming expense, whereas the latter is more efficient and allows program execution to continue on the intact resources after failures (localized shrinking recovery). An increasingly popular programming paradigm is asynchronous many-task (AMT) programming. Here, programmers identify parallel tasks, and a runtime system assigns the tasks to worker threads. Since tasks have clearly defined interfaces, the runtime system can automatically extract and save their interface data. This approach, called task-level checkpointing (TC), combines the respective strengths of system-level and application-level checkpointing. AMTs come in many variants, and so far, TC has only been applied to a few, rather simple variants. This paper considers TC for a different AMT variant: nested fork–join (NFJ) programs that run on clusters of multicore nodes under work stealing. We present the first TC scheme for this setting. It performs a localized shrinking recovery and can handle multiple node failures. In experiments with four benchmarks, we observed execution time overheads of around 44 % at 1536 workers, and negligible recovery costs. Additionally, we developed and experimentally validated a prediction model for the running times of the scheme.
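
The general idea of task-level checkpointing for a nested fork–join program can be illustrated with a small toy sketch. It is not the scheme from the paper: it runs sequentially in one process, has no work stealing and no cluster, and uses a plain dictionary in place of resilient checkpoint storage. The names `Runtime`, `NodeFailure`, `spawn` and `run_with_failure`, as well as the use of tree paths as task IDs, are assumptions made only for this illustration.

```python
from typing import Dict, Optional, Tuple

TaskId = Tuple[int, ...]  # hypothetical task ID: path from the root task


class NodeFailure(Exception):
    """Simulated failure that aborts the current run (assumption for this sketch)."""


class Runtime:
    def __init__(self, store: Dict[TaskId, int], fail_after: Optional[int] = None):
        self.store = store            # stands in for resilient checkpoint storage
        self.fail_after = fail_after  # abort once this many tasks have started
        self.executed = 0             # task bodies started in this run

    def spawn(self, task_id: TaskId, n: int) -> int:
        # A task whose result was saved earlier is not executed again.
        if task_id in self.store:
            return self.store[task_id]
        if self.fail_after is not None and self.executed >= self.fail_after:
            raise NodeFailure(task_id)
        self.executed += 1
        if n < 2:
            result = n
        else:
            # "Fork" two child tasks and "join" their results (sequentially here).
            left = self.spawn(task_id + (0,), n - 1)
            right = self.spawn(task_id + (1,), n - 2)
            result = left + right
        # The runtime, not the task body, saves the task's interface data.
        self.store[task_id] = result
        return result


def run_with_failure(n: int, fail_after: int):
    # Failure-free reference run with its own empty store.
    baseline = Runtime({})
    expected = baseline.spawn((), n)

    # Run that is interrupted; the store survives the simulated failure.
    store: Dict[TaskId, int] = {}
    try:
        Runtime(store, fail_after).spawn((), n)
    except NodeFailure:
        pass
    saved = len(store)

    # Recovery: restart from the root and reuse every saved task result.
    recovery = Runtime(store)
    result = recovery.spawn((), n)
    return result, expected, saved, recovery.executed, baseline.executed
```

Calling `run_with_failure(10, 50)` returns the same result as the failure-free run, and the recovery run executes only those tasks whose results were not saved before the simulated failure. In a real distributed setting, the questions of where checkpoints are stored and which worker re-executes lost tasks are exactly what a concrete TC scheme has to answer; this sketch leaves them out.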

Citation

In: SN Computer Science Volume 5 (2024-03-13) eissn:2661-8907

Funding note

Funded within the framework of Projekt DEAL

Collection(s)

Articles (Open Access publications funded by the University Library)

Cite

BibTeX

@article{doi:10.17170/kobra-202403149779,
   author={Reitz, Lukas and Fohry, Claudia},
   title={Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters},
   journal={SN Computer Science},
   year={2024}
}

0500 Oax
0501 Text $btxt$2rdacontent
0502 Computermedien $bc$2rdacarrier
1100 2024$n2024
1500 1/eng
2050 ##0##http://hdl.handle.net/123456789/15562
3000 Reitz, Lukas
3010 Fohry, Claudia
4000 Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters / Reitz, Lukas
4030 
4060 Online-Ressource
4085 ##0##=u http://nbn-resolving.de/http://hdl.handle.net/123456789/15562=x R
4204 \$dAufsatz
4170 
5550 {{Programmierung}}
5550 {{Fehlertoleranz}}
5550 {{Fixpunkt <Datensicherung>}}
5550 {{Cluster}}
7136 ##0##http://hdl.handle.net/123456789/15562

<resource xsi:schemaLocation="http://datacite.org/schema/kernel-2.2 http://schema.datacite.org/meta/kernel-2.2/metadata.xsd">
2024-03-18T09:50:46Z
2024-03-18T09:50:46Z
2024-03-13
doi:10.17170/kobra-202403149779
http://hdl.handle.net/123456789/15562
Funded within the framework of Projekt DEAL
eng
Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/
asynchronous many-task programming
fault tolerance
task-level checkpointing
work stealing
004
Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters
Article
open access
Reitz, Lukas
Fohry, Claudia
doi:10.1007/s42979-024-02624-8
Programmierung
Fehlertoleranz
Fixpunkt &lt;Datensicherung&gt;
Cluster
publishedVersion
eissn:2661-8907
SN Computer Science
Volume 5
false
320
</resource>

The following license terms are associated with this resource:

Creative Commons License

Unless otherwise indicated, the license is described as follows: Attribution 4.0 International