WRATH: Workload Resilience Across Task Hierarchies in Task-based Parallel Programming Frameworks

Sicheng Zhou, Zhuozhao Li, Valérie Hayot-Sasson, Haochen Pan, Maxime Gonthier, J. Gregory Pauloski, Ryan Chard, Kyle Chard, Ian Foster

TL;DR

WRATH improves resilience in task-based parallel programming (TBPP) by categorizing failures across the four TBPP layers (application, framework, runtime, and environment) and pairing a distributed, hierarchical monitoring system with a resilience module that retries failed tasks hierarchically. Rather than applying one uniform retry strategy, WRATH selects recovery actions based on the root cause of a failure and the resource context. The authors implement WRATH in Parsl with a failure-categorization engine and a policy engine, and evaluate it on TaPS benchmarks under failure injection, reporting higher task success rates, shorter time to failure for tasks that cannot succeed, and modest overhead. The paper also outlines directions for extending support to compiled languages and other TBPP frameworks.

Abstract

Failures in Task-based Parallel Programming (TBPP) can severely degrade performance and result in incomplete or incorrect outcomes. Existing failure-handling approaches, including reactive, proactive, and resilient methods such as retry and checkpointing mechanisms, often apply uniform retry mechanisms regardless of the root cause of failures, failing to account for the unique characteristics of TBPP frameworks such as heterogeneous resource availability and task-level failures. To address these limitations, we propose WRATH, a novel systematic approach that categorizes failures based on the unique layered structure of TBPP frameworks and defines specific responses to address failures at different layers. WRATH combines a distributed monitoring system and a resilient module to collaboratively address different types of failures in real time. The monitoring system captures execution and resource information, reports failures, and profiles tasks across different layers of TBPP frameworks. The resilient module then categorizes failures and responds with appropriate actions, such as hierarchically retrying failed tasks on suitable resources. Evaluations demonstrate that WRATH significantly improves TBPP robustness, tripling the task success rate and maintaining an application success rate of over 90% for resolvable failures. Additionally, WRATH can reduce the time to failure by 20%-50%, allowing tasks that are destined to fail to be identified and fail more quickly.
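
The categorize-then-respond idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use the Parsl API: the names `Layer`, `Action`, `TAXONOMY`, `categorize`, `choose_action`, and `run_with_policy` are hypothetical, and the mapping from exception types to layers and to "resolvable" status is an assumption made only for illustration. The sketch shows a failure being assigned to a layer, a non-resolvable failure being surfaced immediately, and a resource-related failure being retried on a different resource.

```python
from enum import Enum


class Layer(Enum):
    # The four layers named in the paper's architecture figure.
    APPLICATION = "application"
    FRAMEWORK = "framework"
    RUNTIME = "runtime"
    ENVIRONMENT = "environment"


class Action(Enum):
    FAIL_FAST = "fail_fast"
    RETRY_SAME_RESOURCE = "retry_same_resource"
    RETRY_OTHER_RESOURCE = "retry_other_resource"


# Hypothetical taxonomy (an assumption, not the paper's library):
# exception type -> (layer, whether a retry could plausibly resolve it).
TAXONOMY = {
    ZeroDivisionError: (Layer.APPLICATION, False),
    MemoryError: (Layer.RUNTIME, True),
    ModuleNotFoundError: (Layer.ENVIRONMENT, True),
    TimeoutError: (Layer.FRAMEWORK, True),
}


def categorize(exc):
    """Return (layer, resolvable) for an exception."""
    for exc_type, entry in TAXONOMY.items():
        if isinstance(exc, exc_type):
            return entry
    # Unknown failures are treated as non-resolvable in this sketch.
    return (Layer.APPLICATION, False)


def choose_action(exc):
    """Pick a response based on the failure's layer and resolvability."""
    layer, resolvable = categorize(exc)
    if not resolvable:
        return Action.FAIL_FAST
    if layer in (Layer.RUNTIME, Layer.ENVIRONMENT):
        return Action.RETRY_OTHER_RESOURCE
    return Action.RETRY_SAME_RESOURCE


def run_with_policy(task, resources, max_retries=3):
    """Run task(resource); return (result, list of resources tried)."""
    idx = 0
    attempts = []
    last_exc = None
    for _ in range(max_retries + 1):
        resource = resources[idx]
        attempts.append(resource)
        try:
            return task(resource), attempts
        except Exception as exc:
            last_exc = exc
            action = choose_action(exc)
            if action is Action.FAIL_FAST:
                raise  # do not spend the retry budget on a doomed task
            if action is Action.RETRY_OTHER_RESOURCE:
                if idx + 1 >= len(resources):
                    raise  # no alternative resource is left
                idx += 1
    raise RuntimeError("retry budget exhausted") from last_exc


def flaky(resource):
    # Simulated task that runs out of memory on one node only.
    if resource == "node-a":
        raise MemoryError("out of memory")
    return f"ok on {resource}"


result, attempts = run_with_policy(flaky, ["node-a", "node-b"])
print(result, attempts)
```

In this sketch a `ZeroDivisionError` raised by user code propagates on the first attempt, which mirrors the fail-fast behavior the abstract describes for tasks that are destined to fail, while the simulated out-of-memory failure is retried on the next resource in the list.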

Paper Structure

This paper contains 26 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Typical architecture of TBPP frameworks. In the Application Layer, users define applications into tasks using the provided programming interfaces. The Framework Layer orchestrates the execution of tasks. The Runtime Layer allocates resources to tasks. The Environment Layer manages the underlying infrastructure and package dependencies.
  • Figure 2: Flow of the failure categorization engine and resilience policy engine. FTL: Failure Taxonomy Library.
  • Figure 3: Wrath system architecture diagram. Components in yellow and orange denote components of Wrath. MA: Monitoring Agent. R: Communication Radio.
  • Figure 4: Normalized time to failure for the applications with different failure types when Wrath is enabled. All results are normalized to those without Wrath. Failure rate = 0.3, Nodes = 32. Error bars represent the standard error of the mean (SEM) across 10 independent runs. All of the trials shown here failed, but those with Wrath failed sooner.
  • Figure 5: Overhead ratio of Wrath on successful runs of each application with a pre-set failure rate of 0.1 on 32 nodes.
  • ...and 3 more figures