SWARM+: Scalable and Resilient Multi-Agent Consensus for Fully-Decentralized Data-Aware Workload Management

Komal Thareja, Krishnan Raghavan, Anirban Mandal, Ewa Deelman

Abstract

Distributed scientific workflows increasingly span heterogeneous compute clusters, edge resources, and geo-distributed data repositories. In these environments, a centralized orchestrator is an architectural bottleneck -- introducing a single point of failure, limiting scalability, and constraining adaptability to changing resource availability or failures. Decentralized multi-agent coordination offers a compelling alternative: autonomous agents representing distributed resources collaboratively negotiate workload assignment (e.g., job selection) through peer-to-peer consensus, making decisions based on local compute capacity, data locality, and network conditions. However, scaling such systems for production workloads requires addressing challenges in coordination, resilience, and data-aware optimization. This work presents SWARM+, which builds on our prior work that demonstrated the feasibility of multi-agent decentralized consensus for distributed job selection. SWARM+ addresses three main problems: scalability of consensus for large numbers of agents, resilience of workload management under agent failure, and efficiency of job scheduling for highly distributed resources and data-intensive workloads. For each problem, we propose novel algorithms and evaluate them in the distributed FABRIC testbed. The results show that SWARM+ (a) scales to 1000 distributed agents with nearly equal workload distribution across the hierarchy levels and reduced coordination overhead due to hierarchical consensus, (b) is resilient to agent failures, maintaining >99% job completion rate under single agent failure, and demonstrating graceful system degradation, with at most 7.5% impact under 50% agent failures, and (c) achieves 97-98% improvement over baseline SWARM for both selection time and scheduling latency metrics.

Paper Structure (16 sections, 3 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Hierarchical job selection (2-level scenario). Level 1 agents ($A_i$) select jobs from dynamic global distributed pool via consensus and delegate to Level 0 groups; Level 0 agents ($a_1..a_n$, $b_1..b_m$) perform local job selection via consensus within respective groups for resource allocation.
  • Figure 2: SWARM+ architecture and hierarchical topology. (Left) Three architectural layers: Hierarchical Multi-Agent System Layer, Consensus Layer, and Selection Layer. (Right) Hierarchical-110 deployment: 10 Level 0 groups (orange, 10 agents each) coordinated by 10 Level 1 agents (blue), demonstrating site-aligned organization that confines intra-group consensus to local meshes.
  • Figure 3: Multi-site deployment across 10 FABRIC sites (110 agent scenario).
  • Figure 4: Mean selection time distribution for Hier-110 (110 agents, 1000 jobs). Level 0 agents achieve 0.99 s mean selection time handling 50.1% of jobs (1001); Level 1 agents achieve 1.10 s mean handling 49.8% of jobs (994).
  • Figure 5: Agent failure recovery under three failure scenarios and using two kinds of failure detection methods -- gRPC-based (with Redis enabled as fallback) and Redis-based.
  • ...and 1 more figure
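
The Hierarchical-110 layout described in the Figure 2 caption (10 Level 0 groups of 10 agents each, coordinated by 10 Level 1 agents) can be sketched as a small data structure. This is only an illustration under stated assumptions, not the SWARM+ implementation: the names `Group`, `build_hierarchy`, and `mesh_links`, the agent and site identifiers, and the one-coordinator-per-group pairing are hypothetical. The link count assumes a full mesh within each Level 0 group and among the Level 1 agents, and ignores delegation links between levels. It counts pairwise links only and does not model measured coordination overhead.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One site-aligned Level 0 group plus its Level 1 coordinator (hypothetical model)."""
    site: str
    coordinator: str                                   # Level 1 agent id (assumed naming)
    members: list[str] = field(default_factory=list)   # Level 0 agent ids


def build_hierarchy(num_groups: int = 10, agents_per_group: int = 10) -> list[Group]:
    """Build a 2-level topology: num_groups groups, each with agents_per_group Level 0 agents."""
    groups = []
    for g in range(num_groups):
        members = [f"L0-{g}-{i}" for i in range(agents_per_group)]
        groups.append(Group(site=f"site-{g}", coordinator=f"L1-{g}", members=members))
    return groups


def total_agents(groups: list[Group]) -> int:
    """Count Level 0 members plus one Level 1 coordinator per group."""
    return sum(len(g.members) + 1 for g in groups)


def mesh_links(n: int) -> int:
    """Number of pairwise links in a full mesh of n agents."""
    return n * (n - 1) // 2


def hierarchical_links(groups: list[Group]) -> int:
    """Links when consensus is confined to each Level 0 mesh plus one Level 1 mesh.

    Assumption: delegation links between a coordinator and its group are not counted.
    """
    level0 = sum(mesh_links(len(g.members)) for g in groups)
    level1 = mesh_links(len(groups))
    return level0 + level1


if __name__ == "__main__":
    groups = build_hierarchy()
    n = total_agents(groups)
    print("agents:", n)
    print("flat full-mesh links:", mesh_links(n))
    print("hierarchical mesh links:", hierarchical_links(groups))
```

Under these assumptions, the flat mesh over all 110 agents has far more pairwise links than the site-aligned hierarchy, which is one way to picture why confining intra-group consensus to local meshes can reduce coordination traffic.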