Ares II: Tracing the Flaws of a (Storage) God

Chryssis Georgiou, Nicolas Nicolaou, Andria Trigeorgi

TL;DR

The paper tackles performance bottlenecks in the Ares family of reconfigurable, fault-tolerant distributed shared memory systems by applying distributed tracing with OpenTelemetry/Jaeger. It identifies stable overheads in configuration discovery and bottlenecks tied to reconfiguration and large-object handling, then introduces three optimizations—piggybacking configuration data on data messages, garbage collecting obsolete configurations, and batching reconfigurations across multiple objects. The authors prove the correctness of Ares II and demonstrate performance gains across various Ares variants, including EC- and ABD-based DAPs, with fragmentation and coverability considerations. The work shows that tracing-driven optimizations can deliver substantial improvements in both storage efficiency and latency, preserving atomicity while enabling scalable, reconfigurable DSM for large objects and dynamic server sets.
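To give a rough feel for why piggybacking configuration data on data messages can remove a stable overhead, the following toy model counts client round trips for a read under two strategies. It is a simplified sketch, not the paper's protocol: a single `Config` record stands in for a whole quorum of servers, and the names `Config`, `next_id`, `read_separate_discovery`, `read_piggybacked`, and the phase name "read-config" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Config:
    """One configuration; a single record stands in for a whole quorum (assumption)."""
    value: str
    next_id: Optional[int] = None  # successor configuration, if one was installed


def read_separate_discovery(configs: Dict[int, Config], start: int) -> Tuple[str, int]:
    """Discover the latest configuration first, then fetch the data."""
    trips, cid = 0, start
    while True:
        trips += 1                      # one configuration-discovery round trip
        nxt = configs[cid].next_id
        if nxt is None:
            break
        cid = nxt
    trips += 1                          # one data round trip on the last configuration
    return configs[cid].value, trips


def read_piggybacked(configs: Dict[int, Config], start: int) -> Tuple[str, int]:
    """Every data reply also carries the successor pointer, so no separate discovery."""
    trips, cid = 0, start
    while True:
        trips += 1                      # data round trip; reply includes next_id
        cfg = configs[cid]
        if cfg.next_id is None:
            return cfg.value, trips
        cid = cfg.next_id


# No reconfiguration has happened: only the initial configuration exists.
stable = {0: Config("v0")}
# Two reconfigurations have happened: 0 -> 1 -> 2.
chain = {0: Config("v0", 1), 1: Config("v1", 2), 2: Config("v2")}

print(read_separate_discovery(stable, 0), read_piggybacked(stable, 0))
print(read_separate_discovery(chain, 0), read_piggybacked(chain, 0))
```

In this toy model the saving is one round trip per read, and it shows up even when no reconfiguration occurs, which is the sense in which the discovery overhead is "stable". The real operations involve quorums and additional phases, so the actual counts differ.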

Abstract

Ares is a modular framework, designed to implement dynamic, reconfigurable, fault-tolerant, read/write and strongly consistent distributed shared memory objects. Recent enhancements of the framework have realized the efficient implementation of large objects, by introducing versioning and data striping techniques. In this work, we identify performance bottlenecks of the Ares variants by utilizing distributed tracing, a popular technique for monitoring and profiling distributed systems. We then propose optimizations across all versions of Ares, aiming to overcome the identified flaws while preserving correctness. We refer to the optimized version of Ares as Ares II, which now features a piggyback mechanism, a garbage collection mechanism, and a batching reconfiguration technique for improving the performance and storage efficiency of the original Ares. We rigorously prove the correctness of Ares II, and we demonstrate the performance improvements by an experimental comparison (via distributed tracing) of the Ares II variants with their original counterparts.
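As a minimal illustration of how tracing exposes a bottleneck, the sketch below records nested, timed spans for the phases of one operation and then compares the child spans. It uses only the Python standard library and is not the OpenTelemetry or Jaeger API; the `Tracer` class, the `traced_read` function, and the phase names are hypothetical placeholders, and the `time.sleep` calls merely stand in for real work.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy in-process span recorder (hypothetical; not the OpenTelemetry API)."""

    def __init__(self):
        self.spans = []    # finished spans as (name, parent name, seconds)
        self._stack = []   # names of the spans that are currently open

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self._stack.pop()
            self.spans.append((name, parent, elapsed))


def traced_read(tracer):
    # Phase names are placeholders for the steps of a read operation.
    with tracer.span("read"):
        with tracer.span("read-config"):
            time.sleep(0.002)   # stand-in for configuration discovery
        with tracer.span("get-data"):
            time.sleep(0.001)   # stand-in for fetching the object


tracer = Tracer()
traced_read(tracer)

# Group the time spent by each direct child of the top-level operation.
children = {name: secs for name, parent, secs in tracer.spans if parent == "read"}
for name, secs in children.items():
    print(f"{name}: {secs * 1000:.2f} ms")
```

A real deployment would export such spans from every client and server to a tracing backend so that one operation can be followed across machines; this sketch only shows the parent/child timing structure that makes a slow phase visible.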

Paper Structure

This paper contains 32 sections, 15 theorems, 16 figures, and 3 tables.

Key Result

lemma 1

Let $\xi$ be an execution of an algorithm $A$ that uses the EC-DAP II. If $\phi$ is a $c.\text{get-tag}()$ that returns $\tau_{\pi} \in \mathcal{T}$ or a $c.\text{get-data}()$ that returns $\langle \tau_{\pi}, v_{\pi} \rangle \in \mathcal{T} \times$ ...

Figures (16)

  • Figure 1: The architecture of our implementation.
  • Figure 2: READ Operation - S:11, W:5, R:5, fsize: 512MB
  • Figure 3: READ Operation - algorithm: Ares EC, S:3, W:5, R:50, fsize:4MB, Debug Level:DSMM
  • Figure 4: READ Operation - algorithm: Ares EC, S:11, W:5, R:50, fsize:4MB, Debug Level:DSMM
  • Figure 5: READ Operation - algorithm: Co Ares EC F, S:11, W:5, R:5, fsize:512MB, Min/Avg Block Size:2MB, max Block Size:4MB, Debug Level:USER
  • ...and 11 more figures

Theorems & Definitions (16)

  • lemma 1: C2
  • lemma 2: C1
  • Theorem 1: Safety
  • Theorem 2: Liveness
  • definition 1: Subsequence
  • lemma 3
  • lemma 4
  • lemma 5: Configuration Uniqueness
  • lemma 6
  • lemma 7: Subsequence
  • ...and 6 more