Towards a Peer-to-Peer Data Distribution Layer for Efficient and Collaborative Resource Optimization of Distributed Dataflow Applications

Dominik Scheinert, Soeren Becker, Jonathan Will, Luis Englaender, Lauritz Thamsen

TL;DR

The paper tackles the challenge of limited performance-modeling data for distributed dataflow applications and proposes a decentralized, peer-to-peer data distribution layer to enable collaborative modeling. It designs an IPFS-based architecture (including IPFS-Log and OrbitDB) that organizes data into a contributions store, supports data validation, and automates data contributions after executions. Through a prototype (PeersDB) and simulations (Testground), it demonstrates the feasibility of scalable, fault-tolerant data sharing across geographically distributed peers and discusses trade-offs, potential enhancements (e.g., blockchain-backed guarantees), and incentive considerations. The work is significant for enabling scalable, privacy-conscious collaboration in performance modeling without relying on a central data repository, potentially improving prediction accuracy and resource efficiency in distributed dataflow systems.
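
To make the idea of a validated, replicated contributions store more concrete, the following is a minimal sketch in plain Python. It is not the prototype's implementation and does not use the IPFS or OrbitDB APIs: the `Peer` class, the record fields (`job`, `scale_out`, `runtime_s`), and the validation rule are all hypothetical stand-ins chosen for illustration. It only shows the general pattern of content-addressed entries that each peer checks again before accepting them from another peer.

```python
import hashlib
import json
from dataclasses import dataclass, field


def content_id(record: dict) -> str:
    """Derive an identifier from the canonical JSON form of a record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_valid(record: dict) -> bool:
    """Hypothetical validation rule: required fields exist and are positive."""
    required = {"job", "scale_out", "runtime_s"}
    return (
        required <= record.keys()
        and isinstance(record["scale_out"], int)
        and record["scale_out"] > 0
        and isinstance(record["runtime_s"], (int, float))
        and record["runtime_s"] > 0
    )


@dataclass
class Peer:
    """In-memory stand-in for one peer's copy of the contributions store."""

    name: str
    store: dict = field(default_factory=dict)  # content id -> record

    def contribute(self, record: dict) -> str:
        """Validate a local record and add it under its content id."""
        if not is_valid(record):
            raise ValueError("record failed validation")
        cid = content_id(record)
        self.store[cid] = record
        return cid

    def sync_from(self, other: "Peer") -> int:
        """Pull entries missing locally; re-hash and re-validate each one."""
        added = 0
        for cid, record in other.store.items():
            if cid in self.store:
                continue
            if content_id(record) != cid or not is_valid(record):
                continue  # reject tampered or malformed entries
            self.store[cid] = record
            added += 1
        return added


if __name__ == "__main__":
    a, b = Peer("a"), Peer("b")
    a.contribute({"job": "sort", "scale_out": 4, "runtime_s": 81.5})
    print(b.sync_from(a), "entry replicated to peer b")
```

Because identifiers are derived from content, repeated synchronization is idempotent, and a receiving peer can detect entries whose content does not match their identifier without trusting the sender.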

Abstract

Performance modeling can help to improve the resource efficiency of clusters and distributed dataflow applications, yet the available modeling data is often limited. Collaborative approaches to performance modeling, characterized by the sharing of performance data or models, have been shown to improve resource efficiency, but there has been little focus on actual data sharing strategies and implementation in production environments. This missing building block holds back the realization of proposed collaborative solutions. In this paper, we envision, design, and evaluate a peer-to-peer performance data sharing approach for collaborative performance modeling of distributed dataflow applications. Our proposed data distribution layer enables access to performance data in a decentralized manner, thereby facilitating collaborative modeling approaches and allowing for improved prediction capabilities and hence increased resource efficiency. In our evaluation, we assess our approach with regard to deployment, data replication, and data validation, through experiments with a prototype implementation and simulation, demonstrating feasibility and allowing discussion of potential limitations and next steps.

Paper Structure (18 sections, 4 figures, 2 tables)

This paper contains 18 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the envisioned system. From the user's perspective, sharing and collecting data is abstracted away and takes place under the hood, so that the attention is directed toward performance modeling of dataflows.
  • Figure 2: The proposed workflows and actions. The data distribution layer represents a key component and facilitates data management and access for downstream tasks such as performance modeling for resource allocations.
  • Figure 3: Architecture of the prototype implementation. The API layer facilitates access and usage of the developed service routines, which are in sync with other peers. Internally, the storage solution is based on IPFS, but regulates access to certain data and offers simplified, database-like means of interaction.
  • Figure 4: Results of our experiments with the prototype implementation. Small differences in data transmission times can be observed, as well as the bootstrapping time being conditioned on PeersDB cluster sizes.