COSTREAM: Learned Cost Models for Operator Placement in Edge-Cloud Environments

Roman Heinrich, Carsten Binnig, Harald Kornmayer, Manisha Luthra

TL;DR

COSTREAM tackles initial operator placement for streaming queries in heterogeneous edge-cloud settings by learning a zero-shot cost model. It introduces a joint operator-resource graph and a GNN-based cost predictor that generalizes to unseen hardware and workloads, enabling offline placement without runtime statistics. Placements optimized with COSTREAM achieve a median speed-up of around 21x over baselines, and the model shows strong predictive accuracy across varied hardware, query structures, and benchmarks, supported by a new large-scale cost-estimation dataset. This work advances cost-based optimization for edge-cloud DSPS and lays groundwork for broader offline optimizations in heterogeneous environments.

Abstract

In this work, we present COSTREAM, a novel learned cost model for Distributed Stream Processing Systems that provides accurate predictions of the execution costs of a streaming query in an edge-cloud environment. The cost model can be used to find an initial placement of operators across heterogeneous hardware, which is particularly important in these environments. In our evaluation, we demonstrate that COSTREAM can produce highly accurate cost estimates for the initial operator placement and even generalize to unseen placements, queries, and hardware. When using COSTREAM to optimize the placements of streaming operators, a median speed-up of around 21x can be achieved compared to baselines.

Paper Structure (20 sections, 3 equations, 14 figures, 6 tables, 1 algorithm)

Figures (14)

  • Figure 1: Estimation errors when predicting E2E latency for queries that are similar to the training data (left) or entirely unseen in terms of underlying hardware and other query properties (right). COSTREAM predicts query execution costs more precisely than an existing cost model baseline (Flat Vector).
  • Figure 2: Overview of our approach, which uses the learned cost model COSTREAM for operator placement. COSTREAM is trained with a zero-shot approach on a broad set of workloads and hardware and can thus infer costs even for unseen workloads and hardware.
  • Figure 3: Cost estimation with COSTREAM: ① First, a DSPS query is represented as a graph of operators that form a DAG. ② For each operator, a host is selected on which it should be placed. This is the operator placement. ③ The data streams, query operators, and hardware nodes, as well as the data flow and placement, are represented as a common learnable graph. ④ Each graph node is described with transferable features that are embedded by node-type-specific encoders into hidden states. ⑤ These states are propagated through the graph by a message passing scheme. A separate MLP finally transforms the hidden state into a cost prediction.
  • Figure 4: Optimizer model. ① The operator and hardware nodes are described using transferable features. ② $k$ placement candidates are generated by randomly distributing the operators across the hardware nodes, and parallel COSTREAM instances predict the query execution costs of each candidate. ③ We average the predictions of the target metric ($L_p$ in this example), filter out candidates that are predicted to be backpressured or unsuccessful, and choose the one with the lowest cost as the resulting placement.
  • Figure 5: Rules for placement enumeration in our benchmark. ① Operator co-location, ② increasing computing capability along the physical data flow, ③ acyclic placements.
  • ...and 9 more figures
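
The candidate selection described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Predictor` interface, the function name `choose_placement`, and the rule that a candidate is dropped as soon as any model instance flags it as backpressured or unsuccessful are all hypothetical, since the caption does not specify how the filter is aggregated across instances.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A placement maps each operator name to a host name.
Placement = Dict[str, str]

# Hypothetical stand-in for one trained cost model instance. It returns
# (predicted_cost, feasible), where feasible is False if the placement is
# predicted to be backpressured or unsuccessful.
Predictor = Callable[[Placement], Tuple[float, bool]]


def choose_placement(
    operators: Sequence[str],
    hosts: Sequence[str],
    predictors: Sequence[Predictor],
    k: int,
    seed: int = 0,
) -> Optional[Placement]:
    """Sample k random placements and return the feasible one with the
    lowest average predicted cost, or None if no candidate is feasible."""
    rng = random.Random(seed)
    best: Optional[Placement] = None
    best_cost = float("inf")
    for _ in range(k):
        # Step 2: distribute operators randomly over the hardware nodes.
        candidate = {op: rng.choice(list(hosts)) for op in operators}
        predictions: List[Tuple[float, bool]] = [p(candidate) for p in predictors]
        # Step 3a (assumption): drop the candidate if any instance flags it.
        if not all(feasible for _, feasible in predictions):
            continue
        # Step 3b: average the predicted target metric over all instances.
        cost = mean([c for c, _ in predictions])
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best


if __name__ == "__main__":
    # Toy predictors for illustration only: edge hosts are slower, and a
    # sink placed on the edge host is treated as infeasible.
    latency = {"edge": 5.0, "cloud": 1.0}

    def toy_a(p: Placement) -> Tuple[float, bool]:
        return sum(latency[h] for h in p.values()), p["sink"] != "edge"

    def toy_b(p: Placement) -> Tuple[float, bool]:
        return sum(latency[h] for h in p.values()) + 1.0, True

    print(choose_placement(["source", "filter", "sink"], ["edge", "cloud"],
                           [toy_a, toy_b], k=200))
```

Random sampling keeps the search cheap because every candidate is scored by model inference rather than by executing the query, which matches the offline setting the caption describes.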

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8