Fast Prototyping of Distributed Stream Processing Applications with stream2gym

Md. Monzurul Amin Ifath, Miguel Neves, Israat Haque

TL;DR

This work tackles the challenge of prototyping and testing distributed stream processing pipelines at scale, which traditionally relies on expensive testbeds or imperfect simulations. It introduces stream2gym, a Mininet-based platform with a GraphML-driven API that lets developers describe data flows, topology, and operational conditions while running real components (e.g., Kafka, Spark) on commodity hardware. The authors demonstrate the tool's ability to reproduce published research results and to test applications under varied networking scenarios, achieving hardware-like accuracy with low resource overhead and scaling to tens of components. Overall, stream2gym offers a practical, open-source solution for end-to-end testing, debugging, and reproducibility in distributed stream processing environments, potentially accelerating development and comparison across proposals.

Abstract

Stream processing applications have been widely adopted to meet real-time data analytics demands, e.g., fraud detection, video analytics, and IoT applications. Unfortunately, prototyping and testing these applications is still a cumbersome process that usually requires an expensive testbed and deep multi-disciplinary expertise in areas such as networking, distributed systems, and data engineering. As a result, it takes a long time to deploy stream processing applications into production, and users still face several correctness and performance issues. In this paper, we present stream2gym, a tool for the fast prototyping of large-scale distributed stream processing applications. stream2gym builds on Mininet, a widely adopted network emulation platform, and provides a high-level interface that lets developers easily test their applications under various operating conditions. We demonstrate the benefits of stream2gym by prototyping and testing several applications and by reproducing key findings from prior research in video analytics and network traffic monitoring. Moreover, we show that stream2gym produces accurate results compared to a hardware testbed while consuming few resources (it runs on a single commodity laptop even when emulating a dozen processing nodes).

Paper Structure

This paper contains 21 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: stream2gym architecture and workflow. SPE = Stream Processing Engine.
  • Figure 2: a) Example data processing pipeline; b) Target pipeline allocation into the emulated infrastructure.
  • Figure 3: Example YAML configurations for the a) data source; and b) word count components of the data processing pipeline described in Figure 2.
  • Figure 4: GraphML description for the data processing pipeline presented in Figure 2. We omit some lines due to space constraints.
  • Figure 5: End-to-end latency for the word count application when varying the link delay to reach each of its components. At each run, we increase the link delay of a single component and keep the remaining ones at a very low value (<10 ms).
  • ...and 4 more figures
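
The captions for Figures 4 and 5 mention a GraphML pipeline description and an experiment that raises the link delay of one component per run. The Python sketch below illustrates that idea using only the standard library. It is not stream2gym's actual schema or API: the component names (`source`, `broker`, `wordcount`), the `role` and `delay_ms` attribute keys, the single-switch star topology, and both function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def build_pipeline_graphml(link_delays_ms):
    """Return a GraphML string with one host per component, all attached to one switch.

    link_delays_ms maps a component name to the delay (in ms) of its access link.
    The 'role' and 'delay_ms' keys are hypothetical, not stream2gym's real attributes.
    """
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    # GraphML requires custom attributes to be declared with <key> elements.
    ET.SubElement(root, "key", {"id": "role", "for": "node",
                                "attr.name": "role", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "delay_ms", "for": "edge",
                                "attr.name": "delay_ms", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", {"id": "pipeline", "edgedefault": "undirected"})

    switch = ET.SubElement(graph, "node", {"id": "s1"})
    ET.SubElement(switch, "data", {"key": "role"}).text = "switch"

    for name, delay in link_delays_ms.items():
        host = ET.SubElement(graph, "node", {"id": name})
        ET.SubElement(host, "data", {"key": "role"}).text = "host"
        # One access link per component, annotated with its emulated delay.
        edge = ET.SubElement(graph, "edge", {"source": name, "target": "s1"})
        ET.SubElement(edge, "data", {"key": "delay_ms"}).text = str(float(delay))

    return ET.tostring(root, encoding="unicode")


def delay_sweep(components, delays_ms, baseline_ms=1.0):
    """Yield (target, delay, link_delays) with exactly one link raised per run."""
    for target in components:
        for delay in delays_ms:
            link_delays = {name: baseline_ms for name in components}
            link_delays[target] = delay
            yield target, delay, link_delays


if __name__ == "__main__":
    components = ["source", "broker", "wordcount"]  # hypothetical component names
    for target, delay, link_delays in delay_sweep(components, [50, 100]):
        description = build_pipeline_graphml(link_delays)
        print(f"run: {target} link at {delay} ms, {len(description)} bytes of GraphML")
```

Each generated description could then be handed to an emulator run, with end-to-end latency recorded per run. How stream2gym itself consumes such a file is not covered by the text above.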