Streaming Technologies and Serialization Protocols: Empirical Performance Analysis

Samuel Jackson, Nathan Cummings, Saiful Khan

TL;DR

This study uncovers significant performance differences and trade-offs between widely used data streaming technologies and serialization protocols, providing insights that can guide the selection of optimal streaming and serialization solutions for modern data-intensive applications.

Abstract

Efficient data streaming is essential for real-time data analytics, visualization, and machine learning model training, particularly when dealing with high-volume datasets. Various streaming technologies and serialization protocols have been developed to cater to different streaming requirements, each performing differently depending on the specific tasks and datasets involved. This variety makes it difficult to select the most appropriate combination, as encountered during the implementation of a streaming system for the MAST fusion device data or SKA's radio astronomy data. To address this challenge, we conducted an empirical study on widely used data streaming technologies and serialization protocols. We also developed an extensible, open-source software framework to benchmark their efficiency across various performance metrics. Our study uncovers significant performance differences and trade-offs between these technologies, providing valuable insights that can guide the selection of optimal streaming and serialization solutions for modern data-intensive applications. Our goal is to equip the scientific community and industry professionals with the knowledge needed to enhance data streaming efficiency for improved data utilization and real-time analysis.

Paper Structure (20 sections, 13 figures, 3 tables)


Figures (13)

  • Figure 1: Data flow from producer to consumer, indicating the points at which the performance metrics are recorded. The metrics are (1) $L_o$: object creation latency, (2) $T_o$: object creation throughput, (3) $C$: compression ratio, (4) $L_s$: serialization latency, (5) $L_d$: deserialization latency, (6) $T_s$: serialization throughput, (7) $T_d$: deserialization throughput, (8) $L_{trans}$: transmission latency, (9) $T_{trans}$: transmission throughput, (10) $L_{tot}$: total latency, and (11) $T_{tot}$: total throughput.
  • Figure 2: Architecture of our streaming framework. A Runner creates a Producer and Consumer pair for each type of streaming technology. Both the Producer and the Consumer are instantiated with a Marshaler that encodes data to the desired format (e.g. JSON or ProtoBuf). Producers are created with a data stream object that generates data samples for transmission. Depending on the streaming method, the Consumer and Producer may connect to an external message broker.
  • Figure 3: Object creation latency ($L_o$), measured in milliseconds (ms), for various data types (x-axis) and serialization methods (colored bars).
  • Figure 4: Object creation throughput ($T_o$), measured in megabytes per second (MB/s), for various data types (x-axis) and serialization methods (colored bars).
  • Figure 5: Compression ratio ($C$) for various data types (x-axis) and serialization methods (colored bars).
  • ...and 8 more figures
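
The captions of Figures 1 and 2 describe where latencies are recorded along the producer-to-consumer path and how a Marshaler sits on both ends. The following Python sketch illustrates that idea under stated assumptions and is not the authors' framework: `JsonMarshaler` and `run_once` are hypothetical names, an in-process `queue.Queue` stands in for a real transport or message broker, each latency is assumed to be the wall-clock time of its stage, each throughput is assumed to be payload bytes divided by that latency, and $L_{tot}$ is assumed here to span serialization through deserialization only.

```python
import json
import queue
import time


class JsonMarshaler:
    """Hypothetical marshaler: encodes a Python object to JSON bytes and back."""

    def encode(self, obj) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def decode(self, payload: bytes):
        return json.loads(payload.decode("utf-8"))


def _rate(nbytes: int, seconds: float) -> float:
    """Bytes per second; guards against a zero-length timing interval."""
    return nbytes / seconds if seconds > 0 else float("inf")


def run_once(marshaler, sample) -> dict:
    """Send one sample from a producer to a consumer and time each stage.

    Assumption: the queue is an in-process stand-in for a socket or broker,
    so L_trans here does not reflect real network cost.
    """
    channel = queue.Queue()

    # Producer side: serialize, then transmit.
    t0 = time.perf_counter()
    payload = marshaler.encode(sample)
    t1 = time.perf_counter()
    channel.put(payload)

    # Consumer side: receive, then deserialize.
    received = channel.get()
    t2 = time.perf_counter()
    decoded = marshaler.decode(received)
    t3 = time.perf_counter()

    size = len(payload)
    return {
        "decoded": decoded,
        "payload_bytes": size,
        "L_s": t1 - t0,      # serialization latency (seconds)
        "L_trans": t2 - t1,  # transmission latency (seconds)
        "L_d": t3 - t2,      # deserialization latency (seconds)
        "L_tot": t3 - t0,    # total latency under the assumption above
        "T_s": _rate(size, t1 - t0),
        "T_trans": _rate(size, t2 - t1),
        "T_d": _rate(size, t3 - t2),
        "T_tot": _rate(size, t3 - t0),
    }


if __name__ == "__main__":
    # Illustrative sample only; the field names are made up for this sketch.
    sample = {"shot": 30420, "signal": [0.5, 1.5, 2.5]}
    result = run_once(JsonMarshaler(), sample)
    for key in ("payload_bytes", "L_s", "L_trans", "L_d", "L_tot"):
        print(key, result[key])
```

Swapping in another marshaler with the same `encode`/`decode` pair would let the same timing code compare formats. A single run is noisy, so a real benchmark would repeat the measurement many times and aggregate the results.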