Formal Specification for Fast ACS: Low-Latency File-Based Ordered Message Delivery at Scale

Sushant Kumar Gupta, Anil Raghunath Iyer, Chang Yu, Neel Bagora, Olivier Pomerleau, Vivek Kumar, Prunthaban Kanthakumar

TL;DR

Fast ACS is a file-based, multi-layer storage system for low-latency, ordered message delivery at internet scale. It combines intra-cluster RMA reads with inter-cluster RPC transfers, replicates across clusters along a copy tree built as a minimum spanning tree (MST) with Prim's algorithm, and employs a two-tier cache (data and metadata) atop Colossus to achieve high throughput with low tail latency. The design is validated through large-scale experiments showing sub-second p99 delays and Tbps-scale bandwidth, along with extensive production experience in the Google Ads context. A complementary formal specification in TLA+ demonstrates safety and eventual progress for the dueling-writers cache model, underpinning the system's reliability guarantees.
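
As a rough illustration of the copy-tree idea, the sketch below runs Prim's algorithm over a small cluster graph and returns, for each cluster, the cluster it would copy from. The cluster names, the link costs, and the function name `prim_copy_tree` are all hypothetical and chosen only for this example; the summary does not say which cost metric Fast ACS uses.

```python
import heapq

def prim_copy_tree(source, cost):
    """Return {cluster: parent} for a minimum spanning tree rooted at source.

    cost maps each cluster to {neighbor: link cost}; the graph is assumed
    to be symmetric and connected.
    """
    parent = {source: None}
    # Heap entries are (link cost, cluster already in tree, candidate cluster).
    frontier = [(c, source, nbr) for nbr, c in cost[source].items()]
    heapq.heapify(frontier)
    while frontier and len(parent) < len(cost):
        _, src, dst = heapq.heappop(frontier)
        if dst in parent:
            continue  # already reached through a cheaper link
        parent[dst] = src
        for nbr, c in cost[dst].items():
            if nbr not in parent:
                heapq.heappush(frontier, (c, dst, nbr))
    return parent

# Hypothetical link costs (for example, round-trip times in ms); not from the paper.
COST = {
    "us-a": {"us-b": 10, "eu-a": 90, "ap-a": 150},
    "us-b": {"us-a": 10, "eu-a": 95, "ap-a": 140},
    "eu-a": {"us-a": 90, "us-b": 95, "ap-a": 200},
    "ap-a": {"us-a": 150, "us-b": 140, "eu-a": 200},
}
TREE = prim_copy_tree("us-a", COST)
```

In this toy graph the source `us-a` copies directly to `us-b` and `eu-a`, while `ap-a` is fed from `us-b` because that link is cheaper than the direct one from the source.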

Abstract

Low-latency message delivery is crucial for real-time systems. Data originating from a producer must be delivered to consumers, potentially distributed in clusters across metropolitan and continental boundaries. With the growing scale of computing, there can be several thousand consumers of the data. Such systems require a robust messaging system capable of transmitting messages across clusters and efficiently delivering them to consumers. The system must offer guarantees such as ordering and at-least-once delivery while avoiding overload on consumers, allowing them to consume messages at their own pace. This paper presents the design of Fast ACS (an abbreviation for Ads Copy Service), a file-based ordered message delivery system that leverages a combination of two-sided (inter-cluster) and one-sided (intra-cluster) communication primitives - namely, Remote Procedure Call and Remote Memory Access, respectively - to deliver messages. The system has been successfully deployed to dozens of production clusters and scales to accommodate several thousand consumers within each cluster, which amounts to Tbps-scale intra-cluster consumer traffic at peak. Notably, Fast ACS delivers messages to consumers across the globe at a p99 latency of a few seconds, or even under a second, depending on the message volume and consumer scale, at a low resource cost.

Paper Structure

This paper contains 16 sections, 14 figures, 1 algorithm.

Figures (14)

  • Figure 1: Results for Kafka consumer-scaling experiment. As consumers scaled, the consumer throughput went up to 14 Gbps, after which the performance degraded.
  • Figure 2: File-based ordered messaging for a real-time system. Each message stream can have multiple destination clusters based on its consumers. Message files are tail-copied from the source cluster's storage to the destination clusters' storage.
  • Figure 3: The multi-layer storage design. The producers/writers write the bytes to the Colossus file (1) and shadow the bytes as chunks to the data cache (2), and then update the length of the file in the metadata cache (3). The consumers/readers poll the metadata cache for the latest length of the file (4) and read the chunks from the data cache preferably (5), with Colossus acting as a fallback (5'). All the reads from caches happen over RMA. In this illustration, chunk 0 has expired while chunk 2 is only partially-filled. The length stored in the metadata cache points to the current position within this last, partially-filled chunk.
  • Figure 4: An MST for a message stream. The source cluster is in the US, and the stream is copied to three clusters within the US, two clusters in Europe, and four clusters in the Asia-Pacific region.
  • Figure 5: An illustration of relaxed reads followed by consistent reads. Suppose new bytes are written across multiple chunks, filling up the previously partially-filled chunk 0. The new chunk 1 is successfully written to all replicas, but the updated chunk 0 is only written to replicas 2 and 3. When a reader detects a new length from the metadata cache and performs a relaxed read for chunk 0 from replica 1, it will receive fewer bytes than expected. The reader will then perform a consistent read to obtain the complete chunk 0.
  • ...and 9 more figures
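
The reader path described in the captions for Figures 3 and 5 can be sketched as a toy in-memory model: take the file length published in the metadata cache, try a relaxed read from one data-cache replica, fall back to a consistent read if the chunk is shorter than that length implies, and fall back to the Colossus file if the chunk has expired. Everything below is an assumption made for illustration: the chunk size, the function names, and in particular the modelling of a consistent read as "consult every replica and keep the longest copy", which the captions do not specify.

```python
CHUNK_SIZE = 4  # hypothetical chunk size in bytes, kept tiny for illustration

def expected_len(file_len, chunk_idx):
    """Bytes that chunk_idx should hold, given the published file length."""
    return max(0, min(CHUNK_SIZE, file_len - chunk_idx * CHUNK_SIZE))

def relaxed_read(replicas, chunk_idx, preferred=0):
    """Read from a single replica; may return a stale (short) chunk or None."""
    return replicas[preferred].get(chunk_idx)

def consistent_read(replicas, chunk_idx):
    """Assumed model: consult every replica and keep the longest copy."""
    copies = [r[chunk_idx] for r in replicas if chunk_idx in r]
    return max(copies, key=len) if copies else None

def read_chunk(replicas, colossus, file_len, chunk_idx):
    """Relaxed read first, then consistent read, then the durable file."""
    want = expected_len(file_len, chunk_idx)
    data = relaxed_read(replicas, chunk_idx)
    if data is None or len(data) < want:
        data = consistent_read(replicas, chunk_idx)
    if data is None or len(data) < want:
        start = chunk_idx * CHUNK_SIZE
        data = colossus[start:start + want]  # fallback path (5') in Figure 3
    return data[:want]

# Scenario from Figure 5: replica 1 still holds the old, partial chunk 0.
COLOSSUS = b"abcdefg"
REPLICAS = [
    {0: b"ab", 1: b"efg"},    # replica 1: chunk 0 not yet updated
    {0: b"abcd", 1: b"efg"},  # replica 2
    {0: b"abcd", 1: b"efg"},  # replica 3
]
FILE_LEN = len(COLOSSUS)      # the length a reader would see in the metadata cache
CHUNK0 = read_chunk(REPLICAS, COLOSSUS, FILE_LEN, 0)
CHUNK1 = read_chunk(REPLICAS, COLOSSUS, FILE_LEN, 1)
EXPIRED = read_chunk([{}, {}, {}], COLOSSUS, FILE_LEN, 0)
```

Here the relaxed read of chunk 0 returns only two bytes, so the reader escalates to a consistent read and obtains the full chunk; chunk 1 is served by the relaxed read alone, and a chunk missing from every replica is read from the file.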