The Cost of Garbage Collection for State Machine Replication
Zhiying Liang, Vahab Jabrayilov, Aleksey Charapko, Abutalib Aghayev
TL;DR
The paper tackles the problem of GC overhead in cloud-based state machine replication by building Replicant, a MultiPaxos-based in-memory KV store, in C++, Java, Rust, and Go and evaluating it under realistic AWS/YCSB workloads. It combines architectural design, implementation choices, and extensive experiments to quantify how GC affects throughput and tail latency, especially under memory and vCPU constraints. The key contributions are (1) a modular Replicant design implemented across four languages, (2) a systematic comparison of GC vs manual memory management in a production-like SMR system, and (3) practical insights into how language choice impacts long-term cloud costs and performance under cloud virtualization. The findings indicate that manual memory management can yield 1–2 orders of magnitude throughput benefits in tight tail-latency regimes and that cloud systems built with GC risk higher cloud costs and degraded tail latency at scale, guiding decisions for cloud system design and language selection.
Abstract
State Machine Replication (SMR) protocols form the backbone of many distributed systems. Enterprises and startups increasingly build their distributed systems on the cloud due to its many advantages, such as scalability and cost-effectiveness. One of the first technical questions companies face when building a system on the cloud is which programming language to use. Among many factors that go into this decision is whether to use a language with garbage collection (GC), such as Java or Go, or a language with manual memory management, such as C++ or Rust. Today, companies predominantly prefer languages with GC, like Go, Kotlin, or even Python, due to ease of development; however, there is no free lunch: GC costs resources (memory and CPU) and performance (long tail latencies due to GC pauses). While there have been anecdotal reports of reduced cloud cost and improved tail latencies when switching from a language with GC to a language with manual memory management, so far, there has not been a systematic study of the GC overhead of running an SMR-based cloud system. This paper studies the overhead of running an SMR-based cloud system written in a language with GC. To this end, we design from scratch a canonical SMR system -- a MultiPaxos-based replicated in-memory key-value store -- and we implement it in C++, Java, Rust, and Go. We compare the performance and resource usage of these implementations when running on the cloud under different workloads and resource constraints and report our results. Our findings have implications for the design of cloud systems.
