SwitchDelta: Asynchronous Metadata Updating for Distributed Storage with In-Network Data Visibility
Junru Li, Qing Wang, Zhe Yang, Shuo Liu, Jiwu Shu, Youyou Lu
TL;DR
Distributed storage systems that separate data from metadata must perform ordered writes to ensure linearizability, which adds latency and limits throughput. SwitchDelta introduces an in-switch data visibility layer that buffers in-flight metadata updates, so data becomes visible immediately after the data write phase while the updates are applied to metadata nodes asynchronously. The design combines timestamp-based concurrency control, hash-based in-switch indexing, partial-write support, and batching, and it handles failures, including switch failures. It is evaluated on a log-structured KV store, a distributed file system, and a distributed secondary index. Under write-heavy workloads, the evaluation shows median write latency reductions of up to 52.4% and throughput improvements of up to 126.9%. The work also discusses practical deployment in ToR-based data centers.
Abstract
Distributed storage systems typically maintain strong consistency between data nodes and metadata nodes by adopting ordered writes: 1) first installing data; 2) then updating metadata to make data visible. We propose SwitchDelta to accelerate ordered writes by moving metadata updates out of the critical path. It buffers in-flight metadata updates in programmable switches to enable data visibility in the network while retaining strong consistency. SwitchDelta uses a best-effort data plane design to overcome the resource limitations of switches and introduces a novel metadata update protocol to exploit the benefits of in-network data visibility. We evaluate SwitchDelta in three distributed in-memory storage systems: log-structured key-value stores, file systems, and secondary indexes. The evaluation shows that SwitchDelta reduces the latency of write operations by up to 52.4% and boosts throughput by up to 126.9% under write-heavy workloads.
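
The write path described above can be illustrated with a toy, single-process sketch. It is not the paper's protocol or data plane: the class names (`ToySwitch`, `ToyCluster`), the logical clock, the capacity bound, and the synchronous fallback when the buffer is full are all assumptions made for illustration. It only shows the core idea that a write becomes readable once its metadata update is buffered "in the switch", before the metadata node has applied it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Entry = Tuple[int, int]  # (timestamp, address in the data log)


@dataclass
class ToySwitch:
    """Hypothetical bounded buffer of in-flight metadata updates."""
    capacity: int
    buffer: Dict[str, Entry] = field(default_factory=dict)

    def try_buffer(self, key: str, ts: int, addr: int) -> bool:
        # Best effort (assumed policy): refuse a new key when the buffer is full.
        if key not in self.buffer and len(self.buffer) >= self.capacity:
            return False
        cur = self.buffer.get(key)
        if cur is None or ts > cur[0]:
            self.buffer[key] = (ts, addr)  # keep only the newest update per key
        return True

    def clear(self, key: str, ts: int) -> None:
        # Drop the entry only if it is the update that was just applied.
        cur = self.buffer.get(key)
        if cur is not None and cur[0] == ts:
            del self.buffer[key]


class ToyCluster:
    def __init__(self, switch_capacity: int) -> None:
        self.data_node: List[str] = []             # append-only data log
        self.metadata_node: Dict[str, Entry] = {}  # key -> (ts, addr)
        self.switch = ToySwitch(switch_capacity)
        self.pending: List[Tuple[str, int, int]] = []
        self.clock = 0                             # assumed logical timestamp source

    def _apply(self, key: str, ts: int, addr: int) -> None:
        cur = self.metadata_node.get(key)
        if cur is None or ts > cur[0]:
            self.metadata_node[key] = (ts, addr)

    def write(self, key: str, value: str) -> str:
        self.clock += 1
        ts = self.clock
        self.data_node.append(value)               # step 1: install data
        addr = len(self.data_node) - 1
        if self.switch.try_buffer(key, ts, addr):  # visible via the switch
            self.pending.append((key, ts, addr))   # step 2 deferred
            return "async"
        self._apply(key, ts, addr)                 # fallback: ordered write
        return "sync"

    def flush(self) -> None:
        # Asynchronously apply deferred metadata updates, then free switch entries.
        for key, ts, addr in self.pending:
            self._apply(key, ts, addr)
            self.switch.clear(key, ts)
        self.pending.clear()

    def read(self, key: str) -> Optional[str]:
        # Buffered (newer) metadata takes precedence over the metadata node.
        entry = self.switch.buffer.get(key) or self.metadata_node.get(key)
        return None if entry is None else self.data_node[entry[1]]
```

With a capacity of one, a first write to key `a` is acknowledged on the asynchronous path and is readable before `flush()` runs, while a write to a second key falls back to the ordinary ordered write because the toy buffer is full.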
