SLOFetch: Compressed-Hierarchical Instruction Prefetching for Cloud Microservices
Zerui Bao, Di Zhu, Liu Jiang, Shiqi Sheng, Ziwei Wang, Haoyun Zhang
TL;DR
This work addresses tail latency in cloud microservices, where large instruction footprints cause frontend stalls, by rethinking instruction prefetching under tight on-chip budgets. It introduces a compressed 36-bit destination entry and a hierarchical metadata storage scheme, which together allow dense attachment of metadata to the L1 while virtualizing bulk state into lower cache levels, and couples them with an online ML controller that uses a contextual bandit to adaptively trigger prefetches. In trace-driven ZSim evaluations on realistic service mixes, the approach (CEIP/CHEIP) achieves EIP-like speedups with significantly smaller on-chip state, improved accuracy, lower MPKI, and controlled bandwidth usage. The paper also outlines deployment guidelines (shadow mode, guarded canaries, ramp) and discusses security, privacy, and practical integration concerns for data-center and edge environments where SLOs are critical and silicon budgets are constrained. Overall, it contributes an ML-guided, SLO-aligned prefetching solution that combines hardware-friendly metadata compression, hierarchical placement, and adaptive decision making to reduce frontend latency. The trade-offs between latency gains, miss reductions, bandwidth, and evictions are formalized by the utility $U = \alpha \cdot \Delta \text{P95}^{-} + \beta \cdot \Delta \text{MPKI}^{-} - \gamma \cdot \text{BW}^{+} - \delta \cdot \text{Evict}^{+}$.
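As a minimal sketch, the utility $U$ can be evaluated as a weighted sum. The reading of the four terms (reductions in P95 latency and MPKI as gains, increases in bandwidth and evictions as costs) follows the description above. The names `UtilityWeights` and `prefetch_utility`, the default weight values, and the units of the inputs are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtilityWeights:
    """Hypothetical weights; the summary does not give concrete values."""
    alpha: float = 1.0   # weight on P95 latency reduction
    beta: float = 0.5    # weight on MPKI reduction
    gamma: float = 0.25  # penalty on extra bandwidth
    delta: float = 0.25  # penalty on extra evictions


def prefetch_utility(p95_reduction: float,
                     mpki_reduction: float,
                     bw_increase: float,
                     evict_increase: float,
                     w: UtilityWeights = UtilityWeights()) -> float:
    """U = alpha*dP95^- + beta*dMPKI^- - gamma*BW^+ - delta*Evict^+."""
    return (w.alpha * p95_reduction
            + w.beta * mpki_reduction
            - w.gamma * bw_increase
            - w.delta * evict_increase)


if __name__ == "__main__":
    # Gains of 2.0 (P95) and 1.0 (MPKI) against a bandwidth cost of 4.0.
    print(prefetch_utility(2.0, 1.0, 4.0, 0.0))
```

A positive value means the latency and miss gains outweigh the bandwidth and eviction costs under the chosen weights. How the weights are set in the actual system is not stated here.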
Abstract
Large-scale networked services rely on deep software stacks and microservice orchestration, which increase instruction footprints and create frontend stalls that inflate tail latency and energy. We revisit instruction prefetching for these cloud workloads and present a design that aligns with SLO-driven and self-optimizing systems. Building on the Entangling Instruction Prefetcher (EIP), we introduce a Compressed Entry that captures up to eight destinations around a base using 36 bits by exploiting spatial clustering, and a Hierarchical Metadata Storage scheme that keeps only L1-resident and frequently queried entries on chip while virtualizing bulk metadata into lower levels. We further add a lightweight Online ML Controller that scores prefetch profitability using context features and a bandit-adjusted threshold. On data center applications, our approach preserves EIP-like speedups with smaller on-chip state and improves efficiency for networked services in the ML era.
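The idea of packing up to eight spatially clustered destinations into one 36-bit entry can be illustrated with a small sketch. The abstract does not give the field layout, so the split used here (a 28-bit base cache-line index plus an 8-bit presence bitmap covering lines `base` through `base + 7`) is purely an assumption for illustration, as are the names `pack_entry` and `unpack_entry`; the paper's actual encoding may differ.

```python
BASE_BITS = 28                       # assumed width of the base line index
MASK_BITS = 8                        # one presence bit per nearby destination
ENTRY_BITS = BASE_BITS + MASK_BITS   # 36 bits in total


def pack_entry(base_line: int, dest_lines: list[int]) -> int:
    """Pack destinations within [base_line, base_line + 8) into one integer.

    Destinations outside the window are dropped in this sketch; a real
    design would need some fallback for them.
    """
    if not 0 <= base_line < (1 << BASE_BITS):
        raise ValueError("base_line does not fit in BASE_BITS")
    mask = 0
    for dest in dest_lines:
        offset = dest - base_line
        if 0 <= offset < MASK_BITS:
            mask |= 1 << offset
    return (base_line << MASK_BITS) | mask


def unpack_entry(entry: int) -> list[int]:
    """Recover the destination line indices encoded in a packed entry."""
    base_line = entry >> MASK_BITS
    mask = entry & ((1 << MASK_BITS) - 1)
    return [base_line + i for i in range(MASK_BITS) if (mask >> i) & 1]


if __name__ == "__main__":
    entry = pack_entry(0x1000, [0x1000, 0x1003, 0x1007, 0x2000])
    print(hex(entry), [hex(d) for d in unpack_entry(entry)])
```

In this sketch the far-away destination `0x2000` is lost, which shows why such compression depends on destinations clustering spatially around the base.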
