TRACE: Unlocking Effective CXL Bandwidth via Lossless Compression and Precision Scaling

Rui Xie; Asad Ul Haq; Yunhua Fang; Linsen Ma; Zirak Burzin Engineer; Liu Liu; Tong Zhang

TL;DR

TRACE tackles the memory-bandwidth bottleneck in LLM inference by changing the CXL device's internal tensor layout rather than the software stack. It introduces bit-plane disaggregation and KV-specific preprocessing to enable lossless, structure-aware compression and elastic precision fetch, while preserving the unmodified CXL.mem interface. On BF16 tensors, it losslessly reduces KV footprint by up to $46.9\%$ and weight footprint by $25.2\%$. In trace-driven modeling, throughput for GPT-OSS-120B-MXFP4 improves $4.24\times$ at 128k tokens, and DRAM access energy drops by up to $40.3\%$. The hardware overhead, measured against a CXL controller with generic inline lossless compression, is $+7.2\%$ area and $+4.7\%$ power. Together, these results indicate that structure-aware bit-plane storage with precision-graded access can raise end-to-end performance in CXL-backed LLM deployments once traffic shifts to the CXL tier.

Abstract

LLM inference is increasingly limited by memory bandwidth, and the bottleneck worsens at long context as the KV cache grows. CXL memory adds capacity to offload weights and KV, but its link and device-side DDR bandwidth are far below HBM, so decoding stalls once traffic shifts to the CXL tier. Many CXL controllers are starting to add generic lossless compression, yet applying commodity codecs directly to standard word-major LLM tensors is largely ineffective, especially for token-major KV streams. We propose TRACE (Traffic-Reduced Architecture for Compression and Elasticity), which preserves the unmodified CXL.mem interface but changes the device-internal representation. It stores tensors in a channel-major, disaggregated bit-plane layout, and applies a KV-specific transform before compression, converting mixed-field words into low-entropy plane streams that commodity codecs can compress. The same substrate enables precision-proportional fetch by reading only the required bit-planes. Across public LLMs, TRACE reduces BF16 weight footprint by $25.2\%$ and BF16 KV footprint by $46.9\%$ losslessly, with per-layer KV ratios peaking at $2.69\times$. In trace-driven system modeling, once KV spills to CXL, GPT-OSS-120B-MXFP4 improves throughput at 128k tokens from 16.28 to 68.99 tok/s ($4.24\times$). DRAMSim3 shows up to $40.3\%$ lower DRAM access energy under plane-aligned fetch. A 7 nm SystemVerilog implementation sustains 256 GB/s device bandwidth. Relative to a CXL controller with generic inline lossless compression, TRACE adds only $7.2\%$ area, $4.7\%$ power, and $6.0\%$ load-to-use latency at 2 GHz and 0.7 V.
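The bit-plane idea in the abstract can be illustrated with a minimal sketch. It is not the TRACE hardware pipeline: it omits the channel-major ordering and the KV-specific transform, uses `zlib` as a stand-in for the device codec, and runs on synthetic BF16 words that share one sign and exponent with a random 7-bit mantissa. The helper names `to_planes` and `from_planes` are hypothetical.

```python
import random
import zlib

WIDTH = 16  # BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits


def to_planes(words, width=WIDTH):
    """Split words into `width` bit-planes, most significant plane first.

    Plane p holds bit (width - 1 - p) of every word, packed 8 bits per byte.
    """
    planes = []
    for bit in range(width - 1, -1, -1):
        buf = bytearray((len(words) + 7) // 8)
        for i, w in enumerate(words):
            if (w >> bit) & 1:
                buf[i // 8] |= 0x80 >> (i % 8)
        planes.append(bytes(buf))
    return planes


def from_planes(planes, n, width=WIDTH):
    """Rebuild n words from the leading planes; planes not supplied read as zero."""
    words = [0] * n
    for p, buf in enumerate(planes):
        bit = width - 1 - p
        for i in range(n):
            if buf[i // 8] & (0x80 >> (i % 8)):
                words[i] |= 1 << bit
    return words


# Synthetic data (an assumption, not measured tensors): values in [1, 2),
# i.e. sign 0, exponent 127, and a uniformly random 7-bit mantissa.
rng = random.Random(0)
N = 4096
words = [0x3F80 | rng.getrandbits(7) for _ in range(N)]

word_major = b"".join(w.to_bytes(2, "big") for w in words)  # conventional layout
planes = to_planes(words)
plane_major = b"".join(planes)                               # disaggregated layout

word_size = len(zlib.compress(word_major))
plane_size = len(zlib.compress(plane_major))
print(f"raw {len(word_major)} B | word-major {word_size} B | plane-major {plane_size} B")

restored = from_planes(planes, N)     # all 16 planes: exact reconstruction
coarse = from_planes(planes[:9], N)   # sign + exponent planes only
```

In this toy setting the sign and exponent planes are constant, so a byte-oriented codec removes them almost entirely, while the same bits interleaved with mantissa bits in word-major order are harder for it to exploit. Reading only `planes[:9]` models precision-proportional fetch: fewer planes are read and the values come back with the mantissa truncated to zero.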
