TRACE: Unlocking Effective CXL Bandwidth via Lossless Compression and Precision Scaling

Rui Xie, Asad Ul Haq, Yunhua Fang, Linsen Ma, Zirak Burzin Engineer, Liu Liu, Tong Zhang

TL;DR

TRACE tackles the memory-bandwidth bottleneck in LLM inference by rethinking the CXL device's internal tensor layout rather than altering software. It introduces bit-plane disaggregation and KV-specific preprocessing to enable lossless, structure-aware compression and elastic precision fetch, while preserving the unmodified CXL.mem interface. The approach yields substantial gains: up to $46.9\%$ lossless KV footprint reduction and $25.2\%$ weight footprint reduction on BF16, with trace-driven throughput boosts (e.g., $4.24\times$ at 128k tokens for GPT-OSS-120B-MXFP4) and DRAM energy savings up to $40.3\%$, accompanied by modest hardware overhead ($+7.2\%$ area, $+4.7\%$ power). Collectively, TRACE demonstrates that structure-aware, bit-plane storage with precision-graded access can dramatically improve end-to-end performance in CXL-backed LLM deployments.
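As a reading aid, the footprint reductions above can be restated as compression ratios. This is a sketch under the assumption that each percentage $r$ is the fraction of the original footprint removed; the symbols $r$ and $\rho$ are introduced here for illustration and are not taken from the paper.

```latex
\[
\rho = \frac{1}{1 - r}, \qquad
\rho_{\mathrm{KV}} = \frac{1}{1 - 0.469} \approx 1.88\times, \qquad
\rho_{\mathrm{weights}} = \frac{1}{1 - 0.252} \approx 1.34\times
\]
```

Under the same assumption, the stored data shrinks to $1 - r$ of its original size: about $53.1\%$ for BF16 KV and $74.8\%$ for BF16 weights.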

Abstract

LLM inference is increasingly limited by memory bandwidth, and the bottleneck worsens at long context as the KV cache grows. CXL memory adds capacity to offload weights and KV, but its link and device-side DDR bandwidth are far below HBM, so decoding stalls once traffic shifts to the CXL tier. Many CXL controllers are starting to add generic lossless compression, yet applying commodity codecs directly to standard word-major LLM tensors is largely ineffective, especially for token-major KV streams. We propose TRACE (Traffic-Reduced Architecture for Compression and Elasticity), which preserves the unmodified CXL.mem interface but changes the device-internal representation. It stores tensors in a channel-major, disaggregated bit-plane layout, and applies a KV-specific transform before compression, converting mixed-field words into low-entropy plane streams that commodity codecs can compress. The same substrate enables precision-proportional fetch by reading only the required bit-planes. Across public LLMs, TRACE reduces BF16 weight footprint by $25.2\%$ and BF16 KV footprint by $46.9\%$ losslessly, with per-layer KV ratios peaking at $2.69\times$. In trace-driven system modeling, once KV spills to CXL, GPT-OSS-120B-MXFP4 improves throughput at 128k tokens from 16.28 to 68.99 tok/s ($4.24\times$). DRAMSim3 shows up to $40.3\%$ lower DRAM access energy under plane-aligned fetch. A 7 nm SystemVerilog implementation sustains 256 GB/s device bandwidth. Relative to a CXL controller with generic inline lossless compression, TRACE adds only $7.2\%$ area, $4.7\%$ power, and $6.0\%$ load-to-use latency at 2 GHz and 0.7 V.
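The bit-plane idea in the abstract can be illustrated with a small sketch. It is not the TRACE implementation: it omits the KV-specific transform and the channel-major reordering, uses `zlib` as a stand-in for a commodity codec, and all function names (`to_bf16`, `to_planes`, `from_planes`, `demo`) are hypothetical. It shows only how 16-bit words can be split into per-bit planes, rebuilt losslessly, and partially rebuilt from the most significant planes.

```python
import math
import struct
import zlib


def to_bf16(x):
    """Return the BF16 bit pattern of x, taken as the upper 16 bits of float32."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def from_bf16(bits):
    """Expand a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]


def to_planes(words, nbits=16):
    """Split words into nbits bit-planes; plane 0 holds the most significant bit."""
    planes = []
    for p in range(nbits):
        shift = nbits - 1 - p
        buf = bytearray((len(words) + 7) // 8)
        for i, w in enumerate(words):
            if (w >> shift) & 1:
                buf[i >> 3] |= 0x80 >> (i & 7)
        planes.append(bytes(buf))
    return planes


def from_planes(planes, count, nbits=16):
    """Rebuild words from the leading planes; bits of missing planes stay zero."""
    words = [0] * count
    for p, buf in enumerate(planes):
        shift = nbits - 1 - p
        for i in range(count):
            if buf[i >> 3] & (0x80 >> (i & 7)):
                words[i] |= 1 << shift
    return words


def demo():
    # Synthetic, smoothly varying values (an assumption, not real KV data).
    words = [to_bf16(1.5 + 0.4 * math.sin(i / 200.0)) for i in range(4096)]
    word_major = b"".join(struct.pack(">H", w) for w in words)
    planes = to_planes(words)
    print("word-major compressed bytes:", len(zlib.compress(word_major)))
    print("bit-plane compressed bytes: ", sum(len(zlib.compress(p)) for p in planes))
    # Reading only 12 of 16 planes keeps sign, exponent, and 3 mantissa bits.
    partial = from_planes(planes[:12], len(words))
    print("first value, full vs 12 planes:", from_bf16(words[0]), from_bf16(partial[0]))


if __name__ == "__main__":
    demo()
```

Reading fewer planes moves proportionally fewer bytes, which is the intuition behind precision-proportional fetch; the actual savings reported in the paper depend on its device layout and codecs, which this sketch does not model.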

Paper Structure

This paper contains 18 sections, 8 equations, 23 figures, 5 tables.

Figures (23)

  • Figure 1: The bandwidth and capacity gap in CXL memory. (A) Standard devices enforce a rigid, word-based layout. This creates two inefficiencies: capacity waste (high entropy prevents compression) and bandwidth waste (fetching full words even when low precision is needed). (B) TRACE transforms the internal representation. By restructuring data into an elastic layout, it enables structural compression (saving KV capacity) and precision-proportional fetching (saving weight bandwidth), amplifying the effective capability of the CXL tier.
  • Figure 2: Activations are structurally smoother along channels. Visualization of Llama-3-8B KV cache on Booksum (layer 0, head 6) shows that values change less rapidly along the channel dimension (y-axis) than across tokens (x-axis), indicating latent structure that a token-major byte stream can obscure.
  • Figure 3: Dynamic weight precision with MoDE: a block-level cap (Router 1) and per-expert assignments (Router 2).
  • Figure 4: MoDE per-expert precision vs. prune-only on LLaMA-MoE-3.5B for PIQA [bisk2020piqa], WinoGrande [sakaguchi2021winogrande], LAMBADA [paperno2016lambada], and MMLU [hendrycks2020measuring].
  • Figure 5: Perplexity vs average bits/weight on OPT. Per-head/per-neuron precision control outperforms static-uniform at the same bits. Lower is better.
  • ...and 18 more figures