Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression

Boris Kriuk, Logic Ng

TL;DR

This work tackles the bandwidth bottleneck in multi-agent LLM systems by shifting inter-agent communication from raw text to compressed KV caches. It introduces Q-KVComm, a protocol that combines adaptive layer-wise quantization, hybrid information extraction, and cross-model calibration to transmit semantically rich caches with 5-6x compression and minimal quality loss. The approach maintains robust performance across heterogeneous architectures and small-to-medium LLMs, and includes production-ready features like LRU caching and bit-packed serialization for edge deployments. The results demonstrate practical viability for bandwidth-constrained environments and pave the way for scalable, representation-based collaboration among multiple LLM agents.

Abstract

Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents. Q-KVComm combines three key innovations: (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross-architecture communication. Extensive experiments across three diverse question-answering datasets demonstrate that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B-1.5B parameters) and adapts to real-world applications including conversational QA and multi-hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.

Paper Structure

This paper contains 16 sections, 11 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of Q-KVComm Architecture.
  • Figure 2: Q-KVComm pipeline architecture with four main stages: layer selection using hybrid attention-Gaussian scoring, information extraction via YAKE and NER, adaptive quantization, and cross-model calibration before transmission.
  • Figure 3: Adaptive layer-wise quantization based on sensitivity $E_l$. Highly sensitive layers receive 8 bits, moderately sensitive layers receive 6 bits, and low-sensitivity layers receive 4 bits, for an average of 6 bits per layer.
  • Figure 4: Performance trade-offs across quantization bit-widths. Left: Compression ratios showing consistent behavior across all datasets: $6.93\times$ at 4-bit (maximum compression), $5.68\times$ at 6-bit (balanced), and $5.06\times$ at 8-bit (speed-optimized). The uniform compression across datasets validates our adaptive layer selection strategy. Right: Inference time comparison revealing the computational cost of aggressive quantization (times listed in 4-bit, 6-bit, 8-bit order). HotpotQA (red circles) requires the longest processing due to multi-hop reasoning (20.36s, 22.13s, 13.02s), NarrativeQA (green triangles) shows moderate times for narrative understanding (11.71s, 13.18s, 7.91s), and SQuAD (blue squares) achieves the fastest inference for extractive QA (9.65s, 9.84s, 7.47s). Note the counter-intuitive pattern: 6-bit is slower than 4-bit, reflecting quantization algorithm overhead, while 8-bit achieves the best throughput.
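
The tiered allocation described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes per-layer sensitivity scores are already available; the function names, the min-max normalisation, the thresholds `high` and `low`, and the example scores are all hypothetical and not taken from the paper, and plain uniform min-max quantization stands in for whatever quantizer Q-KVComm actually uses.

```python
from typing import List, Sequence, Tuple


def allocate_bits(sensitivity: Sequence[float],
                  high: float = 0.66, low: float = 0.33) -> List[int]:
    """Map per-layer sensitivity scores to 8/6/4-bit tiers.

    Scores are min-max normalised to [0, 1] first. The thresholds are
    illustrative assumptions, not values from the paper.
    """
    s_min, s_max = min(sensitivity), max(sensitivity)
    span = (s_max - s_min) or 1.0  # guard against all-equal scores
    bits = []
    for s in sensitivity:
        norm = (s - s_min) / span
        if norm >= high:
            bits.append(8)   # highly sensitive layer
        elif norm >= low:
            bits.append(6)   # moderately sensitive layer
        else:
            bits.append(4)   # low-sensitivity layer
    return bits


def quantize(values: Sequence[float],
             n_bits: int) -> Tuple[List[int], float, float]:
    """Uniform min-max quantization to integer codes in [0, 2**n_bits - 1]."""
    v_min, v_max = min(values), max(values)
    levels = (1 << n_bits) - 1
    scale = ((v_max - v_min) / levels) or 1.0  # constant input -> scale 1.0
    codes = [min(levels, max(0, round((v - v_min) / scale))) for v in values]
    return codes, scale, v_min


def dequantize(codes: Sequence[int], scale: float, v_min: float) -> List[float]:
    """Invert quantize() up to rounding error (at most scale / 2 per value)."""
    return [c * scale + v_min for c in codes]


# Hypothetical sensitivity scores for a six-layer model.
scores = [0.9, 0.5, 0.1, 0.2, 0.6, 0.95]
layer_bits = allocate_bits(scores)
print(layer_bits, sum(layer_bits) / len(layer_bits))
```

With these made-up scores the layers fall into two 8-bit, two 6-bit, and two 4-bit tiers, so the mean bit-width is 6, in line with the average quoted in the caption. Real sensitivity profiles would generally not split this evenly, so the average bit-width depends on where the thresholds are placed.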