Residual Vector Quantization For Communication-Efficient Multi-Agent Perception

Dereje Shenkut, B. V. K Vijaya Kumar

TL;DR

The paper tackles the bandwidth bottleneck in multi-agent collaborative perception by introducing ReVQom, a learned feature codec that preserves spatial geometry while compressing BEV features. It does so with a simple channel-reducing bottleneck followed by multi-stage residual vector quantization using shared codebooks, transmitting only per-pixel indices at a rate of $R = n_q \log_2 K$ bits per location. Empirical results on DAIR-V2X and OPV2V show large compression ratios (up to $1365\times$) with competitive detection performance and graceful degradation at ultra-low bitrates, a step toward practical V2X deployment. This work demonstrates that aggressive, index-based compression can maintain BEV fusion quality and scalability in real-world CP scenarios.

Abstract

Multi-agent collaborative perception (CP) improves scene understanding by sharing information across connected agents such as autonomous vehicles, unmanned aerial vehicles, and robots. Communication bandwidth, however, constrains scalability. We present ReVQom, a learned feature codec that preserves spatial identity while compressing intermediate features. ReVQom is an end-to-end method that compresses feature dimensions via a simple bottleneck network followed by multi-stage residual vector quantization (RVQ). This allows only per-pixel code indices to be transmitted, reducing payloads from 8192 bits per pixel (bpp) of uncompressed 32-bit float features to 6-30 bpp per agent with minimal accuracy loss. On the real-world DAIR-V2X CP dataset, ReVQom achieves compression ranging from 273x at 30 bpp to 1365x at 6 bpp. At 18 bpp (455x), ReVQom matches or outperforms raw-feature CP, and at 6-12 bpp it enables ultra-low-bandwidth operation with graceful degradation. ReVQom enables efficient and accurate multi-agent collaborative perception and is a step toward practical V2X deployment.
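
The compression figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that the 8192 bpp baseline corresponds to 256 channels of 32-bit floats (8192 / 32 = 256) and that reported ratios are truncated to integers. The helper names and the $(n_q, K)$ pair used to reach 6 bpp are illustrative assumptions, not values taken from the paper.

```python
import math

RAW_CHANNELS = 256       # assumption: 8192 bpp / 32 bits per float
BITS_PER_FLOAT = 32


def bits_per_pixel(n_q: int, codebook_size: int) -> float:
    """Rate per spatial location, R = n_q * log2(K)."""
    return n_q * math.log2(codebook_size)


def compression_ratio(bpp: float) -> float:
    """Raw feature bits per pixel divided by transmitted bits per pixel."""
    return RAW_CHANNELS * BITS_PER_FLOAT / bpp


# One hypothetical configuration that yields 6 bpp: 3 stages, K = 4.
example_bpp = bits_per_pixel(n_q=3, codebook_size=4)

# Ratios for the bitrates named in the abstract, truncated to integers.
ratios = {bpp: int(compression_ratio(bpp)) for bpp in (6, 18, 30)}
print(example_bpp, ratios)
```

Many $(n_q, K)$ pairs give the same rate, so the bitrate alone does not identify the configuration used for a given result.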

Paper Structure

This paper contains 9 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Each agent first extracts BEV features with a sparse voxel encoder ($\phi_\theta$), then applies a $1 \times 1$ bottleneck ($f_c$) and $n_q$-stage residual vector quantization (RVQ) to produce per-pixel code indices. Only the indices are transmitted ($\approx HW n_q \log_2 K$ bits) rather than the full features ($32 \times C \times H \times W$ bits), where $C$, $H$, and $W$ are the number of channels, height, and width of the feature map and $K$ is the codebook size per stage. With a shared codebook, the receiver first decodes the indices and then reconstructs the full feature $\hat{F}$ with the decompressor ($Decomp$). $F_c$ and $F_e$ denote raw BEV features from ego agent and collaborator agent respectively, with $\hat{F}_c$ and $\hat{F}_e$ being their reconstructed versions, and $F_f$ the final fused features. This enables integer-index-based communication between agents, providing compression without overly compromising spatial fidelity.
  • Figure 2: Detection results showing improvement with increasing bit rate up to the optimal codebook size. The top row shows the 3D view from the ego vehicle's forward perspective. The bottom row shows the bird's-eye view of the same scene, with the ego vehicle at the center traveling left to right. Green and red boxes represent ground truth and predictions respectively. As few as 6 bits (a) work reasonably well, while higher bit rates improve bounding-box alignment between the ground truth and predictions.
  • Figure 3: Learned ReVQom codebook assignment visualization ($K=4$ and first quantizer shown for clarity) revealing feature sparsity patterns. The first and last 64 channels show high spatial sparsity with concentrated activations around roads and objects. Codebook assignment (3rd column) demonstrates that Code 0 (background) dominates 96-98% of spatial locations for both vehicle and infrastructure agents, while Codes 1-3 efficiently encode semantic foreground regions. This reveals both spatial sparsity (most pixels are background) and channel-wise redundancy across feature maps, enabling ReVQom's aggressive compression while preserving spatial structure for accurate fusion.
  • Figure 4: Ablation study results for number of quantization stages ($N_q$), channel reduction rate ($C_{rr}$) in logarithmic scale, and EMA decay rate ($\alpha$). $AP@0.3$ and $AP@0.5$ are shown in blue and red respectively.
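
The index-only transmission described in the Figure 1 caption can be illustrated with a minimal residual vector quantization sketch. This is not the authors' implementation: the codebooks here are random rather than learned, the bottleneck and decompressor networks are omitted, and all sizes (`H`, `W`, `D`, `N_Q`, `K`) and function names are assumptions chosen for illustration.

```python
import math
import random


def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage codes the residual left by the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        k = nearest(residual, codebook)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


def rvq_decode(indices, codebooks):
    """Rebuild the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Toy setup: a 4x5 map of 8-dimensional bottleneck features (illustrative sizes).
random.seed(0)
H, W, D = 4, 5, 8
N_Q, K = 3, 4
codebooks = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
             for _ in range(N_Q)]
features = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)]
            for _ in range(H)]

# Sender: one list of N_Q integer indices per spatial location.
index_map = [[rvq_encode(features[i][j], codebooks) for j in range(W)]
             for i in range(H)]

# Receiver: holds the same codebooks and reconstructs from indices alone.
reconstructed = [[rvq_decode(index_map[i][j], codebooks) for j in range(W)]
                 for i in range(H)]

payload_bits = H * W * N_Q * math.log2(K)  # H * W * n_q * log2(K)
print(payload_bits)
```

Because the spatial grid of indices has the same height and width as the feature map, the receiver can reconstruct features at their original locations, which is what allows fusion to proceed on the decoded map.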