GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space

Wentao Wang, Haoran Xu, Guang Tan

Abstract

In autonomous driving, multi-agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose GT-Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT-Space constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real-world dataset (RCooper) demonstrate that GT-Space consistently outperforms baselines in detection accuracy while delivering robust performance. Our code will be released at https://github.com/KingScar/GT-Space.
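The scalability argument in the abstract can be illustrated with a simple count. The sketch below assumes that pairwise alignment needs one interpreter per ordered pair of distinct agent types, while a shared feature space needs one adapter per agent type. Both function names are hypothetical and the counting model is an illustrative assumption, not a figure reported in the paper.

```python
def pairwise_interpreters(num_agent_types: int) -> int:
    """Modules needed if each agent type translates features from every other type.

    Assumption: one interpreter per ordered pair of distinct agent types.
    """
    return num_agent_types * (num_agent_types - 1)


def shared_space_adapters(num_agent_types: int) -> int:
    """Modules needed if each agent type projects once into a common space."""
    return num_agent_types


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, pairwise_interpreters(n), shared_space_adapters(n))
```

Under this assumption the pairwise count grows quadratically with the number of agent types, whereas the adapter count grows linearly.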

Paper Structure (20 sections, 12 equations, 10 figures, 13 tables)


Figures (10)

  • Figure 1: Comparison of heterogeneous collaborative strategies. (a) Retraining of the encoders: the fusion network is frozen while the local encoder (denoted as E) is retrained to adapt to data integration; (b) Interpreter-based alignment: the local encoder and detection head (denoted as D) are frozen, and an interpreter projects the received features into the ego’s feature space; (c) Our method does not require any retraining of encoders or heads, and uses a modality-agnostic fusion module to integrate heterogeneous modalities, with a common feature space derived from ground truth serving as the alignment reference.
  • Figure 2: The framework of GT-Space. Heterogeneous features undergo domain conversion before being input into the fusion network. The ground truth object bounding box labels are leveraged to generate a common feature space. Heterogeneous features from different agents are projected into the common feature space for alignment and fusion.
  • Figure 3: Feature alignment in GT-Space. (a) Ground-truth feature generation: the object bounding box labels are encoded and mapped onto the BEV map; (b) Multi-modal transformer: each modality is mapped into a common feature space by a modality-specific projector; (c) Combinatorial contrastive learning: M1, M2, and M3 denote three different modalities, and computing the contrastive loss over all pairs of these modalities enhances the model’s general feature extraction ability; (d) Object-level alignment: contrastive learning pulls features of the same object closer and pushes those of different objects apart.
  • Figure 4: Performance of under-performing agents on OPV2V. The left column represents Agent 1, and the right column Agent 3. The horizontal axis represents performance without collaboration, while the vertical axis represents performance with collaboration.
  • Figure 5: Robustness experiments under pose error and communication latency.
  • ...and 5 more figures
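
The combinatorial contrastive learning described in the Figure 3 caption (a contrastive loss computed over all pairs of modalities, with features of the same object pulled together) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it substitutes a standard symmetric InfoNCE loss, assumes row i of every modality's feature matrix describes the same object i, and uses hypothetical names (`info_nce`, `combinatorial_contrastive_loss`) and an assumed temperature of 0.1.

```python
from itertools import combinations

import numpy as np


def info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss where row i of `a` and row i of `b` form the positive pair."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # cosine similarities, shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def combinatorial_contrastive_loss(features: dict, temperature: float = 0.1) -> float:
    """Average the symmetric InfoNCE loss over every unordered modality pair."""
    losses = []
    for m, n in combinations(sorted(features), 2):
        forward = info_nce(features[m], features[n], temperature)
        backward = info_nce(features[n], features[m], temperature)
        losses.append(0.5 * (forward + backward))
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(6, 16))  # 6 objects, 16-dim features (assumed sizes)
    feats = {
        name: base + 0.01 * rng.normal(size=base.shape)
        for name in ("M1", "M2", "M3")
    }
    print(combinatorial_contrastive_loss(feats))
```

With three modalities the loop visits the pairs (M1, M2), (M1, M3), and (M2, M3). The loss is low when per-object features agree across modalities and rises when object correspondence is broken, which is the behavior the object-level alignment in panel (d) relies on.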