AgentAlign: Misalignment-Adapted Multi-Agent Perception for Resilient Inter-Agent Sensor Correlations

Zonglin Meng, Yun Zhang, Zhaoliang Zheng, Zhihao Zhao, Jiaqi Ma

TL;DR

AgentAlign tackles the fragility of inter-agent sensor correlations in real-world cooperative perception by introducing a cross-modality feature alignment space (CFAS) and a heterogeneous agent feature alignment (HAFA) mechanism to adaptively harmonize multi-modal features across diverse V2X agents. The framework densifies LiDAR-Camera interactions through a depth variation map and attention-guided fusion, enabling resilient inter-agent sensing under multifactorial noise. A novel V2XSet-Noise dataset is built to systematically evaluate robustness to calibration errors, wind-induced vibration, perspective distortions, time synchronization, and systematic biases. Empirical results on V2X-Real and V2XSet-Noise benchmarks demonstrate state-of-the-art performance and strong robustness, supported by extensive ablations that isolate the contribution of CFAS, HAFA, and depth-based alignment. The work advances practical cooperative perception for autonomous infrastructure and vehicles and provides a controllable evaluation platform for real-world sensor imperfections.

Abstract

Cooperative perception has attracted wide attention given its capability to leverage shared information across connected automated vehicles (CAVs) and smart infrastructure to address sensing occlusion and range limitation issues. However, existing research overlooks the fragile multi-sensor correlations in multi-agent settings, as the heterogeneous agent sensor measurements are highly susceptible to environmental factors, leading to weakened inter-agent sensor interactions. The varying operational conditions and other real-world factors inevitably introduce multifactorial noise and consequently lead to multi-sensor misalignment, making the deployment of multi-agent multi-modality perception particularly challenging in the real world. In this paper, we propose AgentAlign, a real-world heterogeneous agent cross-modality feature alignment framework, to effectively address these multi-modality misalignment issues. Our method introduces a cross-modality feature alignment space (CFAS) and a heterogeneous agent feature alignment (HAFA) mechanism to dynamically harmonize multi-modality features across various agents. Additionally, we present a novel V2XSet-Noise dataset that simulates realistic sensor imperfections under diverse environmental conditions, facilitating a systematic evaluation of our approach's robustness. Extensive experiments on the V2X-Real and V2XSet-Noise benchmarks demonstrate that our framework achieves state-of-the-art performance, underscoring its potential for real-world applications in cooperative autonomous driving. The controllable V2XSet-Noise dataset and generation pipeline will be released in the future.

Paper Structure

This paper contains 22 sections, 26 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: V2X agents are typically affected by various types of noise. In V2X settings, each agent encounters unique noise depending on its operational context, as shown in Fig.(a). In Fig.(b), we illustrate the misalignment between the camera and projected LiDAR point clouds caused by calibration errors. Fig.(c) shows the misalignment of LiDAR point clouds on infrastructure due to vibration. In Fig.(d), a common calibration error can result in a 10% drop in AP@0.7 on both OPV2V and V2XSet, where "clean" means that there are no calibration errors.
  • Figure 2: The detailed structure of our proposed cooperative perception framework. In AgentAlign, connected vehicles and infrastructures communicate with each other to share the metadata and sensor measurements. Multifactorial noise often originates from agent-specific factors and propagates to other agents during communication. We propose the heterogeneous agent feature alignment method (HAFA) to dynamically align the misaligned sensor representations in the cross-modality feature alignment space (CFAS), where we construct the depth variation map from the LiDAR data for improved feature alignment. Our HAFA dynamically balances the contributions from different modality features for each unique agent. Then, each agent performs the multi-agent fusion to generate the 3D bounding boxes. We evaluate our model with real-world multi-agent data containing various types of real-world noise. Also, to systematically evaluate the impact of each type of noise, we evaluate our proposed framework on a new large-scale open dataset, namely V2XSet-Noise, to explicitly consider real-world sensor noise in V2X communication scenarios.
  • Figure 3: The detailed structure of our proposed cooperative perception network. The proposed AgentAlign consists of feature construction, cross-modality feature alignment space, heterogeneous agent feature alignment, and multi-agent fusion. Initially, multi-modality features are extracted from both camera and LiDAR data, followed by constructing the depth variation map within CFAS. Subsequently, the HAFA mechanism dynamically selects and aligns relevant features from both the camera output and the LiDAR depth variation map. The generated features are then transformed into a bird’s-eye-view (BEV) perspective and concatenated with LiDAR features, creating an aligned feature set for each agent. Finally, each agent’s aligned feature is processed by a transformer encoder and forwarded to the detection head, which produces the 3D bounding boxes.
  • Figure 4: The decomposed noise study. We analyze the impact of each decomposed noise by comparing the cooperative perception performance using our aligned features against the misaligned features in the V2XSet-Noise dataset.
  • Figure 5: Visual representation of wind-induced vibration. The red points indicate the original point cloud data, while the green points visualize the data after applying wind-induced vibration. In the figure, we magnify specific regions to clearly illustrate the impact of vibration on the sensor data.
  • ...and 4 more figures
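
The camera-LiDAR misalignment described in the Figure 1 caption can be illustrated with a minimal pinhole-projection sketch. This is not the paper's noise model or the V2XSet-Noise generation pipeline: the axis convention, the intrinsic matrix `K`, the pure-yaw error, and all function names are assumptions chosen for illustration only. The sketch projects LiDAR points into an image once with the true extrinsic rotation and once with a perturbed one, and reports how far each projected point moves in pixels.

```python
import numpy as np

# Assumed axis convention (not from the paper):
# LiDAR frame: x forward, y left, z up; camera frame: x right, y down, z forward.
R_CAM_FROM_LIDAR = np.array([[0.0, -1.0, 0.0],
                             [0.0, 0.0, -1.0],
                             [1.0, 0.0, 0.0]])

# Hypothetical intrinsics: 800 px focal length, principal point at (640, 360).
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])


def yaw_matrix(yaw_deg):
    """Rotation about the LiDAR z axis by yaw_deg degrees."""
    a = np.deg2rad(yaw_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def to_pixels(pts_cam, K):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixel coordinates."""
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def pixel_shift(points_lidar, yaw_error_deg, K):
    """Per-point pixel displacement caused by a yaw error in the extrinsics."""
    R_noisy = R_CAM_FROM_LIDAR @ yaw_matrix(yaw_error_deg)
    cam_clean = points_lidar @ R_CAM_FROM_LIDAR.T
    cam_noisy = points_lidar @ R_noisy.T
    # Keep only points in front of the camera under both calibrations.
    valid = (cam_clean[:, 2] > 0) & (cam_noisy[:, 2] > 0)
    uv_clean = to_pixels(cam_clean[valid], K)
    uv_noisy = to_pixels(cam_noisy[valid], K)
    return np.linalg.norm(uv_noisy - uv_clean, axis=1)


if __name__ == "__main__":
    # Two points straight ahead of the sensor, at 20 m and 40 m.
    points = np.array([[20.0, 0.0, 0.0], [40.0, 0.0, 0.0]])
    print(pixel_shift(points, yaw_error_deg=1.0, K=K))
```

Under these assumed values, a 1 degree yaw error moves a point on the optical axis by `800 * tan(1°)`, roughly 14 pixels, regardless of its range. This gives a sense of why even small calibration errors can break the pixel-level correspondence between camera features and projected LiDAR points.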