CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception

Gong Chen, Chaokun Zhang, Tao Tang, Pengcheng Lv, Feng Li, Xin Xie

TL;DR

This work proposes Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems and demonstrates its superior robustness and adaptability.

Abstract

Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose the Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) module that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a spatio-temporally unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, demonstrating superior robustness and adaptability.
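To make the idea of adjacent-frame differential modeling concrete, the following is a minimal sketch of latency compensation by first-order extrapolation. It is not the STSync module described in the abstract, which is a learned recurrent design; the function name `extrapolate_features`, the flat list representation of features, and the linear-drift assumption are all hypothetical choices made for illustration.

```python
from typing import List


def extrapolate_features(prev: List[float], last: List[float],
                         delay_steps: float) -> List[float]:
    """Estimate current-time features from the two most recent received frames.

    Assumption (illustrative only): features drift roughly linearly over the
    delay, so the adjacent-frame difference (last - prev) acts as a per-step
    rate of change that can be projected forward.
    """
    if len(prev) != len(last):
        raise ValueError("frames must have the same length")
    return [b + delay_steps * (b - a) for a, b in zip(prev, last)]


# Two delayed frames from a collaborating agent (toy values).
frame_older = [0.0, 1.0, 4.0]
frame_newer = [1.0, 1.0, 3.0]

# Project the newer frame forward by a latency of two frame intervals.
estimate = extrapolate_features(frame_older, frame_newer, delay_steps=2.0)
print(estimate)
```

A learned module would replace the fixed linear projection with a recurrent update conditioned on the ego features, but the input (adjacent-frame differences) and the output (a current-time estimate) play the same roles.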

Paper Structure

This paper contains 13 sections, 20 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Schematic of cooperative perception under latency and noise.
  • Figure 2: Architectural overview of the CATNet framework. The ego vehicle maintains a fusion feature bank that processes multi-agent transmitted features, employing temporal prediction mechanisms for current-time feature estimation. These refined features subsequently undergo denoising operations and salient feature extraction, achieving robust cross-agent feature fusion via learnable aggregation weights.
  • Figure 3: Architecture of the proposed STSync module. The module employs adaptive integration for preliminary multi-agent feature fusion, coupled with TARU that incorporates ego-vehicle features to achieve temporally coherent feature prediction.
  • Figure 4: Architecture of the WTDen module. The module performs global and local feature denoising via a dual-branch design integrating wavelet-Mamba and wavelet convolution.
  • Figure 5: Architecture of the AdpSel module. The module divides features through a Selector, applies the MLLA module to enhance critical features, and employs inverted bottleneck convolution for feature compensation in non-critical regions. For critical features, these operations are iteratively applied, and the results are fused.
  • ...and 3 more figures
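
The wavelet-based denoising mentioned in the Figure 4 caption can be illustrated with a classical, non-learned analogue: a single-level Haar transform followed by soft thresholding of the detail coefficients. This is a sketch under stated assumptions, not the WTDen module (which combines wavelet-Mamba and wavelet convolution branches); the averaging form of the Haar transform, the 1D signal, and all function names here are illustrative choices.

```python
from typing import List, Tuple


def haar_forward(x: List[float]) -> Tuple[List[float], List[float]]:
    """Single-level Haar transform (averaging variant, not orthonormal)."""
    if len(x) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx: List[float], detail: List[float]) -> List[float]:
    """Exact inverse of haar_forward."""
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink a coefficient toward zero; small coefficients become zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def denoise(x: List[float], threshold: float) -> List[float]:
    """Suppress small high-frequency detail while keeping the coarse signal."""
    approx, detail = haar_forward(x)
    detail = [soft_threshold(d, threshold) for d in detail]
    return haar_inverse(approx, detail)


noisy = [4.0, 4.5, 2.0, 8.0]
clean = denoise(noisy, threshold=0.5)
print(clean)
```

In this toy signal the small jitter between the first two samples is removed entirely, while the large step between the last two is kept with reduced amplitude, which is the usual behaviour of soft thresholding.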