TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception

Zhiying Song, Lei Yang, Fuxi Wen, Jun Li

TL;DR

TraF-Align tackles inter-agent latency in asynchronous V2X cooperative perception by predicting object trajectories from past observations and guiding cross-agent feature interaction along those trajectories. The method introduces a field predictor to generate trajectory fields, an offset generator to produce attention sampling points, and trajectory-aware attention to align and reconstruct current-time features for robust fusion. End-to-end training employs a field loss and an offset loss (with Sinkhorn matching) to supervise trajectory alignment and attention point generation, achieving state-of-the-art results on V2V4Real and DAIR-V2X-Seq under latencies up to $400$ ms. The work enables coherent semantic fusion across frames and agents, improving detection accuracy and latency robustness, with practical impact for real-world asynchronous cooperative perception systems.
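The Sinkhorn matching mentioned for the offset loss can be illustrated with a small, generic sketch. The summary does not specify how the matching is set up, so everything here is an assumption made for illustration: the function name `sinkhorn`, the Euclidean cost between predicted and target points, the uniform marginals, and the values of `eps` and `n_iters`. It shows only the general idea of softly assigning predicted sampling points to target points before a loss is computed, and is not the authors' implementation.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised optimal transport with uniform marginals.

    cost: (n, m) pairwise cost matrix. Returns an (n, m) soft assignment plan.
    """
    n, m = cost.shape
    K = np.exp(-cost / eps)          # Gibbs kernel
    a = np.full(n, 1.0 / n)          # mass on each predicted point (assumed uniform)
    b = np.full(m, 1.0 / m)          # mass on each target point (assumed uniform)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows towards marginal a
        v = b / (K.T @ u)            # rescale columns towards marginal b
    return u[:, None] * K * v[None, :]

# Hypothetical toy data: three predicted points and three shuffled targets.
pred = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
target = np.array([[2.1, 1.9], [0.1, -0.1], [0.9, 1.1]])

cost = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
plan = sinkhorn(cost)
match = plan.argmax(axis=1)          # hard assignment read off the soft plan
```

In this toy case each predicted point is paired with its nearest target regardless of the order in which the targets are listed, which is the property a matching step provides when the supervision has no fixed ordering.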

Abstract

Cooperative perception presents significant potential for enhancing the sensing capabilities of individual vehicles; however, inter-agent latency remains a critical challenge. Latencies cause misalignments in both spatial and semantic features, complicating the fusion of real-time observations from the ego vehicle with delayed data from others. To address these issues, we propose TraF-Align, a novel framework that learns the flow path of features by predicting the feature-level trajectory of objects from past observations up to the ego vehicle's current time. By generating temporally ordered sampling points along these paths, TraF-Align directs attention from the current-time query to relevant historical features along each trajectory, supporting the reconstruction of current-time features and promoting semantic interaction across multiple frames. This approach corrects spatial misalignment and ensures semantic consistency across agents, effectively compensating for motion and achieving coherent feature fusion. Experiments on two real-world datasets, V2V4Real and DAIR-V2X-Seq, show that TraF-Align sets a new benchmark for asynchronous cooperative perception.
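The idea of attending from a current-time query to historical features sampled along a trajectory can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's architecture: the sampling points are supplied by hand rather than predicted by an offset generator, there is a single query and a single attention head with no learned projections, and the names `bilinear_sample` and `trajectory_attention` are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a feature map feat (H, W, C) at fractional (y, x)."""
    H, W, _ = feat.shape
    y = float(np.clip(y, 0, H - 1))
    x = float(np.clip(x, 0, W - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

def trajectory_attention(query, history, points):
    """Attend from one current-time query to features sampled along a trajectory.

    query:   (C,) feature at the object's current-time location
    history: (T, H, W, C) delayed BEV feature maps, oldest first
    points:  (T, 2) one (y, x) sampling point per past frame, oldest first
    """
    keys = np.stack([bilinear_sample(f, y, x) for f, (y, x) in zip(history, points)])
    scores = keys @ query / np.sqrt(query.shape[0])   # scaled dot-product scores
    weights = np.exp(scores - scores.max())           # numerically stable softmax
    weights /= weights.sum()
    return weights @ keys, weights                    # keys double as values here

# Hypothetical toy scene: an object moves one cell per frame along x.
obj = np.array([1.0, 0.5, -0.5, 2.0])
history = np.zeros((3, 8, 8, 4))
for t in range(3):
    history[t, 4, 1 + t] = obj                        # object at x = 1, 2, 3

on_path = np.array([[4.0, 1.0], [4.0, 2.0], [4.0, 3.0]])   # follows the motion
static = np.array([[4.0, 4.0]] * 3)                        # ignores the motion

aligned, w = trajectory_attention(obj, history, on_path)
naive, _ = trajectory_attention(obj, history, static)
```

In this toy setup, sampling along the trajectory recovers the object's feature at the current-time location, while sampling every past frame at the current location returns only empty background, which is the spatial misalignment the abstract describes.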

Paper Structure

This paper contains 20 sections, 4 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Asynchronous observations of ego vehicle and other agents result in spatial and semantic misalignments.
  • Figure 1: Precision-Recall curves showing the ablation results for the major components when the ego vehicle experiences $0$ ms and $400$ ms delays on the DAIR-V2X-Seq dataset.
  • Figure 2: Proposed Architecture. With asynchronous LiDAR inputs from the ego agent (time $t$) and agent $i$ (time $s = t - \tau_t^i$), BEV features are extracted using an onboard sparse encoder. Agent $i$'s features are transmitted to the ego agent and stored in memory. TraF-Align predicts a trajectory field up to time $t$, generates target attention positions, and uses features at these positions as keys/values in attention layers. Finally, the multi-agent features are fused and processed by the head to generate cooperative predictions at time $t$.
  • Figure 2: Precision-Recall curves showing the ablation results for the losses when the ego vehicle experiences $0$ ms and $400$ ms delays on the DAIR-V2X-Seq dataset.
  • Figure 3: Illustration of trajectory field, which includes a position field indicating the occupancy grid of the trajectory and a direction field depicting the trajectory's flow orientation.
  • ...and 9 more figures