Timealign: A multi-modal object detection method for time misalignment fusing in autonomous driving

Zhihang Song, Lihui Peng, Jianming Hu, Danya Yao, Yi Zhang

TL;DR

The paper tackles temporal misalignment caused by LiDAR data lag in multi-modal BEV-based 3D object detection for autonomous driving. It introduces TimeAlign, which predicts current LiDAR BEV features from historical frames using a Swin-LSTM, and fuses these predictions with observed LiDAR features under camera guidance via a dual-transformer and deformable convolution mechanism within GraphBEV. Empirical results on nuScenes show that TimeAlign improves detection under LiDAR lag compared to a lag-aware GraphBEV baseline, while acknowledging performance gaps when data are perfectly synchronized due to prediction errors and limited historical frames. The work advances robust multi-modal fusion by explicitly addressing temporal misalignment and lays groundwork for further integration of temporal prediction in BEV-based perception systems.

Abstract

Multi-modal perception methods are thriving in autonomous driving because they make better use of complementary data from different sensors. Such methods depend on calibration and synchronization between sensors to obtain accurate environmental information. Robustness to spatial misalignment in autonomous driving object detection has already been studied, but research on temporal alignment remains relatively scarce. Because LiDAR point clouds are harder to transfer in real time in real-world experiments, our study uses historical LiDAR frames to better align features when LiDAR data lags. We designed a TimeAlign module, built on the SOTA GraphBEV framework, that predicts LiDAR features and combines them with observations to tackle such time misalignment.
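
To make the lag setting concrete, the following is a minimal sketch of how a pipeline might decide which LiDAR frames are usable at a given camera timestamp. It is not the paper's implementation: the function name `select_lidar_frames`, its parameters, and the assumption of a constant transfer delay are all hypothetical and chosen only for illustration. The returned history indices stand in for the historical frames that a predictor such as the Swin-LSTM would consume.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def select_lidar_frames(
    lidar_timestamps: List[float],
    camera_time: float,
    lag: float,
    history: int,
) -> Tuple[Optional[int], List[int]]:
    """Pick the newest LiDAR frame that has arrived by camera_time.

    Hypothetical helper: assumes sorted capture timestamps and a constant
    transfer delay `lag`, so a frame captured at t arrives at t + lag.
    Returns (index of newest available frame, indices of up to `history`
    frames immediately before it). Returns (None, []) if nothing arrived.
    """
    # Only frames captured at or before this cutoff have arrived.
    cutoff = camera_time - lag
    newest = bisect_right(lidar_timestamps, cutoff) - 1
    if newest < 0:
        return None, []
    # Older frames that could feed a temporal predictor.
    start = max(0, newest - history)
    return newest, list(range(start, newest))
```

With no lag, the newest frame matches the camera time. As `lag` grows, the newest available frame falls further behind, which is the gap that predicting current features from history is meant to close.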

Paper Structure

This paper contains 10 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overall structure of the TimeAlign model.
  • Figure 2: Structure of the Swin-LSTM LiDAR feature prediction module in TimeAlign.
  • Figure 3: Structure of the layers that combine predicted and observed LiDAR features in TimeAlign.