PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

Kuan-Chih Huang, Weijie Lyu, Ming-Hsuan Yang, Yi-Hsuan Tsai

TL;DR

We address efficient temporal 3D object detection for LiDAR by reducing memory needs when using multi-frame data. The proposed Point-Trajectory Transformer (PTT) fuses a current-frame object point cloud with multi-frame proposal trajectories through dedicated long-term, short-term, and future-aware encoders, connected by a point-trajectory aggregator. Through memory-budget analysis and extensive Waymo Open Dataset experiments, PTT demonstrates competitive accuracy while requiring far less memory than prior methods and enabling longer temporal windows. The results indicate that interactions between point and trajectory features, and future-aware trajectory encoding, are key for robust online temporal detection in resource-constrained environments.

Abstract

Recent temporal LiDAR-based 3D object detectors achieve promising performance based on the two-stage proposal-based approach. They generate 3D box candidates from the first-stage dense detector, followed by different temporal aggregation methods. However, these approaches require per-frame objects or whole point clouds, posing challenges related to memory bank utilization. Moreover, point clouds and trajectory features are combined solely based on concatenation, which may neglect effective interactions between them. In this paper, we propose a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection. To this end, we only utilize point clouds of current-frame objects and their historical trajectories as input to minimize the memory bank storage requirement. Furthermore, we introduce modules to encode trajectory features, focusing on long short-term and future-aware perspectives, and then effectively aggregate them with point cloud features. We conduct extensive experiments on the large-scale Waymo dataset to demonstrate that our approach performs well against state-of-the-art methods. Code and models will be made publicly available at https://github.com/kuanchihhuang/PTT.
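The memory argument in the abstract can be illustrated with some back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the class name `MemoryBudget`, the object count, the points per object, and the feature sizes are all hypothetical values chosen only to show how the two storage strategies scale with the number of frames.

```python
from dataclasses import dataclass


@dataclass
class MemoryBudget:
    """Hypothetical memory-bank model; every default here is an assumption."""
    num_objects: int        # proposals tracked per frame (assumed)
    points_per_object: int  # sampled points per object (assumed)
    point_dim: int          # floats per point, e.g. x, y, z, intensity (assumed)
    box_dim: int            # floats per box, e.g. x, y, z, l, w, h, heading (assumed)
    bytes_per_float: int = 4

    def per_frame_points(self, num_frames: int) -> int:
        """Bytes needed if object points are stored for every frame."""
        floats = (self.num_objects * num_frames
                  * self.points_per_object * self.point_dim)
        return floats * self.bytes_per_float

    def trajectory_only(self, num_frames: int) -> int:
        """Bytes needed if only current-frame points and one box per frame are stored."""
        point_floats = self.num_objects * self.points_per_object * self.point_dim
        box_floats = self.num_objects * num_frames * self.box_dim
        return (point_floats + box_floats) * self.bytes_per_float


budget = MemoryBudget(num_objects=100, points_per_object=128, point_dim=4, box_dim=7)
for frames in (8, 16, 64):
    dense = budget.per_frame_points(frames)
    sparse = budget.trajectory_only(frames)
    print(f"{frames:3d} frames: per-frame points {dense / 1e6:.2f} MB, "
          f"trajectories only {sparse / 1e6:.2f} MB")
```

Under these assumed sizes, the per-frame-points cost grows with frames times points, while the trajectory-only cost grows with frames times box parameters, which is why a longer temporal window stays affordable in the second case.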

Paper Structure (13 sections, 8 equations, 3 figures, 8 tables)

This paper contains 13 sections, 8 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Different approaches for temporal 3D object detection. (a) Existing methods (MPPNet, MSF) require per-frame point clouds as input, resulting in more memory overhead. In addition, the straightforward concatenation of point and trajectory features overlooks the interactions between them. (b) Our approach minimizes the space requirement for the memory bank by utilizing only the current frame's point cloud as input. Moreover, we introduce a point-trajectory transformer (PTT) to effectively integrate point and trajectory features. Gray point clouds are those that are not utilized.
  • Figure 2: Overall framework of the proposed Point-Trajectory Transformer (PTT). First, we utilize a region proposal network (RPN) at timestamp $T$ to generate proposals $\mathbf{B}^T$ for each frame, sample the corresponding points of interest $\mathbf{I}^T$, and connect the 3D proposals of the past $T$ frames to form proposal trajectories $\{\mathbf{B}^1,\dots,\mathbf{B}^T\}$. Then, we take the single-frame point cloud of each object and its previous multi-frame trajectory as input to generate point-trajectory features ${\mathbf P}^t$, which avoids storing per-frame points and reduces the memory bank's overhead (Section \ref{sec:feat}). Finally, we present a point-trajectory transformer (PTT) to fuse features, which consists of four components: long-term and short-term encoders for extracting two types of features, a future-aware module for extracting future-aware point features, and a point-trajectory aggregator for adaptive interaction between the trajectory features and the current frame's point features ${\mathbf G}^T$ (Section \ref{sec:ptt}).
  • Figure 3: Point-Trajectory Aggregator. Current-frame point cloud features $\mathbf{G}^T$ are squeezed and interact with long-term memory $\hat{\mathbf{M}}_l$, short-term memory $\hat{\mathbf{M}}_s$, and future memory $\hat{\mathbf{M}}'_f$. See Section \ref{sec:ptt} for more details.
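
The aggregation described in the Figure 3 caption can be sketched roughly as follows. The caption does not specify how the point features are squeezed or how they interact with the memories, so this sketch assumes max-pooling for the squeeze and plain scaled dot-product cross-attention with residual connections for the interaction, with no learned projections. The function names, shapes, and the order in which the memories are visited are hypothetical and should not be read as the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, memory):
    """Scaled dot-product attention: query (1, C) attends over memory (K, C)."""
    scale = np.sqrt(query.shape[-1])
    weights = softmax(query @ memory.T / scale, axis=-1)  # (1, K)
    return weights @ memory                               # (1, C)


def aggregate(point_feats, long_mem, short_mem, future_mem):
    """Squeeze current-frame point features and let them attend to each memory.

    Assumptions: the squeeze is max-pooling over points, and each interaction
    is a residual cross-attention step without learned weights.
    """
    fused = point_feats.max(axis=0, keepdims=True)        # (N, C) -> (1, C)
    for memory in (long_mem, short_mem, future_mem):
        fused = fused + cross_attention(fused, memory)
    return fused


rng = np.random.default_rng(0)
channels = 16
point_feats = rng.normal(size=(128, channels))  # stands in for G^T
long_mem = rng.normal(size=(8, channels))       # stands in for long-term memory
short_mem = rng.normal(size=(3, channels))      # stands in for short-term memory
future_mem = rng.normal(size=(2, channels))     # stands in for future memory
fused = aggregate(point_feats, long_mem, short_mem, future_mem)
print(fused.shape)
```

A real module would add learned query, key, and value projections plus normalization; they are left out here to keep the data flow visible.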