
A Prediction-as-Perception Framework for 3D Object Detection

Song Zhang, Haoyu Chen, Ruibo Wang

Abstract

Humans combine prediction and perception to observe the world. When faced with rapidly moving birds or insects, we can perceive them clearly only by predicting their next position and focusing our gaze there. Inspired by this, this paper proposes the Prediction-As-Perception (PAP) framework, which integrates a prediction-perception architecture into 3D object perception tasks to improve the model's perceptual accuracy. The PAP framework consists of two main modules, prediction and perception, and primarily takes consecutive-frame information as input. First, the prediction module forecasts the potential future positions of ego vehicles and surrounding traffic participants based on the perception results of the current frame. These predicted positions are then passed as queries to the perception module of the subsequent frame, whose results are in turn fed back into the prediction module. We evaluated the PAP structure with the end-to-end model UniAD on the nuScenes dataset. The results show that the PAP structure improves UniAD's target tracking accuracy by 10% and increases its inference speed by 15%. This indicates that such a biomimetic design significantly enhances the efficiency and accuracy of perception models while reducing computational resource consumption.
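
The frame-to-frame loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: `perceive`, `predict`, and `run_pap` are hypothetical stand-ins, queries are reduced to 2D positions, detection is a simple distance test, and prediction assumes a known constant velocity. The sketch only shows how positions predicted in one frame become queries for the next.

```python
import random

def perceive(queries, objects, radius=2.0):
    """Toy perception (assumption): an object is detected when at least
    one query lies within `radius` of it."""
    detections = []
    for ox, oy in objects:
        if any((qx - ox) ** 2 + (qy - oy) ** 2 <= radius ** 2
               for qx, qy in queries):
            detections.append((ox, oy))
    return detections

def predict(detections, velocity=(1.0, 0.0)):
    """Toy prediction (assumption): constant-velocity extrapolation
    of every detection by one frame."""
    vx, vy = velocity
    return [(x + vx, y + vy) for x, y in detections]

def run_pap(frames, initial_queries=None, num_random=4, seed=0, area=50.0):
    """Run the prediction-perception loop over a list of frames, where
    each frame is a list of ground-truth (x, y) object positions."""
    rng = random.Random(seed)
    query_bank = list(initial_queries or [])  # queries carried between frames
    history = []
    for objects in frames:
        # Fresh random queries let the model pick up newly appearing objects.
        random_queries = [(rng.uniform(0, area), rng.uniform(0, area))
                          for _ in range(num_random)]
        queries = query_bank + random_queries
        detections = perceive(queries, objects)
        # Predicted positions become the next frame's stored queries.
        query_bank = predict(detections)
        history.append(detections)
    return history

if __name__ == "__main__":
    # One object moving one unit per frame along x.
    frames = [[(10 + t, 5)] for t in range(5)]
    print(run_pap(frames, initial_queries=[(10, 5)], num_random=0))
```

In this toy setting, an object that has been detected once keeps being detected without any random queries, because the predicted position already lands on it in the next frame.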

Paper Structure (14 sections, 3 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overall architecture of PAP. The PAP framework consists of perception and prediction modules. The input to the perception module is the current frame's image and queries, which include both randomly generated queries for the current frame and queries generated from the output of the prediction module for the previous frame. The input to the prediction module is the output of the perception module, and its output is the possible future positions of the traffic participants detected in the current frame. These position coordinates, once converted into queries, are stored in the query bank for use by subsequent frames.
  • Figure 2: Perception module of PAP. The perception module primarily uses queries to detect objects. We assume the perception module has a structure similar to DETR3D [13]. The queries input to the perception module consist of two parts: one part generated from the prediction results of the previous frame, and another part randomly generated for the current frame. When the queries generated from the previous frame's prediction results are projected onto the map, they lie visibly closer to the locations of traffic participants in the current environment than the randomly generated queries do.
  • Figure 3: PAP with UniAD [9]. In UniAD, the interaction between modules is inherently based on queries. Therefore, we only need to take the queries output by the MotionFormer module in UniAD, embed them to match the dimensions of the track queries, and then feed them together with the track queries into TrackFormer for detection.
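
The query hand-off described in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions: the embedding is taken to be a single linear map (the caption does not say which embedding is used), the weights are fixed by hand rather than learned, and `linear_embed` and `build_tracker_input` are hypothetical names, not part of UniAD.

```python
def linear_embed(queries, weight, bias):
    """Project each query of length d_in to length d_out via out = W q + b.
    `weight` has shape (d_out, d_in); `bias` has length d_out."""
    for query in queries:
        if any(len(row) != len(query) for row in weight):
            raise ValueError("weight columns must match query dimension")
    return [
        [sum(w * q for w, q in zip(row, query)) + b
         for row, b in zip(weight, bias)]
        for query in queries
    ]

def build_tracker_input(track_queries, motion_queries, weight, bias):
    """Embed motion queries to the track-query dimension, then append
    them to the track queries to form the tracker's input set."""
    embedded = linear_embed(motion_queries, weight, bias)
    for query in track_queries:
        if len(query) != len(bias):
            raise ValueError("track queries must have dimension d_out")
    return track_queries + embedded

if __name__ == "__main__":
    # Hypothetical sizes: motion queries of dimension 3, track queries of dimension 2.
    weight = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]]
    bias = [0.0, 1.0]
    track_queries = [[0.0, 0.0]]
    motion_queries = [[1.0, 2.0, 3.0]]
    print(build_tracker_input(track_queries, motion_queries, weight, bias))
```

In a real model the projection would be a trained layer and the queries would be tensors; plain lists are used here only to keep the example self-contained.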