Distilling Future Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection

Haowen Zheng, Hu Zhu, Lu Deng, Weihao Gu, Yang Yang, Yanyan Liang

TL;DR

FTKD is a camera-only, sparse-query knowledge distillation framework that lets online 3D object detectors learn from future-frame information without increasing inference cost. It combines future-aware feature reconstruction, which relaxes strict frame alignment between teacher and student, with future-guided logit distillation, which exploits the teacher's foreground and background context. On nuScenes, FTKD improves two strong baselines by up to 1.3 mAP and 1.3 NDS and yields the most accurate velocity estimation. Qualitative results highlight detections of occluded and distant objects.

Abstract

Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher's stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.
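The masked reconstruction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean-fill "generator" (a parameter-free stand-in for a learned reconstruction module), the mask ratio, and the choice to score only masked positions are all assumptions made for illustration. The student's tokens come from current and past frames, while the teacher's tokens are assumed to already encode future-frame context.

```python
import random


def mask_and_fill(student_tokens, mask_ratio, rng):
    """Mask a random subset of tokens and rebuild them from the visible ones.

    Hypothetical stand-in for a learned generator: every masked token is
    replaced by the mean of the unmasked tokens.
    """
    n = len(student_tokens)
    dim = len(student_tokens[0])
    num_masked = int(round(n * mask_ratio))
    masked_idx = sorted(rng.sample(range(n), num_masked))
    masked = set(masked_idx)
    visible = [tok for i, tok in enumerate(student_tokens) if i not in masked]
    if visible:
        fill = [sum(tok[d] for tok in visible) / len(visible) for d in range(dim)]
    else:
        fill = [0.0] * dim  # nothing visible: fall back to a zero token
    recon = [list(fill) if i in masked else list(tok)
             for i, tok in enumerate(student_tokens)]
    return recon, masked_idx


def masked_reconstruction_loss(student_tokens, teacher_tokens,
                               mask_ratio=0.5, seed=0):
    """Mean squared error between reconstructed student tokens and the
    teacher tokens, measured on the masked positions only (an assumption)."""
    rng = random.Random(seed)
    recon, masked_idx = mask_and_fill(student_tokens, mask_ratio, rng)
    if not masked_idx:
        return 0.0
    sq_err = [(r - t) ** 2
              for i in masked_idx
              for r, t in zip(recon[i], teacher_tokens[i])]
    return sum(sq_err) / len(sq_err)


# Toy usage: four 2-D tokens per model.
student = [[1.0, 2.0] for _ in range(4)]
teacher = [[2.0, 2.0] for _ in range(4)]
loss = masked_reconstruction_loss(student, teacher)
```

Because the target is the teacher's feature rather than the student's own unmasked feature, the student is pushed toward representations that carry future context, and the two models do not need to consume identical input frames.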

Paper Structure

This paper contains 19 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illustration of using future frames in feature distillation. (a) Spatial feature distillation requires strict alignment of input frames between the teacher and student models, preventing the use of future frame information. (b) Temporal relational distillation focuses on inter-frame relational knowledge but overlooks future frames. (c) In FTKD, information from future frames is aggregated temporally and used as the reconstruction objective for the student's masked features, facilitating effective learning of future knowledge.
  • Figure 2: Overall framework of Future Temporal Knowledge Distillation (FTKD). FTKD consists of two core distillation components: future-aware feature reconstruction (FFR) and future-guided logit distillation (FLD), which facilitate the transfer of future knowledge from the offline teacher to the online student model. Specifically, FFR conducts masked feature reconstruction on perspective features and sparse BEV query features, while FLD guides the student in capturing both foreground and background cues embedded in the sparse queries.
  • Figure 3: Visualization of sparse queries (a) with and (b) without future-aware feature reconstruction (FFR). Larger points denote shallower depth. With FFR, the sparse queries align more closely with the ground truth.
  • Figure 4: Qualitative results over three consecutive frames (front camera) in two scenes. The first and third rows show the predictions of the baseline model, while the second and fourth rows show the predictions of FTKD. In the last column, the LiDAR point cloud in BEV is displayed for frame $t+1$, except in the last row, where frame $t+2$ is shown because of the limited BEV distance. FTKD successfully predicts an occluded car merging into the main road and a pedestrian crossing the street in the distance, both highlighted by red dotted circles.
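The logit distillation component described in the Figure 2 caption can be sketched in the same hedged spirit. The paper's exact loss is not given on this page, so the following is a generic temperature-scaled KL divergence rather than the authors' formulation. It assumes that student and teacher queries are matched one-to-one by index, that class scores are normalized with a softmax, and that every query contributes, so background queries are distilled alongside foreground ones. All names and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one query's class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over all queries.

    Each row holds the class logits of one sparse query. No foreground
    filter is applied, so background queries are included (an assumption).
    """
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_teacher = softmax(t_row, temperature)
        p_student = softmax(s_row, temperature)
        total += sum(pt * math.log(pt / ps)
                     for pt, ps in zip(p_teacher, p_student)
                     if pt > 0.0)
    return total / len(student_logits)


# Toy usage: two queries, three classes each.
teacher_logits = [[4.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
student_logits = [[2.0, 1.0, 0.5], [0.5, 0.4, 1.0]]
loss = logit_distillation_loss(student_logits, teacher_logits)
```

The loss is zero when the student reproduces the teacher's distribution on every query and grows as the two diverge. A higher temperature softens both distributions so that low-scoring classes also influence the gradient.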