Sparse3DTrack: Monocular 3D Object Tracking Using Sparse Supervision

Nikhil Gosala, B. Ravi Kiran, Senthil Yogamani, Abhinav Valada

Abstract

Monocular 3D object tracking aims to estimate temporally consistent 3D object poses across video frames, enabling autonomous agents to reason about scene dynamics. However, existing state-of-the-art approaches are fully supervised and rely on dense 3D annotations over long video sequences, which are expensive to obtain and difficult to scale. In this work, we address this fundamental limitation by proposing the first sparsely supervised framework for monocular 3D object tracking. Our approach decomposes the task into two sequential sub-problems: 2D query matching and 3D geometry estimation. Both components leverage the spatio-temporal consistency of image sequences to augment a sparse set of labeled samples and learn rich 2D and 3D representations of the scene. Using these learned cues, our model automatically generates high-quality 3D pseudolabels across entire videos, transforming sparse supervision into dense 3D track annotations. This enables existing fully supervised trackers to operate effectively under extreme label sparsity. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our method significantly improves tracking performance, with gains of up to 15.50 percentage points while using at most four ground-truth annotations per track.

Paper Structure

This paper contains 20 sections, 5 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Sparse3DTrack: The first sparsely supervised approach for monocular 3D object tracking. Our framework couples spatio-temporal consistency of image sequences with sparse annotations (green) to generate high-quality 3D tracking pseudolabels (blue).
  • Figure 2: Overview of our Sparse3DTrack framework for sparsely supervised monocular 3D object tracking. Sparse3DTrack decomposes the overall problem into two sequential sub-tasks: 2D query matching, which finds object correspondences across frames, and 3D geometry estimation, which estimates the 3D pose and dimensions of the localized object. Our framework is extremely label-efficient and enables object tracking using at most four annotations per track.
  • Figure 3: Similarity and FNComp heatmaps computed by the $\mathcal{M}_{2d}$ and FNComp modules, respectively. Note that the similarity map focuses only on the query object in the target image, while the FNComp heatmap highlights all vehicles in the scene.
  • Figure 4: Illustration of data mining strategies used to tackle label sparsity in $\mathcal{M}_{2d}$. In this figure, green and red boxes represent labeled and unlabeled image frames, respectively.
  • Figure 5: Qualitative results of the 3D pseudolabels generated by Sparse3DTrack and the corresponding monocular 3D tracking performance when CenterTrack is trained using these pseudolabels. Each object track is visualized in a unique color, and its previous trajectory is illustrated using a sequence of dots of the same color. Note from (a, b) that Sparse3DTrack generates accurate and temporally consistent 3D pseudolabels even when objects undergo total occlusion, allowing CenterTrack to effectively track objects on unseen image sequences (c, d).
  • ...and 3 more figures
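
The similarity map mentioned in the Figure 3 caption can be illustrated with a toy example. The sketch below is not the authors' $\mathcal{M}_{2d}$ module, which is a learned network whose details are not given in this excerpt. It only assumes, for illustration, that a query object is represented by a single embedding vector and that matching scores are cosine similarities against per-pixel features of the target image. The function names `cosine_similarity_map` and `argmax_2d` and all feature values are hypothetical.

```python
import math


def cosine_similarity_map(query, features):
    """Return an H x W map of cosine similarities between `query`
    (a length-C vector) and every per-pixel vector in `features` (H x W x C)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query)
    heatmap = []
    for row in features:
        out_row = []
        for vec in row:
            denom = q_norm * norm(vec)
            dot = sum(a * b for a, b in zip(query, vec))
            # Guard against all-zero vectors to avoid division by zero.
            out_row.append(dot / denom if denom > 0 else 0.0)
        heatmap.append(out_row)
    return heatmap


def argmax_2d(heatmap):
    """Return the (row, col) index of the highest-scoring location."""
    coords = ((r, c) for r in range(len(heatmap)) for c in range(len(heatmap[0])))
    return max(coords, key=lambda rc: heatmap[rc[0]][rc[1]])


# Toy 2 x 3 target feature map with 2-channel embeddings (made-up values).
target_features = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[-1.0, 0.0], [1.0, 2.0], [0.5, 0.5]],
]
query_embedding = [0.0, 1.0]  # hypothetical embedding of the query object

heatmap = cosine_similarity_map(query_embedding, target_features)
match = argmax_2d(heatmap)  # location in the target most similar to the query
```

In this toy setting the peak of the heatmap marks the single location that best matches the query, which mirrors the caption's point that the similarity map responds only to the query object rather than to every vehicle in the scene.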