TrackOcc: Camera-based 4D Panoptic Occupancy Tracking

Zhuoguang Chen, Kenan Li, Xiuyu Yang, Tao Jiang, Yiming Li, Hang Zhao

TL;DR

This work defines Camera-based 4D Panoptic Occupancy Tracking, a task that jointly performs panoptic occupancy segmentation and object tracking from camera input. It proposes TrackOcc, a streaming end-to-end model that uses 4D panoptic queries (emerging and track queries) and a deformable Volume Cross-Attention decoder guided by a localization-aware loss, enabling temporally consistent panoptic labeling in 3D space. A new metric, OccSTQ, combines segmentation quality and association quality to evaluate both voxel-level accuracy and cross-frame identity tracking. The method is validated on Occ3D-Waymo, where TrackOcc achieves state-of-the-art results. The approach emphasizes localization with a differentiable query propagation mechanism and demonstrates practical, real-time performance on standard hardware, advancing camera-only perception for autonomous systems.
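To make the idea of a combined segmentation-and-association score concrete, here is a minimal sketch. It assumes OccSTQ follows the geometric-mean convention of the earlier STQ metric (the square root of segmentation quality times association quality); this summary does not state the exact formula. The function names `semantic_quality` and `combine_quality` are hypothetical, and the association term is taken as a precomputed input rather than derived from track IDs.

```python
import math
from collections import Counter

def semantic_quality(pred, gt, num_classes, ignore=None):
    """Mean IoU over classes present in prediction or ground truth.

    pred, gt: flat sequences of per-voxel class labels (toy stand-in
    for an occupancy grid). `ignore` is an optional label to skip.
    """
    inter, pred_count, gt_count = Counter(), Counter(), Counter()
    for p, g in zip(pred, gt):
        if ignore is not None and g == ignore:
            continue
        pred_count[p] += 1
        gt_count[g] += 1
        if p == g:
            inter[g] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + gt_count[c] - inter[c]
        if union > 0:  # skip classes absent from both pred and gt
            ious.append(inter[c] / union)
    return sum(ious) / len(ious) if ious else 0.0

def combine_quality(sq, aq):
    """Geometric mean of segmentation and association quality (assumed form)."""
    return math.sqrt(sq * aq)

# Toy usage: four voxels, two classes, association quality given as 0.8.
sq = semantic_quality([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
score = combine_quality(sq, 0.8)
```

A geometric mean penalizes a model that does well on only one of the two terms, which is why it is a common choice for combined metrics of this kind.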

Abstract

Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Leveraging the localization-aware loss, TrackOcc enhances the accuracy of 4D panoptic occupancy tracking without bells and whistles. Experimental results demonstrate that our method achieves state-of-the-art performance on the Waymo dataset. The source code will be released at https://github.com/Tsinghua-MARS-Lab/TrackOcc.

Paper Structure

This paper contains 22 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison of tasks for scene understanding. (a) Outputs bounding box tracks from multi-view image sequences. (b) Predicts semantic labels for the occupancy volume from multi-view images. (c) Applies 3D panoptic segmentation on aggregated 4D LiDAR point volumes, followed by post-processing instance matching. (d) Our proposed task: Predicts temporally consistent panoptic labels of the occupancy from multi-view image sequences.
  • Figure 2: Overall pipeline of TrackOcc. At each timestep, the encoder extracts multi-view image features and transforms them into 3D volume features. These volume features serve as critical context for updating the 4D panoptic queries via a designed decoder. The query propagation module facilitates efficient object tracking in a streaming, end-to-end manner. The symbol $\bigotimes$ denotes matrix multiplication.
  • Figure 3: Qualitative results of our method and 4D-LCA. The same color indicates identical instances across different time steps. Our method effectively tracks both moving and stationary vehicles, while the 4D-LCA method encounters difficulties with moving vehicles. For clarity, only images from the front camera are displayed, although both methods process data from surrounding cameras.
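The streaming use of track and emerging queries described in the Figure 2 caption can be illustrated with a toy loop. This is only a sketch of a common pattern in query-based trackers, not the TrackOcc implementation: the `Query` class, the `propagate` and `run_stream` functions, the confidence threshold, and the scalar `feature` stand-in are all assumptions introduced for illustration, and `decode` is a placeholder for the real decoder.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

@dataclass
class Query:
    feature: float                   # stand-in for a learned embedding
    track_id: Optional[int] = None   # None marks an emerging query

def propagate(queries, scores, next_id, keep_thresh=0.5):
    """Keep confident queries as track queries for the next frame.

    A surviving track query keeps its ID; a surviving emerging
    query is assigned a fresh ID (hypothetical rule).
    """
    survivors = []
    for q, s in zip(queries, scores):
        if s < keep_thresh:
            continue  # dropped: low confidence in this frame
        tid = q.track_id if q.track_id is not None else next(next_id)
        survivors.append(Query(q.feature, tid))
    return survivors

def run_stream(frames, decode, num_emerging=2):
    """Process frames one at a time, carrying track queries forward."""
    next_id = count()
    track_queries = []
    history = []
    for frame in frames:
        emerging = [Query(0.0) for _ in range(num_emerging)]
        queries = track_queries + emerging
        updated, scores = decode(queries, frame)  # placeholder decoder
        track_queries = propagate(updated, scores, next_id)
        history.append([q.track_id for q in track_queries])
    return history

def toy_decode(queries, frame):
    """Toy decoder: the frame directly supplies one score per query."""
    scores = [frame[i] if i < len(frame) else 0.0 for i in range(len(queries))]
    return queries, scores

history = run_stream([[0.9, 0.1], [0.9, 0.9, 0.2]], toy_decode)
```

Because identities travel with the queries themselves, no separate instance-matching step is needed between frames, which is the contrast the Figure 1 caption draws with post-processing approaches.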