STEP: Segmenting and Tracking Every Pixel

Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, Aljoša Ošep, Laura Leal-Taixé, Liang-Chieh Chen

TL;DR

This work tackles dense, pixel-precise video understanding by introducing STEP, a real-world benchmark comprising two datasets, KITTI-STEP and MOTChallenge-STEP, for studying long-term pixel-level segmentation and tracking. It proposes STQ, a metric that jointly assesses segmentation and tracking by computing AQ (association quality) and SQ (segmentation quality) and taking their geometric mean, $STQ = \sqrt{AQ \times SQ}$. STQ is evaluated at the pixel level across entire videos and decouples semantic labeling from tracking IDs. The authors provide semi-automatic, crowd-augmented annotations merged with MOTS ground-truth, establish baselines spanning single-frame and multi-frame models (including Motion-DeepLab and VPSNet), and show that STQ captures both aspects better than existing metrics such as VPQ and PTQ. The datasets and metric offer a practical test-bed for long-horizon dense video understanding and can drive the development of unified models that optimize segmentation and tracking simultaneously under real-world conditions.
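As a small illustration of why a geometric mean is used to combine the two terms, the sketch below computes STQ from already-computed AQ and SQ values. It does not compute AQ or SQ themselves. The function name `stq` and the assumption that both inputs lie in $[0, 1]$ are illustrative choices, not taken from the authors' released code.

```python
import math


def stq(aq: float, sq: float) -> float:
    """Combine association quality (AQ) and segmentation quality (SQ)
    into STQ = sqrt(AQ * SQ).

    Assumption: both scores are already computed and lie in [0, 1].
    """
    if not (0.0 <= aq <= 1.0 and 0.0 <= sq <= 1.0):
        raise ValueError("AQ and SQ are assumed to lie in [0, 1]")
    return math.sqrt(aq * sq)


# A balanced model and a lopsided model with the same arithmetic mean (0.5).
balanced = stq(0.5, 0.5)
lopsided = stq(0.9, 0.1)
arithmetic_mean = (0.9 + 0.1) / 2

print(f"balanced STQ: {balanced:.3f}")
print(f"lopsided STQ: {lopsided:.3f}")
print(f"arithmetic mean of lopsided scores: {arithmetic_mean:.3f}")
```

With these inputs the lopsided model scores lower under the geometric mean than under the arithmetic mean, so a method cannot compensate for poor tracking with strong segmentation alone, or the reverse.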

Abstract

The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation. Our work is the first that targets this task in a real-world setting requiring dense interpretation in both spatial and temporal domains. As the ground-truth for this task is difficult and expensive to obtain, existing datasets are either constructed synthetically or only sparsely annotated within short video clips. To overcome this, we introduce a new benchmark encompassing two datasets, KITTI-STEP, and MOTChallenge-STEP. The datasets contain long video sequences, providing challenging examples and a test-bed for studying long-term pixel-precise segmentation and tracking under real-world conditions. We further propose a novel evaluation metric Segmentation and Tracking Quality (STQ) that fairly balances semantic and tracking aspects of this task and is more appropriate for evaluating sequences of arbitrary length. Finally, we provide several baselines to evaluate the status of existing methods on this new challenging dataset. We have made our datasets, metric, benchmark servers, and baselines publicly available, and hope this will inspire future research.

Paper Structure

This paper contains 19 sections, 16 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Our proposed ground-truth labels of KITTI-STEP (top) and MOTChallenge-STEP (bottom).
  • Figure 2: Annotation process: The machine-generated semantic segmentation from Panoptic-DeepLab is corrected by human annotators over multiple refinement rounds. The resulting annotation is then merged with the existing instance ground-truth from KITTI-MOTS and MOTS-Challenge.
  • Figure 3: Label distribution in KITTI-STEP and MOTChallenge-STEP.
  • Figure 4: Dataset statistics, comparison and track length distribution of KITTI-STEP.
  • Figure 5: An illustration of association precision, association recall and the removal of correct segments with wrong track ID for tracks of up to 5 frames. Each car represents a single frame, and colors encode track IDs. We assume perfect segmentation and show matched tracks. For example, the left scenario contains two ground-truth tracks (orange, blue), while the prediction contains a single track (violet) that overlaps with both ground-truth tracks. Here, only the change of colors is important. Predictions should ideally have color transitions at the same frames as the ground-truth, if any. VPQ$^{\dagger}$ refers to the VPQ score when evaluated on full videos instead of small spans. STQ is the only metric that properly penalizes ID transfer (#1, P4), encourages long-term track consistency (#3 $>$ #2, P4), and reduces the score when removing semantically correct predictions (#4 $>$ #5, P5).
  • ...and 3 more figures