Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation

Finlay G. C. Hudson, William A. P. Smith

TL;DR

The paper tackles zero-shot amodal video segmentation by leveraging a single visible query mask and temporal context to infer complete object shapes behind occluders. It introduces TABE, a diffusion-based outpainting pipeline guided by per-frame target regions derived from depth and approximate amodal boxes, and TABE-51, a realistic, occlusion-rich dataset created by compositing real video clips. A dedicated occlusion reasoning module and fine-tuning regime improve temporal coherence and amodal fidelity, and the evaluation metrics isolate amodal completion. Results show TABE consistently outperforms contemporary baselines across occlusion-focused metrics, highlighting its practical potential for robust occlusion handling in video understanding tasks.

Abstract

We present Track Anything Behind Everything (TABE), a novel dataset, pipeline, and evaluation framework for zero-shot amodal completion from visible masks. Unlike existing methods that require pretrained class labels, our approach uses a single query mask from the first frame where the object is visible, enabling flexible, zero-shot inference. Our dataset, TABE-51, provides highly accurate ground truth amodal segmentation masks without the need for human estimation or 3D reconstruction. Our TABE pipeline is specifically designed to handle amodal completion, even in scenarios where objects are completely occluded. We also introduce a specialised evaluation framework that isolates amodal completion performance, free from the influence of traditional visual segmentation metrics.
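
To make the idea of a metric that "isolates amodal completion" concrete, the sketch below scores a prediction only on pixels outside the visible (modal) mask. This is an illustration under stated assumptions, not the paper's metric: the abstract does not give the exact definition, and the function name `occluded_iou` and its arguments are hypothetical.

```python
import numpy as np


def occluded_iou(pred_amodal, gt_amodal, visible):
    """IoU restricted to pixels that are NOT in the visible (modal) mask.

    Hypothetical helper: one plausible way to score only the occluded
    part of an amodal prediction. All inputs are same-shape 2D masks.
    """
    pred = np.asarray(pred_amodal, dtype=bool)
    gt = np.asarray(gt_amodal, dtype=bool)
    hidden = ~np.asarray(visible, dtype=bool)

    # Drop visible pixels so that easy, already-tracked regions
    # cannot inflate the score.
    pred_occ = pred & hidden
    gt_occ = gt & hidden

    union = np.logical_or(pred_occ, gt_occ).sum()
    if union == 0:
        return None  # no occluded pixels in either mask: score undefined
    inter = np.logical_and(pred_occ, gt_occ).sum()
    return float(inter / union)


# Toy frame: the object spans rows 1-2 across all four columns,
# but only its left half (columns 0-1) is visible.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, :] = True
vis = np.zeros((4, 4), dtype=bool)
vis[1:3, :2] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, :3] = True  # recovers half of the hidden region

score = occluded_iou(pred, gt, vis)
```

In this toy case a standard IoU over the whole mask would be 6/8, while the occluded-only score is 2/4, which shows how visible pixels can mask poor completion of the hidden region.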

Paper Structure

This paper contains 13 sections, 7 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Given an input video (top) and a prompt (e.g. point clicks) to define a query mask (top, white), we track segmentation masks of the visible regions (modal masks) using a video object segmentation method such as SAM 2 [ravi2024sam] (top, red). We propose to use a fine-tuned generative video diffusion model to outpaint the occluded object (middle), providing zero-shot amodal video object segmentation (bottom).
  • Figure 2: Overview of our TABE pipeline. This figure demonstrates how input frames from a video, combined with a single query mask, are processed to produce high-quality amodal completion segmentation masks.
  • Figure 3: An overview of our dataset curation approach: from a static camera we observe two different scenes, shown in the top and middle rows. We then extract the door from the middle scene and composite it onto the top scene, creating a realistic test scene (bottom row) while maintaining an accurate ground truth amodal segmentation mask.
  • Figure 4: Issue with TCOW [vanhoorick2023tcow] metrics for amodal completion: the top row shows the input frames, whilst the middle row shows the query mask (left box) and target mask (right box). The bottom row presents the model's output heatmaps, highlighting that evaluating the model solely against the target masks overlooks the non-visible pixels that are missed in previous frames.
  • Figure 5: Pix2gestalt [ozguroglu2024pix2gestalt] cannot amodally complete with very little information. The left image shows the frame of interest, the center image shows this frame with visible pixels labelled (green) and the ground truth amodal mask (blue), while the right image shows the frame with pix2gestalt's prediction mask (red).
  • ...and 13 more figures
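
The compositing idea in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration, assuming a static camera, pixel-aligned frames, and a binary occluder mask that is already available; the function `composite_occluder` and its arguments are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def composite_occluder(scene, occluder_scene, occluder_mask, amodal_gt):
    """Paste an occluder from one clip onto another clip's frame.

    scene, occluder_scene: HxWx3 frames from the same static camera.
    occluder_mask:         HxW mask of the occluder (e.g. the door).
    amodal_gt:             HxW full mask of the target object, taken
                           from the unoccluded scene.
    Returns the composited frame and the resulting visible mask.
    """
    occ = np.asarray(occluder_mask, dtype=bool)
    # Where the occluder is present, take its pixels; elsewhere keep the scene.
    frame = np.where(occ[..., None], occluder_scene, scene)
    # The amodal ground truth is unchanged by compositing; only the
    # visible part shrinks to whatever the occluder does not cover.
    visible = np.asarray(amodal_gt, dtype=bool) & ~occ
    return frame, visible


# Tiny 2x2 example: the object occupies the top row, and the occluder
# covers the top-left pixel.
scene = np.zeros((2, 2, 3), dtype=np.uint8)
occluder_scene = np.full((2, 2, 3), 255, dtype=np.uint8)
occluder_mask = np.array([[True, False], [False, False]])
amodal_gt = np.array([[True, True], [False, False]])

frame, visible = composite_occluder(scene, occluder_scene, occluder_mask, amodal_gt)
```

Because the target object was recorded without the occluder, its full mask is known exactly, which is why this kind of setup needs neither human estimation nor 3D reconstruction of hidden regions.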