Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation
Finlay G. C. Hudson, William A. P. Smith
TL;DR
The paper tackles zero-shot amodal video segmentation: given a single visible query mask, it uses temporal context to infer complete object shapes behind occluders. It introduces TABE, a diffusion-based outpainting pipeline guided by per-frame target regions derived from depth and approximate amodal boxes, and TABE-51, a realistic, occlusion-rich dataset created by compositing real video clips. A dedicated occlusion reasoning module and fine-tuning regime improve temporal coherence and amodal fidelity, and the evaluation metrics isolate amodal completion. In the reported results, TABE consistently outperforms contemporary baselines on occlusion-focused metrics, which suggests practical potential for robust occlusion handling in video understanding.
Abstract
We present Track Anything Behind Everything (TABE), a novel dataset, pipeline, and evaluation framework for zero-shot amodal completion from visible masks. Unlike existing methods that require pretrained class labels, our approach uses a single query mask from the first frame where the object is visible, enabling flexible, zero-shot inference. Our dataset, TABE-51, provides highly accurate ground-truth amodal segmentation masks without the need for human estimation or 3D reconstruction. Our TABE pipeline is specifically designed to handle amodal completion, even in scenarios where objects are completely occluded. We also introduce a specialised evaluation framework that isolates amodal completion performance, free from the influence of traditional visual segmentation metrics.
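To make concrete what "isolating amodal completion performance" could mean, here is a minimal sketch of an IoU computed only over pixels outside the visible mask. This is an illustrative assumption, not the paper's exact metric: the function name `occluded_iou`, the nested-list mask representation, and the choice to return `None` for frames with no hidden pixels are all hypothetical.

```python
from typing import List, Optional

Mask = List[List[bool]]  # hypothetical representation: rows of per-pixel flags


def occluded_iou(pred_amodal: Mask, gt_amodal: Mask, visible: Mask) -> Optional[float]:
    """IoU between predicted and ground-truth amodal masks, ignoring visible pixels.

    Pixels inside `visible` are skipped, so the score reflects only how well
    the hidden part of the object was completed (an assumed formulation).
    Returns None when neither mask has any pixel outside the visible region.
    """
    intersection = 0
    union = 0
    for pred_row, gt_row, vis_row in zip(pred_amodal, gt_amodal, visible):
        for p, g, v in zip(pred_row, gt_row, vis_row):
            if v:
                continue  # visible pixels do not count towards the score
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    return intersection / union if union else None


# Toy 1x4 frame: the object spans all four pixels, the right half is occluded.
gt = [[True, True, True, True]]
vis = [[True, True, False, False]]
pred = [[True, True, True, False]]  # recovers one of the two hidden pixels

score = occluded_iou(pred, gt, vis)  # 0.5
```

On this toy frame a standard IoU over the whole mask would be 0.75, because the two visible pixels are trivially correct. Restricting the computation to the hidden region gives 0.5, which is the kind of separation between visible segmentation and amodal completion that the abstract describes.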
