SMITE: Segment Me In TimE
Amirhossein Alimohammadi, Sauradip Nag, Saeid Asgari Taghanaki, Andrea Tagliasacchi, Ghassan Hamarneh, Ali Mahdavi Amiri
TL;DR
SMITE addresses video segmentation with flexible granularity by leveraging a pre-trained text-to-image diffusion model augmented with temporal tracking and low-pass regularization. It learns generalizable segment representations from a handful of reference images through text-embedding optimization and cross-attention fine-tuning, and it enforces temporal consistency through bidirectional tracking-based voting on Weighted Accumulated Self-Attention maps. The method introduces an inflated UNet for video conditioning and a spatio-temporal guidance framework that minimizes a combined energy during inference, reducing flicker while keeping the segmentation faithful to the references. Empirically, SMITE achieves superior performance on the new SMITE-50 dataset and competitive results on the DAVIS and PUMaVOS benchmarks. User studies confirm improved segmentation quality and temporal coherence, highlighting practical benefits for VFX and video editing workflows.
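The tracking-based voting idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that point tracks have already been computed by some tracker and that each tracked point carries a hard segment label per frame, whereas the method described above votes on attention maps. The function name `temporal_vote` and its parameters are hypothetical.

```python
import numpy as np


def temporal_vote(track_labels: np.ndarray, radius: int, num_classes: int) -> np.ndarray:
    """Majority-vote each tracked point's label over a bidirectional window.

    track_labels: integer array of shape (T, N); entry [t, n] is the segment
        label of tracked point n in frame t (assumed to be given).
    radius: number of frames to look backward and forward.
    num_classes: number of distinct segment labels.
    """
    num_frames, _ = track_labels.shape
    # One-hot encode the labels so votes can be summed: shape (T, N, C).
    one_hot = np.eye(num_classes, dtype=np.int64)[track_labels]
    voted = np.empty_like(track_labels)
    for t in range(num_frames):
        # Window covers past and future frames, clipped at the video ends.
        lo = max(0, t - radius)
        hi = min(num_frames, t + radius + 1)
        counts = one_hot[lo:hi].sum(axis=0)  # votes per point and class: (N, C)
        voted[t] = counts.argmax(axis=1)  # ties resolve to the lowest class id
    return voted
```

With a window radius of 1, a label that flips for a single frame along a track is outvoted by its two neighbours, which is the kind of flicker that temporal voting is meant to suppress.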
Abstract
Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation has arbitrary granularity, meaning that the number of segments can vary arbitrarily and the masks are defined from only one or a few sample images. In this paper, we address this issue by employing a pre-trained text-to-image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively manage various segmentation scenarios and outperforms state-of-the-art alternatives.
