SMITE: Segment Me In TimE

Amirhossein Alimohammadi, Sauradip Nag, Saeid Asgari Taghanaki, Andrea Tagliasacchi, Ghassan Hamarneh, Ali Mahdavi Amiri

TL;DR

SMITE addresses the challenge of video segmentation with flexible granularity by leveraging a pre-trained text-to-image diffusion model augmented with temporal tracking and low-pass regularization. It learns generalizable segment representations from a handful of reference images through text-embedding optimization and cross-attention fine-tuning, while enforcing temporal consistency via a bidirectional tracking-based voting on Weighted Accumulated Self-Attention maps. The method introduces an inflated UNet for video conditioning and a spatio-temporal guidance framework that minimizes a combined energy during inference to reduce flicker and maintain segmentation fidelity to references. Empirically, SMITE achieves superior performance on the new SMITE-50 dataset and competitive results on DAVIS/PUMaVOS benchmarks, with user studies confirming improved segmentation quality and temporal coherence, highlighting practical benefits for VFX and video editing workflows.

Abstract

Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation has arbitrary granularity, meaning the number of segments can vary arbitrarily, and masks are defined based on only one or a few sample images. In this paper, we address this issue by employing a pre-trained text-to-image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively manage various segmentation scenarios and outperforms state-of-the-art alternatives.

Paper Structure

This paper contains 21 sections, 17 equations, 15 figures, 10 tables, 1 algorithm.

Figures (15)

  • Figure 1: SMITE. Using only one or few segmentation references with fine granularity (left), our method learns to segment different unseen videos respecting the segmentation references.
  • Figure 2: SMITE pipeline. During inference (a), we invert a given video into a noisy latent by iteratively adding noise. We then use an inflated U-Net denoiser (b) along with the trained text embedding as input to denoise the segments. A tracking module ensures that the generated segments are spatially and temporally consistent via spatio-temporal guidance. The video latent $z_{t}$ is updated by a tracking energy $\mathcal{E}_{track}$ (c) that makes the segments temporally consistent and also a low-frequency regularizer (d) $\mathcal{E}_{reg}$ which guides the model towards better spatial consistency.
  • Figure 3: Video best viewed in Acrobat.
  • Figure 4: The segment tracking module ensures that segments are consistent across time. It uses co-tracker to track each point of the object's segment (here, the nose) and then finds the point correspondences of this segment (denoted by blue dots) across timesteps. When a tracked point is assigned a different class (e.g., face), its label is recovered by temporal voting. The misclassified pixel is then replaced by the average of the neighbouring pixels of adjacent frames. This results in temporally consistent segments without visible flicker.
  • Figure 5: SMITE-50 Dataset sample.
  • ...and 10 more figures
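
The temporal voting idea in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified version that assumes hard per-frame labels and takes a majority vote along each track, whereas the caption describes averaging neighbouring pixels of adjacent frames. The function name `temporal_vote`, the `window` parameter, and the array layouts are all hypothetical, and the point tracks are assumed to be supplied already (for example by a point tracker) rather than computed here.

```python
import numpy as np

def temporal_vote(labels, tracks, window=1):
    """Majority-vote the label of each tracked point over nearby frames.

    labels: (T, H, W) array of non-negative integer segment labels per frame.
    tracks: (T, N, 2) integer array; tracks[t, n] is the (row, col) position
            of tracked point n in frame t (assumed given, hypothetical layout).
    window: number of frames on each side that take part in the vote.
    Returns a corrected copy of `labels`; the input is left unchanged.
    """
    T, N = tracks.shape[0], tracks.shape[1]
    out = labels.copy()
    num_classes = int(labels.max()) + 1

    # Label observed along each track in every frame, shape (T, N).
    frame_idx = np.arange(T)[:, None]
    along = labels[frame_idx, tracks[..., 0], tracks[..., 1]]

    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        for n in range(N):
            # Count votes from frames lo..hi-1; ties go to the lowest label id.
            votes = np.bincount(along[lo:hi, n], minlength=num_classes)
            r, c = tracks[t, n]
            out[t, r, c] = votes.argmax()
    return out
```

For example, if a point is labelled "nose" in the frames before and after but "face" in the middle frame, the vote over the three frames restores "nose" in the middle frame. A version closer to the caption would operate on soft per-class scores and average them across adjacent frames instead of counting hard votes.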