Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

Thanos Delatolas, Vicky Kalogeiton, Dim P. Papadopoulos

TL;DR

This work tackles zero-shot video object segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data, by exploiting features from pre-trained diffusion models. It systematically analyzes which diffusion model, diffusion time step, and decoder layer yield the most discriminative representations, and introduces a memory-based affinity propagation framework augmented with a MAG-Filter and a prompt-learning module to improve correspondences. A key finding is that features from diffusion models trained on ImageNet outperform those from models trained on larger, more diverse datasets. Precise point correspondences substantially drive segmentation quality, and prompt learning further aligns the cross-attention map with the first-frame mask. The proposed ADM-based, MAG-filtered approach achieves state-of-the-art ZS-VOS results on DAVIS-17 and MOSE and performs on par with methods trained on expensive image segmentation data.

Abstract

This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS-VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS-VOS by identifying the most suitable time step and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS-17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy, and we achieve state-of-the-art results in ZS-VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.

Paper Structure

This paper contains 14 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: We leverage pre-trained diffusion models for Zero-Shot Video Object Segmentation by addressing key challenges: selecting the appropriate diffusion model, determining the optimal time step, identifying the best feature extraction layer, and designing an effective affinity matrix calculation strategy to match the features.
  • Figure 2: Sequentially segmenting a video with powerful feature extractors [dino, Rombach_2022_CVPR] and past predictions. Given a memory of $N$ past frames and their corresponding predicted segmentation masks, we segment the query frame by first calculating the affinity matrix $\mathcal{A}$ between the query and memory frames, and then multiplying $\mathcal{A}$ with the past predicted segmentation masks.
  • Figure 3: Correspondences. (a) We show the FG-FG, BG-BG, and FG-BG correspondences. (b) We show the correspondence vectors in Cartesian space. (c) We filter out the correspondences with our MAG-Filter.
  • Figure 4: Prompt Learning strategy in ZS-VOS. Given the first frame of the video, $I_1$, and its corresponding segmentation mask, $m_1$, we optimize a text token so that its cross-attention map, $m_{ca}$, approximates $m_1$.
  • Figure 5: Ablation on layer and time step. We show the $\mathcal{J}\&\mathcal{F}$ accuracy on the DAVIS-17 val set [davis_17] for Stable Diffusion (v1.2 to v1.5 and v2.1), as well as the Ablated Diffusion Model (ADM) [adm], as a function of the diffusion time step and the decoder layer of the U-Net.
  • ...and 3 more figures
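
The propagation step described in the Figure 2 caption (compute an affinity matrix $\mathcal{A}$ between query and memory features, then multiply $\mathcal{A}$ with the past predicted masks) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `propagate_masks`, the cosine-similarity affinity, the softmax normalization, and the `temperature` parameter are all assumptions chosen for illustration. The sketch also omits the MAG-Filter and the prompt-learning module, whose details are not given here.

```python
import numpy as np

def propagate_masks(query_feats, memory_feats, memory_masks, temperature=0.1):
    """Propagate memory masks to a query frame via a feature affinity matrix.

    query_feats:  (P, C) features at P query-frame locations.
    memory_feats: (M, C) features at M memory locations (N frames, flattened).
    memory_masks: (M, K) soft or one-hot masks over K objects (incl. background).
    Returns a (P, K) array of soft mask scores for the query frame.

    Assumption: cosine similarity followed by a softmax over memory locations;
    the paper's exact affinity computation may differ.
    """
    # L2-normalize so the dot product is a cosine similarity.
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-8)
    m = memory_feats / (np.linalg.norm(memory_feats, axis=1, keepdims=True) + 1e-8)

    # Affinity logits between every query and every memory location: (P, M).
    logits = (q @ m.T) / temperature

    # Numerically stable softmax over the memory axis, so each row sums to 1.
    logits = logits - logits.max(axis=1, keepdims=True)
    affinity = np.exp(logits)
    affinity = affinity / affinity.sum(axis=1, keepdims=True)

    # Multiply the affinity matrix with the past masks: (P, M) @ (M, K) -> (P, K).
    return affinity @ memory_masks


# Toy usage: 4 memory locations with orthogonal features, 2 classes.
memory_feats = np.eye(4)
memory_masks = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
query_feats = np.eye(4)[[2, 0]]  # query locations matching memory rows 2 and 0
query_masks = propagate_masks(query_feats, memory_feats, memory_masks)
query_labels = query_masks.argmax(axis=1)
```

In this toy case each query location copies the label of the memory location it resembles most. The first query matches a memory location labeled as class 1 and the second matches one labeled as class 0. With real features, `query_feats` and `memory_feats` would come from the chosen diffusion model at a given time step and decoder layer, which is exactly the choice ablated in Figure 5.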