Studying Image Diffusion Features for Zero-Shot Video Object Segmentation
Thanos Delatolas, Vicky Kalogeiton, Dim P. Papadopoulos
TL;DR
This work tackles zero-shot video object segmentation (ZS-VOS) without fine-tuning on video data or training on image segmentation data, by exploiting features from pre-trained diffusion models. It systematically analyzes which diffusion model, diffusion timestep, and decoder layer yield the most discriminative representations, and introduces a memory-based affinity propagation framework augmented with a MAG-Filter and a prompt-learning module to improve correspondences. A key finding is that features from diffusion models trained on ImageNet outperform those from models trained on larger datasets. A second finding is that precise point correspondences substantially drive segmentation quality, with prompt learning further aligning cross-attention to the first-frame mask. The proposed ADM-based, MAG-filtered approach achieves state-of-the-art ZS-VOS results on DAVIS-17 and MOSE and performs on par with methods trained on expensive image segmentation data, offering a scalable VOS solution that needs no video or segmentation training data.
Abstract
This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS-VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS-VOS by identifying the most suitable timestep and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS-17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy, and we achieve state-of-the-art results in ZS-VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.
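To make the idea of affinity-based propagation concrete, the following is a minimal sketch of generic label propagation from memory-frame pixels to query-frame pixels using feature affinities. It is not the authors' implementation: the function name `propagate_labels`, the top-k restriction, the softmax temperature, and the use of cosine similarity are all illustrative assumptions, and the MAG-Filter and prompt-learning components are not modeled. In practice the feature arrays would come from a chosen layer and timestep of a pre-trained diffusion model; here they are plain NumPy arrays.

```python
import numpy as np


def propagate_labels(mem_feats, mem_labels, query_feats, top_k=5, temperature=0.1):
    """Propagate soft labels from memory pixels to query pixels (illustrative sketch).

    mem_feats:   (N, C) features of memory-frame pixels
    mem_labels:  (N, K) one-hot or soft labels of the memory pixels
    query_feats: (M, C) features of query-frame pixels
    returns:     (M, K) soft labels for the query pixels
    """
    eps = 1e-8  # guards against division by zero for all-zero feature vectors
    # L2-normalize so that the dot product below is a cosine similarity.
    mem = mem_feats / (np.linalg.norm(mem_feats, axis=1, keepdims=True) + eps)
    qry = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + eps)
    affinity = qry @ mem.T  # (M, N) affinity between every query and memory pixel

    # Assumption: keep only the k most similar memory pixels per query pixel.
    k = min(top_k, affinity.shape[1])
    top_idx = np.argpartition(-affinity, k - 1, axis=1)[:, :k]  # (M, k)
    top_aff = np.take_along_axis(affinity, top_idx, axis=1)     # (M, k)

    # Numerically stable softmax over the retained affinities.
    logits = top_aff / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each query label is the weighted average of its selected memory labels.
    return np.einsum("mk,mkc->mc", weights, mem_labels[top_idx])


# Toy usage: two memory pixels with distinct labels, two query pixels.
mem_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
mem_labels = np.array([[1.0, 0.0], [0.0, 1.0]])
query_feats = np.array([[0.9, 0.1], [0.1, 0.9]])
pred = propagate_labels(mem_feats, mem_labels, query_feats)
pred_classes = pred.argmax(axis=1)  # hard label per query pixel
```

In this sketch the quality of the propagated mask depends entirely on how well the affinities single out the correct memory pixels, which is one way to read the abstract's emphasis on point correspondences.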
