Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models

Qian Wang, Abdelrahman Eldesokey, Mohit Mendiratta, Fangneng Zhan, Adam Kortylewski, Christian Theobalt, Peter Wonka

TL;DR

This work tackles zero-shot Video Semantic Segmentation (VSS) using pre-trained diffusion models. It introduces a three-component pipeline (a Scene Context Model, Correspondence-Based Refinement, and Masked Modulation) that produces temporally coherent per-frame segmentations without any training. On VSPW, Cityscapes, and CamVid, the method substantially outperforms zero-shot image segmentation baselines, and on VSPW it comes close to supervised VSS methods, with SD-based features delivering the strongest performance. The results suggest that diffusion features are a flexible foundation for video segmentation, offering strong temporal consistency and practical zero-shot capabilities.

Abstract

We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models. A growing research direction attempts to employ diffusion models to perform downstream vision tasks by exploiting their deep understanding of image semantics. Yet, the majority of these approaches have focused on image-related tasks like semantic correspondence and segmentation, with less emphasis on video tasks such as VSS. Ideally, diffusion-based image semantic segmentation approaches can be applied to videos in a frame-by-frame manner. However, we find their performance on videos to be subpar due to the absence of any modeling of temporal information inherent in the video data. To this end, we tackle this problem and introduce a framework tailored for VSS based on pre-trained image and video diffusion models. We propose building a scene context model based on the diffusion features, where the model is autoregressively updated to adapt to scene changes. This context model predicts per-frame coarse segmentation maps that are temporally consistent. To refine these maps further, we propose a correspondence-based refinement strategy that aggregates predictions temporally, resulting in more confident predictions. Finally, we introduce a masked modulation approach to upsample the coarse maps to the full resolution at a high quality. Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches significantly on various VSS benchmarks without any training or fine-tuning. Moreover, it rivals supervised VSS approaches on the VSPW dataset despite not being explicitly trained for VSS.

Paper Structure (29 sections, 6 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: We propose the first zero-shot diffusion-based approach for Video Semantic Segmentation (VSS). Our approach produces more temporally consistent predictions than the diffusion-based image segmentation method EmerDiff (Namekata et al., 2024).
  • Figure 2: A visualization of the first three PCA components for the features of two video frames extracted from the most semantically rich blocks in SD (Block 7) and SVD (Block 8). In the second row, we show the $x$-$t$ slice of an image row (highlighted by the red line in the leftmost image) horizontally across the PCA visualization ($x$-axis) and stack it chronologically across the full batch of video frames ($t$-axis). The plot shows that the spatial features of both SD and SVD are temporally more consistent between video frames than the features of the temporal layers in SVD.
  • Figure 3: Our Video Semantic Segmentation (VSS) approach encompasses three stages. In Stage 1, we initialize a Scene Context Model $\Omega$ as a KNN classifier with the aggregated diffusion features $\widetilde{F}^1$ of the first frame and a coarse mask $M^1$ produced by K-Means clustering. In Stage 2, we use the context model $\Omega$ to predict coarse masks $M^{2\dots B}$ for the remaining frames in the batch, and we refine the coarse masks $M^{1 \dots B}$ using our correspondence-based refinement (CBR). In Stage 3, we use the refined coarse masks to modulate the attention layers of the diffusion process with a factor $\pm \lambda$ to obtain a modulated latent $\hat{z}_t$. Then, we blend $\hat{z}_t$ with the original unmodulated latent $z_t$ using the coarse masks to obtain a less noisy latent $\widetilde{z}_t$. Finally, the latent $\widetilde{z}_t$ is decoded to obtain images $I^+, I^-$, which are used to compute a set of difference maps, one per segment $l \in L$. The final predictions are made by applying an $\arg \max$ operation over the difference maps, similar to EmerDiff (Namekata et al., 2024). The process is repeated for the following batch of frames, where the context model is updated in an autoregressive manner using the coarse masks $M^{1\dots B}$ and their corresponding features $\widetilde{F}^{1\dots B}$.
  • Figure 4: A detailed illustration of (a) the scene context model and (b) correspondence-based refinement.
  • Figure 5: Qualitative comparison of different zero-shot methods. Note that the color of a segmentation cluster only represents the relative index of the clusters when the video is processed. The color itself does not map to an absolute label.
  • ...and 10 more figures
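
The Stage 1 and Stage 2 description in the Figure 3 caption (K-Means on the first frame, a KNN classifier as the scene context model, and an autoregressive update) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the names `kmeans`, `SceneContextModel`, and `n_neighbors` are hypothetical, the features are assumed to be already flattened to shape `(num_pixels, channels)`, and the update rule here simply appends new samples, which is an assumption about how the update might work.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's K-Means; returns one cluster index per feature vector."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centres from k distinct feature vectors (fancy indexing copies).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every centre.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels


class SceneContextModel:
    """Hypothetical KNN classifier over (feature, label) pairs from earlier frames."""

    def __init__(self, features, labels, n_neighbors=3):
        self.features = np.asarray(features, dtype=float)
        self.labels = np.asarray(labels)
        self.n_neighbors = n_neighbors

    def predict(self, queries):
        """Label each query feature by majority vote among its nearest stored features."""
        queries = np.asarray(queries, dtype=float)
        # Dense pairwise distances: fine for a sketch, costly for full-size feature maps.
        dists = ((queries[:, None, :] - self.features[None, :, :]) ** 2).sum(-1)
        nearest = np.argsort(dists, axis=1)[:, : self.n_neighbors]
        votes = self.labels[nearest]  # shape: (num_queries, n_neighbors)
        n_classes = int(self.labels.max()) + 1
        counts = np.stack(
            [(votes == c).sum(axis=1) for c in range(n_classes)], axis=1
        )
        return counts.argmax(axis=1)

    def update(self, features, labels):
        """Assumed autoregressive update: append the new frame's features and labels."""
        self.features = np.concatenate([self.features, np.asarray(features, dtype=float)])
        self.labels = np.concatenate([self.labels, np.asarray(labels)])
```

In this sketch, the first frame's features would be clustered with `kmeans` to obtain a coarse mask, a `SceneContextModel` would be built from that pair, and each later frame would be labelled with `predict` before being passed to `update`. The correspondence-based refinement and masked modulation stages are not covered.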