
Zero-Shot Scene Change Detection

Kyusik Cho, Dong Yeop Kim, Euntai Kim

TL;DR

The paper tackles Scene Change Detection (SCD) without training data by reusing a pre-trained tracking model to compare a reference image with a query image, reframing SCD as a tracking problem. It introduces two training-free mechanisms, an adaptive content threshold and a style bridging layer, to address content gaps and style variations, respectively, and extends the approach to video to exploit temporal information. In experiments on ChangeSim, VL-CMU-CD, and PCD, the method shows consistent cross-domain performance and results competitive with trained baselines, without any data-labeling cost. This makes it practical for real-world deployment, where style variation and labeling costs hinder supervised SCD methods, and it provides a single framework for zero-shot SCD in both images and video.

Abstract

We present a novel, training-free approach to scene change detection. Our method leverages tracking models, which inherently perform change detection between consecutive frames of video by identifying common objects and detecting new or missing objects. Specifically, our method takes advantage of the change detection effect of the tracking model by inputting reference and query images instead of consecutive frames. Furthermore, we focus on the content gap and style gap between the two input images in change detection, and address both issues by proposing an adaptive content threshold and style bridging layers, respectively. Finally, we extend our approach to video, leveraging rich temporal information to enhance the performance of scene change detection. We compare our approach with baselines through various experiments. While existing training-based baselines tend to specialize only in the trained domain, our method shows consistent performance across various domains, demonstrating the competitiveness of our approach.

Paper Structure

This paper contains 26 sections, 8 equations, 9 figures, and 10 tables.

Figures (9)

  • Figure 1: The basic idea of SCD with a tracking model. (a) We execute the tracking model $G$ with $r$ and $q$. (b) We denote the tracking result from $r$ to $q$ as $M^{r \to q} = G(r, q, M^r)$, and the tracking result from $q$ to $r$ as $M^{q \to r} = G(q, r, M^q)$. (c) 'Missing' objects are the objects that exist in $r$ but not in $q$. Therefore, we compare $M^r$ and $M^{r \to q}$ to find missing objects. Conversely, 'new' objects are identified by comparing $M^q$ and $M^{q \to r}$. (d) The final prediction is the simple combination of the new and missing objects.
  • Figure 2: Illustration of the content threshold. Since the yellow forklift in $q$ does not appear in $r$, all three masks (blue, red, and yellow) in $M^q$ have no associated masks in $M^{q \to r}$. However, the tracking model creates a small area of the blue mask in $M^{q \to r}$ due to the content gap. This causes the blue mask to be mistakenly classified as a static object. To address this, we propose a content threshold to filter out masks whose area shrinks significantly after tracking.
  • Figure 3: Illustration of the style bridging layer. During the processing of the first image, the style is saved while the feature is passed through unchanged. When processing the second image, the saved style is applied to the feature.
  • Figure 4: Zero-shot SCD in video. We conduct SCD on video sequences by providing sequence pairs instead of image pairs as input to the tracking model $G$. For each frame, the mask is propagated from the previous frame, resulting in a mask sequence through repeated propagation. SCD in the video is finalized by comparing the mask sequences.
  • Figure 5: Qualitative results. Our approach successfully performs change detection across various datasets without training. For more qualitative results, see the supplementary material.
  • ...and 4 more figures
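
The comparison described in Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: masks are represented as dictionaries from object id to a set of pixel coordinates, the tracking model $G$ is any callable `track(src, dst, masks)`, and the fixed `ratio` stands in for the paper's adaptive content threshold, whose exact form is not given in this excerpt. The names `lost_objects`, `detect_changes`, and `toy_tracker` are hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

Pixel = Tuple[int, int]
Image = Dict[Pixel, str]               # toy image: pixel -> content label
Masks = Dict[int, Set[Pixel]]          # object id -> set of (row, col) pixels
Tracker = Callable[[Image, Image, Masks], Masks]  # stand-in for G(src, dst, M)


def lost_objects(before: Masks, after: Masks, ratio: float) -> Masks:
    """Return masks whose tracked area fell below ratio * original area."""
    lost: Masks = {}
    for obj_id, pixels in before.items():
        tracked_area = len(after.get(obj_id, set()))
        # Content threshold (assumed fixed here): a tiny leftover region
        # after tracking still counts as "object not found".
        if tracked_area < ratio * len(pixels):
            lost[obj_id] = pixels
    return lost


def detect_changes(r: Image, q: Image, masks_r: Masks, masks_q: Masks,
                   track: Tracker, ratio: float = 0.5):
    """Return (missing, new, changed_pixels) for reference r and query q."""
    masks_r_to_q = track(r, q, masks_r)   # M^{r->q} = G(r, q, M^r)
    masks_q_to_r = track(q, r, masks_q)   # M^{q->r} = G(q, r, M^q)
    missing = lost_objects(masks_r, masks_r_to_q, ratio)  # in r, not in q
    new = lost_objects(masks_q, masks_q_to_r, ratio)      # in q, not in r
    changed: Set[Pixel] = set()
    for pixels in list(missing.values()) + list(new.values()):
        changed |= pixels                 # final prediction: union of both
    return missing, new, changed


def toy_tracker(src: Image, dst: Image, masks: Masks) -> Masks:
    """Hypothetical tracker: keeps pixels whose label is identical in dst."""
    return {i: {p for p in px if dst.get(p) == src.get(p)}
            for i, px in masks.items()}


# Toy 2x2 scene: a box disappears, a cart appears, the floor is unchanged.
r = {(0, 0): "box", (0, 1): "box", (1, 0): "floor", (1, 1): "floor"}
q = {(0, 0): "floor", (0, 1): "floor", (1, 0): "floor", (1, 1): "cart"}
masks_r = {1: {(0, 0), (0, 1)}, 2: {(1, 0)}}   # box, floor patch
masks_q = {1: {(1, 1)}}                        # cart

missing, new, changed = detect_changes(r, q, masks_r, masks_q, toy_tracker)
```

With these toy inputs, the box in `r` has no counterpart in `q` and is reported as missing, the cart in `q` is reported as new, and the floor patch tracks fully and is treated as static. A real tracking model would replace `toy_tracker`, and the threshold would be chosen adaptively rather than fixed.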