SemiCD-VL: Visual-Language Model Guidance Makes Better Semi-supervised Change Detector

Kaiyu Li, Xiangyong Cao, Yupeng Deng, Jiayi Song, Junmin Liu, Deyu Meng, Zhi Wang

TL;DR

SemiCD-VL addresses the challenge of change detection with limited labels by leveraging a vision-language model (VLM) to generate pseudo-change labels and guide semi-supervised learning. It introduces a five-part framework: mixed Change Event Generation (CEG) to produce reliable pseudo-labels, a VLM-guided loss, a dual projection head to separate sources of supervision, decoupled single-temporal semantic guidance via auxiliary decoders, and contrastive consistency regularization to sharpen change representations. The method achieves substantial gains over FixMatch and other baselines on LEVIR-CD and WHU-CD with only 5% labeled data, and the CEG approach yields strong unsupervised change detection performance as well. These results demonstrate the potential of plug-and-play VLM guidance for dense CD tasks and open avenues for cross-domain and end-to-end multi-temporal VLM integration in Earth observation pipelines.

Abstract

Change Detection (CD) aims to identify pixels with semantic changes between images. However, annotating large numbers of images at the pixel level is labor-intensive and costly, especially for multi-temporal images, which require pixel-wise comparisons by human experts. Given the excellent performance of visual language models (VLMs) on tasks such as zero-shot and open-vocabulary recognition with prompt-based reasoning, it is promising to use VLMs to improve CD under limited labeled data. In this paper, we propose a VLM guidance-based semi-supervised CD method, namely SemiCD-VL. The insight of SemiCD-VL is to synthesize free change labels using VLMs to provide additional supervision signals for unlabeled data. However, almost all current VLMs are designed for single-temporal images and cannot be directly applied to bi- or multi-temporal images. Motivated by this, we first propose a VLM-based mixed change event generation (CEG) strategy to yield pseudo labels for unlabeled CD data. Since the additional supervision signals provided by these VLM-driven pseudo labels may conflict with the pseudo labels from the consistency regularization paradigm (e.g., FixMatch), we propose a dual projection head to disentangle the different signal sources. Further, we explicitly decouple the semantic representations of the bi-temporal images through two auxiliary segmentation decoders, which are also guided by the VLM. Finally, to help the model capture change representations more adequately, we introduce metric-aware supervision via a feature-level contrastive loss in the auxiliary branches. Extensive experiments show the advantage of SemiCD-VL. For instance, SemiCD-VL improves the FixMatch baseline by +5.3 IoU on WHU-CD and by +2.4 IoU on LEVIR-CD with 5% labels. In addition, our CEG strategy, used in an unsupervised manner, achieves performance far superior to state-of-the-art unsupervised CD methods.
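
The core idea of turning single-temporal VLM predictions into change labels can be illustrated with a minimal sketch. This is not the paper's mixed CEG algorithm, whose details are not given here; it only shows a naive pixel-level comparison of two class maps. The function name `pixel_level_change_mask`, the `IGNORE` value, the per-pixel confidence inputs, and the `min_conf` threshold are all assumptions made for illustration.

```python
IGNORE = 255  # hypothetical label for unreliable pixels (assumption, not from the paper)


def pixel_level_change_mask(seg_t1, seg_t2, conf_t1, conf_t2, min_conf=0.5):
    """Compare two per-pixel class maps and return a change mask.

    seg_t1, seg_t2: nested lists of class ids predicted for each date.
    conf_t1, conf_t2: nested lists of prediction confidences in [0, 1].
    Output values: 1 = changed, 0 = unchanged, IGNORE = unreliable.
    """
    mask = []
    for row1, row2, crow1, crow2 in zip(seg_t1, seg_t2, conf_t1, conf_t2):
        out_row = []
        for cls1, cls2, p1, p2 in zip(row1, row2, crow1, crow2):
            if min(p1, p2) < min_conf:
                # Either prediction is uncertain: mark the pixel as unreliable.
                out_row.append(IGNORE)
            else:
                # Confident on both dates: a class difference counts as a change.
                out_row.append(1 if cls1 != cls2 else 0)
        mask.append(out_row)
    return mask


if __name__ == "__main__":
    seg_a = [[0, 1], [2, 2]]
    seg_b = [[0, 3], [2, 1]]
    conf_a = [[0.9, 0.9], [0.9, 0.2]]
    conf_b = [[0.9, 0.9], [0.9, 0.9]]
    print(pixel_level_change_mask(seg_a, seg_b, conf_a, conf_b))
```

A purely pixel-wise comparison like this is sensitive to small misalignments between the two images, which is the kind of noise the instance-level part of the mixed strategy is described as removing.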

Paper Structure (30 sections, 18 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Inputs and outputs of SemiCD-VL. (a) and (b) show the bi-temporal images and their change label. (c) shows the semantic segmentation masks obtained from the VLM's single-temporal image reasoning; the mixed CEG algorithm converts them into the change mask (d), which serves as a supplementary supervision signal. (e) is the prediction of SemiCD-VL after semi-supervised training (white rendering indicates pixels with semantic changes, black indicates no semantic changes, and gray indicates unreliable regions, which are ignored in the loss computation).
  • Figure 2: Overview of our SemiCD-VL framework. Utilizing the rich semantic representation of the VLM, we propose five strategies (highlighted in red) to guide semi-supervised CD. (1) The mixed CEG strategy combines pixel-level CEG and instance-level CEG to generate reliable change masks, which (2) guide the learning of unlabeled samples. (3) Dual projection heads are introduced to avoid conflicts with the supervision signals of the consistency regularization framework. (4) Two auxiliary segmentation decoders are activated during training to decouple the change prediction process, also benefiting from VLM guidance. (5) Contrastive consistency regularization is applied so that the model captures the change representation more explicitly. Modules whose weights are frozen are marked with an icon in the figure. For clarity, the features of the segmentation decoders for both weak and strong perturbations are denoted by $q_{t_1}$ and $q_{t_2}$. Components (1)-(5) correspond to the sections on mixed CEG through contrastive consistency regularization.
  • Figure 3: Visualization of direct inference using VLM with prompts: house, building, road, grass, tree, water. (The color rendering is random, just to distinguish different categories.)
  • Figure 4: The influence of category definition on VLM reasoning. (b) denotes the prediction mask when only foreground categories are defined, and (c) denotes the prediction mask when categories for both foreground and background are defined. The red frames in (a) indicate targets that are incorrectly assigned to the background when only the foreground category is defined. (The color rendering is random, just to distinguish different categories.)
  • Figure 5: Visualization of the change mask generated by pixel-level CEG and instance-level CEG. The white noise in (d) indicates the non-semantic changes due to object misalignment, which are erased by instance-level CEG in (e). White rendering indicates pixels with semantic changes, black indicates no semantic changes, and gray indicates unreliable regions, which are ignored in the loss computation.
  • ...and 1 more figure
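
The captions above state that gray (unreliable) regions are ignored in the loss computation. A minimal sketch of what that means for a per-pixel binary cross-entropy is given below, in plain Python so it stays dependency-free. The function name `masked_bce` and the `IGNORE` value of 255 are assumptions for illustration; the loss actually used by SemiCD-VL is not specified in this summary.

```python
import math

IGNORE = 255  # hypothetical label for gray / unreliable pixels (assumption)


def masked_bce(probs, targets, ignore_value=IGNORE, eps=1e-7):
    """Mean binary cross-entropy over pixels whose target is not ignore_value.

    probs: nested lists of predicted change probabilities in [0, 1].
    targets: nested lists with 1 (changed), 0 (unchanged), or ignore_value.
    Returns 0.0 when every pixel is ignored.
    """
    total, count = 0.0, 0
    for prob_row, target_row in zip(probs, targets):
        for p, t in zip(prob_row, target_row):
            if t == ignore_value:
                continue  # unreliable pixel: contributes nothing to the loss
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.5, 0.5]]
    pseudo_label = [[1, 0], [IGNORE, IGNORE]]
    print(masked_bce(probs, pseudo_label))
```

Averaging only over the pixels that were kept, rather than over all pixels, keeps the loss scale independent of how much of the pseudo label is marked unreliable.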