Table of Contents
Fetching ...

UniChange: Unifying Change Detection with Multimodal Large Language Model

Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li

TL;DR

UniChange presents a unified framework for change detection by leveraging a multimodal large language model. It unifies binary and semantic change detection through an embedding-as-change paradigm that uses three special tokens to condition pixel-level segmentation on textual prompts, enabling joint training on diverse, potentially conflicting datasets. The approach achieves state-of-the-art performance across WHU-CD, S2Looking, LEVIR-CD+, and SECOND, validating its generality and practical impact for remote sensing applications. By eliminating task-specific heads and enabling end-to-end optimization, UniChange offers a scalable path toward versatile, cross-domain change analysis in geospatial monitoring.

Abstract

Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.

UniChange: Unifying Change Detection with Multimodal Large Language Model

TL;DR

UniChange presents a unified framework for change detection by leveraging a multimodal large language model. It unifies binary and semantic change detection through an embedding-as-change paradigm that uses three special tokens to condition pixel-level segmentation on textual prompts, enabling joint training on diverse, potentially conflicting datasets. The approach achieves state-of-the-art performance across WHU-CD, S2Looking, LEVIR-CD+, and SECOND, validating its generality and practical impact for remote sensing applications. By eliminating task-specific heads and enabling end-to-end optimization, UniChange offers a scalable path toward versatile, cross-domain change analysis in geospatial monitoring.

Abstract

Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.

Paper Structure

This paper contains 21 sections, 20 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Inconsistency in multi-source change detection datasets.
  • Figure 2: Inconsistency in model architecture.
  • Figure 3: The overview structure of UniChange is shown in (a). UniChange generates text outputs based on text instructions and dual-temporal remote sensing images. The text outputs incorporate a series of special tokens ([T1], [T2], and [CHANGE]). Concurrently, the dual-temporal remote sensing images undergo feature extraction via the Vision Backbone. These features, alongside the embeddings corresponding to the special tokens, are fed into the Token Driven Decoder. Ultimately, UniChange generates the masks corresponding to the text instructions. The overall structure of the Token Driven Decoder is shown in (b). It receives dual-temporal remote sensing image features and special token embeddings, generating corresponding masks.
  • Figure 4: Visualisation results from UniChange. The images for the first question are sourced from the binary change detection dataset WHU-CD. The images for the second question are sourced from the semantic change detection dataset SECOND.
  • Figure 5: Visual comparisons of the UniChange with other state-of-the-art methods for binary change detection. Red means false positives (FP), while Blue denotes false negatives (FN). Samples (a) (b) (c) are from the WHU-CD dataset, (d) (e) (f) are from the S2Looking dataset, and (g) (h) (i) are from the LEVIR-CD+ dataset.
  • ...and 1 more figures