Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective

Duowang Zhu, Xiaohu Huang, Haiyan Huang, Hao Zhou, Zhenfeng Shao

TL;DR

Change3D reframes bi-temporal change detection and captioning as a video modeling problem: learnable perception frames are inserted between the two timepoints so that a lightweight video encoder models inter-frame changes directly, removing the need for task-specific change extractors. The method reports state-of-the-art performance on eight remote-sensing datasets covering binary change detection, semantic change detection, building damage assessment, and change captioning, while using only about 6%-13% of the parameters and 8%-34% of the FLOPs of prior state-of-the-art methods. The authors give empirical and theoretical support for a unified, video-based paradigm that improves information flow and inter-frame modeling, with practical implications for efficient, multi-task remote sensing analysis. The paper also includes ablations on architectures, pre-training, and perception-frame insertion strategies to guide future research.

Abstract

In this paper, we present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. Recent methods have achieved remarkable success by regarding each pair of bi-temporal images as separate frames. They employ a shared-weight image encoder to extract spatial features and then use a change extractor to capture differences between the two images. However, image feature encoding, being a task-agnostic process, cannot attend to changed regions effectively. Furthermore, different change extractors designed for various change detection and captioning tasks make it difficult to have a unified framework. To tackle these challenges, Change3D regards the bi-temporal images as comprising two frames akin to a tiny video. By integrating learnable perception frames between the bi-temporal images, a video encoder enables the perception frames to interact with the images directly and perceive their differences. Therefore, we can get rid of the intricate change extractors, providing a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (including binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework can achieve superior performance with an ultra-light video model comprising only ~6%-13% of the parameters and ~8%-34% of the FLOPs compared to state-of-the-art methods. We hope that Change3D could be an alternative to 2D-based models and facilitate future research.
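The input construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_video_input`, the zero initialization of the perception frames, and the channel-first output layout are assumptions. Only the per-task frame counts are taken from the paper (the Figure 3 caption). In a real model the perception frames would be learnable parameters; plain NumPy arrays stand in for them here.

```python
import numpy as np

# Number of perception frames per task, as stated in the Figure 3 caption.
PERCEPTION_FRAMES = {
    "binary_change_detection": 1,
    "semantic_change_detection": 3,
    "building_damage_assessment": 2,
    "change_captioning": 1,
}


def build_video_input(img_t1, img_t2, perception_frames):
    """Stack [T1, perception frames..., T2] into one short clip.

    img_t1, img_t2: arrays of shape (C, H, W), the bi-temporal images.
    perception_frames: array of shape (K, C, H, W). Learnable in a real
        model; a fixed array here (assumption for illustration).
    Returns an array of shape (C, K + 2, H, W), a channel-first clip
    layout of the kind 3D-convolutional video encoders commonly take.
    """
    if img_t1.shape != img_t2.shape:
        raise ValueError("bi-temporal images must have the same shape")
    if perception_frames.shape[1:] != img_t1.shape:
        raise ValueError("perception frames must match the image shape")
    # Concatenate along a new leading time axis: (K + 2, C, H, W).
    frames = np.concatenate(
        [img_t1[None], perception_frames, img_t2[None]], axis=0
    )
    # Move channels first: (C, K + 2, H, W).
    return frames.transpose(1, 0, 2, 3)


# Usage: a semantic change detection clip with three perception frames.
rng = np.random.default_rng(0)
t1 = rng.random((3, 64, 64))
t2 = rng.random((3, 64, 64))
k = PERCEPTION_FRAMES["semantic_change_detection"]
clip = build_video_input(t1, t2, np.zeros((k, 3, 64, 64)))
print(clip.shape)  # (3, 5, 64, 64)
```

Because the perception frames sit between the two images on the time axis, a video encoder that mixes information across time can let them draw on both images directly, which is the role the abstract assigns to them.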

Paper Structure

This paper contains 23 sections, 6 equations, 11 figures, 20 tables.

Figures (11)

  • Figure 1: The parameter distribution in existing change detection and captioning methods shows that most parameters are devoted to image encoding, with few allocated to change extraction. This imbalance suggests insufficient emphasis on task-related parameter learning. In contrast, our approach primarily focuses on video encoding, a task-specific process that effectively extracts changes.
  • Figure 2: Previous paradigm vs. our paradigm. (a) The previous paradigm treats bi-temporal image pairs as separate inputs: each image is processed individually by a shared-weight image encoder to extract spatial features, followed by a dedicated change extractor to capture differences and a decoder to make predictions. (b) Our proposed paradigm rethinks the change detection and captioning tasks from a video modeling perspective. By incorporating a learnable perception frame between the bi-temporal images, a video encoder facilitates direct interaction between the perception frame and the images to extract differences, eliminating the need for intricate change extractors and providing a unified framework for multiple tasks.
  • Figure 3: Overall architectures of Change3D for Binary Change Detection, Semantic Change Detection, Building Damage Assessment, and Change Captioning. (a) Binary change detection necessitates acquiring a feature to represent changed targets, thus a perception frame is incorporated for sensing. (b) Semantic change detection involves representing semantic changes in $T_1$ and $T_2$ alongside binary changes. To accomplish this, three perception frames are integrated to facilitate semantic change learning. (c) Building damage assessment entails expressing two perception features for building localization and damage classification. Therefore, two perception frames are inserted to capture building damage. (d) Change captioning involves generating a feature that represents the altered content, thus incorporating a perception frame for interpreting content changes.
  • Figure 4: Visualization of bi-temporal features $F_1$, $F_2$, and extracted changes $F_C$. Our method directly focuses on changes during video encoding without intricate change extractors. The color bar on the right shows how colors map to attention values.
  • Figure 5: Pre-training data size vs. performance on the xBD dataset.
  • ...and 6 more figures