Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective

Duowang Zhu; Xiaohu Huang; Haiyan Huang; Hao Zhou; Zhenfeng Shao

TL;DR

Change3D reframes bi-temporal change detection and captioning as a video modeling problem: learnable perception frames are inserted between the two timepoints, so a lightweight video encoder can model inter-frame changes directly and no task-specific change extractor is needed. The method reports state-of-the-art performance on eight remote-sensing benchmarks covering binary change detection, semantic change detection, building damage assessment, and change captioning, while using only ~6%-13% of the parameters and ~8%-34% of the FLOPs of state-of-the-art methods. The work offers empirical and theoretical support for a unified, video-based paradigm that improves information flow and inter-frame modeling, with practical implications for efficient, multi-task remote sensing analysis. Detailed ablations on architectures, pre-training, and perception-frame insertion strategies are included to guide future research.

Abstract

In this paper, we present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. Recent methods have achieved remarkable success by regarding each pair of bi-temporal images as separate frames. They employ a shared-weight image encoder to extract spatial features and then use a change extractor to capture differences between the two images. However, image feature encoding, being a task-agnostic process, cannot attend to changed regions effectively. Furthermore, different change extractors designed for various change detection and captioning tasks make it difficult to have a unified framework. To tackle these challenges, Change3D regards the bi-temporal images as comprising two frames akin to a tiny video. By integrating learnable perception frames between the bi-temporal images, a video encoder enables the perception frames to interact with the images directly and perceive their differences. Therefore, we can get rid of the intricate change extractors, providing a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (including binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework can achieve superior performance with an ultra-light video model comprising only ~6%-13% of the parameters and ~8%-34% of the FLOPs compared to state-of-the-art methods. We hope that Change3D could be an alternative to 2D-based models and facilitate future research.
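The input construction described in the abstract can be illustrated with a minimal sketch. It only shows how the bi-temporal images and the perception frames might be stacked into a tiny video, and it is not the authors' implementation. The function names, the `(B, C, T, H, W)` layout, the number of perception frames, and the use of NumPy are all assumptions made for illustration. In a trained model the perception frames would be learnable parameters, and the video encoder itself is not shown.

```python
import numpy as np


def build_tiny_video(pre, post, perception):
    """Stack [pre, perception frames..., post] along a new time axis.

    pre, post:   arrays of shape (B, C, H, W), the bi-temporal images
    perception:  array of shape (K, C, H, W); learnable in a real model,
                 a plain array here (assumption for illustration)
    returns:     array of shape (B, C, K + 2, H, W)
    """
    batch = pre.shape[0]
    # Share the same K perception frames across every sample in the batch.
    shared = np.broadcast_to(perception, (batch,) + perception.shape)
    # Time order: pre-change image, perception frames, post-change image.
    frames = np.concatenate([pre[:, None], shared, post[:, None]], axis=1)
    # Move time behind channels: (B, T, C, H, W) -> (B, C, T, H, W).
    return frames.transpose(0, 2, 1, 3, 4)


def perception_slice(features, num_perception):
    """Select the time steps that correspond to the perception frames.

    Assumes the encoder keeps the temporal length unchanged (hypothetical).
    """
    return features[:, :, 1:1 + num_perception]


# Toy inputs: a batch of 2 RGB image pairs of size 8x8, one perception frame.
pre = np.zeros((2, 3, 8, 8), dtype=np.float32)
post = np.ones((2, 3, 8, 8), dtype=np.float32)
perception = np.full((1, 3, 8, 8), 0.5, dtype=np.float32)

video = build_tiny_video(pre, post, perception)
middle = perception_slice(video, num_perception=1)
```

In this sketch, a video encoder would consume `video`, and a task head would read the features at the perception-frame positions (here selected by the hypothetical `perception_slice`) in place of a separate change extractor.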
