Transformer-based Multimodal Change Detection with Multitask Consistency Constraints

Biyuan Liu, Huaixin Chen, Kun Li, Michael Ying Yang

TL;DR

This work addresses the gap in beyond-2D Earth observation change detection by combining pre-event DSM data with post-event optical imagery in a Transformer-based multimodal framework (MMCD). It introduces a multitask consistency constraint that aligns height change (regression) with semantic change (classification) via a pseudo-change map derived from height, implemented through a soft-thresholding mechanism and a dedicated consistency loss. The Hi-BCD dataset provides high-resolution, cross-dimensional DSM-to-image pairs across three Dutch cities for benchmarking simultaneous 2D semantic and 3D height change detection. Empirical results show that the proposed consistency strategy improves both semantic and height-change performance while keeping model complexity lower than strong baselines, and that the strategy can be transferred to other methods. Overall, the work advances beyond-2D change detection and offers a practical dataset and method for cross-modal urban change analysis, with potential for broader multimodal applications.

Abstract

Change detection plays a fundamental role in Earth observation for analyzing changes over time. However, recent studies have largely neglected multimodal data, which offers significant practical and technical advantages over single-modal approaches. This research focuses on leveraging pre-event digital surface model (DSM) data and post-event digital aerial images captured at different times for detecting change beyond 2D. We observe that current change detection methods struggle with multitask conflicts between the semantic and height change detection tasks. To address this challenge, we propose an efficient Transformer-based network that learns a shared representation between cross-dimensional inputs through cross-attention. It adopts a consistency constraint to establish the multimodal relationship. First, pseudo-changes are derived by thresholding the height change. Then, the $L2$ distance between semantic changes and pseudo-changes within their overlapping regions is minimized. This explicitly endows height change detection (a regression task) and semantic change detection (a classification task) with representation consistency. A DSM-to-image multimodal dataset encompassing three cities in the Netherlands was constructed. It lays a new foundation for beyond-2D change detection from cross-dimensional inputs. Compared to five state-of-the-art change detection methods, our model demonstrates consistent multitask superiority in both semantic and height change detection. Furthermore, the consistency strategy can be seamlessly adapted to other methods, yielding promising improvements.
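
The consistency constraint described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the sigmoid form of the soft threshold, the function names `soft_threshold` and `consistency_loss`, the parameters `threshold` and `sharpness`, and the definition of the "overlapping region" as pixels where both maps exceed 0.5 are all hypothetical choices made for this example.

```python
import math


def soft_threshold(height_change, threshold=1.0, sharpness=5.0):
    """Map a signed height change to a pseudo-change score in (0, 1).

    Assumption: a sigmoid centred on `threshold` stands in for the
    soft-thresholding step; the paper's exact form may differ.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (abs(height_change) - threshold)))


def consistency_loss(semantic_prob, height_change, threshold=1.0, sharpness=5.0):
    """Mean squared (L2) distance between semantic change probabilities and
    pseudo-changes, averaged over the overlapping region.

    Assumption: the overlapping region is the set of pixels where both the
    semantic probability and the pseudo-change score exceed 0.5.
    Inputs are 2D lists of equal shape.
    """
    total, count = 0.0, 0
    for sem_row, height_row in zip(semantic_prob, height_change):
        for prob, delta_h in zip(sem_row, height_row):
            pseudo = soft_threshold(delta_h, threshold, sharpness)
            if prob > 0.5 and pseudo > 0.5:  # pixel lies in the overlap
                total += (prob - pseudo) ** 2
                count += 1
    # With no overlapping pixels there is nothing to constrain.
    return total / count if count else 0.0


# Toy 2x2 example: the left column changed in height, the right did not.
semantic = [[0.9, 0.1], [0.8, 0.2]]
height = [[3.0, 0.0], [2.0, 0.1]]
loss = consistency_loss(semantic, height)
```

In a training setup, a term like this would be added to the classification and regression losses with some weight, so that the semantic branch and the height branch are pushed toward agreeing on where change occurred.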

Paper Structure (16 sections, 10 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: The conceptual pipeline showing how multimodal image and DSM data are utilized for detecting height and semantic changes simultaneously.
  • Figure 2: Performance of semantic (left) and height (right) change detection under single-task and multitask training; the difference implies multitask conflicts between 2D semantic and height change detection.
  • Figure 3: Our Transformer-based multimodal change detection pipeline, named MMCD. It consists of a pyramid backbone with four Transformer layers, the cross-modal fusion module (CFM), and a multi-layer perceptron (MLP) decoder. The multitask consistency acts as an explicit constraint for enhancing multimodal correlation.
  • Figure 4: The structure of the feature fusion module and decoder in our method.
  • Figure 5: The network variations of our method, from left to right: the semantic change detection branch only, the height change detection branch only, the multitask branch, and the multitask branch with the consistency constraint.
  • ...and 12 more figures