NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures

Yufan Wang, Sokratis Makrogiannis, Chandra Kambhamettu

TL;DR

This paper proposes NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder to better tolerate residual co-registration noise and small object-level spatial shifts, as well as semantic ambiguity in bi-temporal imagery.

Abstract

State Space Models (SSMs) have recently gained traction in remote sensing change detection (CD) for their favorable scaling properties. In this paper, we explore the potential of modern convolutional and attention-based architectures as a competitive alternative. We propose NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. This design is intended to better tolerate residual co-registration noise and small object-level spatial shifts, as well as semantic ambiguity in bi-temporal imagery. Experiments on LEVIR-CD, WHU-CD, and CDD datasets show that our method achieves the best results among the evaluated methods, improving over recent Mamba-based baselines in both F1 score and IoU. Furthermore, despite a larger parameter count, our model maintains inference latency comparable to SSM-based approaches, suggesting it is practical for high-resolution change detection tasks.

Paper Structure (17 sections, 5 equations, 3 figures, 7 tables)

This paper contains 17 sections, 5 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Overview of the proposed change detection network. It consists of a weight-sharing DINOv3 [simeoni_dinov3_2025] backbone, Feature Rectify Modules (FRM) and Feature Fusion Modules (FFM) at multiple scales, and a Mask2Former head for final prediction. Query-level class logits and masks are aggregated into dense predictions via log-sum-exp over queries.
  • Figure 2: Qualitative results on three public datasets. White represents true positives, black represents true negatives, green represents false positives and red represents false negatives.
  • Figure 3: Validation IoU curves for M-CD [paranjape_mamba-based_2024] and the proposed method. The vertical axis shows IoU for the change class.
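
The Figure 1 caption mentions that query-level class logits and masks are aggregated into dense predictions via log-sum-exp over queries. The exact formulation is not given here, so the following is only a minimal NumPy sketch of one plausible reading: per-query class log-probabilities and per-pixel mask log-probabilities are added, then reduced over the query axis with a numerically stable log-sum-exp. The function names, tensor shapes, the two-class layout (index 1 = change), and the omission of a "no object" class are all assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Stable log-softmax: subtract the max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def log_sigmoid(x):
    # log(sigmoid(x)) = -log(1 + exp(-x)), computed stably via logaddexp.
    return -np.logaddexp(0.0, -x)

def logsumexp(x, axis):
    # Stable log-sum-exp reduction along one axis.
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))

def aggregate_queries(class_logits, mask_logits):
    """Hypothetical aggregation: (Q, C) class logits and (Q, H, W) mask
    logits -> (C, H, W) dense log-scores, reduced over the Q queries."""
    log_p_class = log_softmax(class_logits, axis=1)   # (Q, C)
    log_p_mask = log_sigmoid(mask_logits)             # (Q, H, W)
    # Broadcast to (Q, C, H, W): log p(class | query) + log p(pixel in mask).
    joint = log_p_class[:, :, None, None] + log_p_mask[:, None, :, :]
    return logsumexp(joint, axis=0)                   # (C, H, W)

# Toy example: 8 queries, 2 classes (assumed: 0 = no change, 1 = change).
rng = np.random.default_rng(0)
class_logits = rng.normal(size=(8, 2))
mask_logits = rng.normal(size=(8, 4, 4))
dense = aggregate_queries(class_logits, mask_logits)
change_map = dense.argmax(axis=0)  # per-pixel predicted class index
```

Under these assumptions, exponentiating `dense` gives the sum over queries of class probability times mask probability, which is the usual way per-query predictions are combined into a semantic map; working in log space mainly avoids underflow when many queries contribute small products.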