NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures

Yufan Wang; Sokratis Makrogiannis; Chandra Kambhamettu

TL;DR

This paper proposes NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. The design is intended to better tolerate residual co-registration noise, small object-level spatial shifts, and semantic ambiguity in bi-temporal imagery.

Abstract

State Space Models (SSMs) have recently gained traction in remote sensing change detection (CD) for their favorable scaling properties. In this paper, we explore the potential of modern convolutional and attention-based architectures as a competitive alternative. We propose NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. This design is intended to better tolerate residual co-registration noise and small object-level spatial shifts, as well as semantic ambiguity in bi-temporal imagery. Experiments on LEVIR-CD, WHU-CD, and CDD datasets show that our method achieves the best results among the evaluated methods, improving over recent Mamba-based baselines in both F1 score and IoU. Furthermore, despite a larger parameter count, our model maintains inference latency comparable to SSM-based approaches, suggesting it is practical for high-resolution change detection tasks.
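The abstract reports results as F1 score and IoU. As a reminder of how these two metrics relate for a binary change map, here is a minimal sketch using the standard definitions over the change class. The function name `change_metrics` and the flat 0/1 list inputs are illustrative assumptions, not the authors' evaluation code.

```python
def change_metrics(pred, target):
    """Return (F1, IoU) for the change class (label 1).

    pred and target are flat sequences of 0/1 pixel labels of equal
    length. This is a hypothetical helper for illustration only.
    """
    tp = fp = fn = 0
    for p, t in zip(pred, target):
        if p == 1 and t == 1:
            tp += 1          # changed pixel correctly detected
        elif p == 1 and t == 0:
            fp += 1          # false alarm
        elif p == 0 and t == 1:
            fn += 1          # missed change
    f1_denom = 2 * tp + fp + fn
    iou_denom = tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    iou = tp / iou_denom if iou_denom else 0.0
    return f1, iou


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0]
    target = [1, 0, 0, 1, 1, 0]
    f1, iou = change_metrics(pred, target)
    print(f"F1={f1:.4f} IoU={iou:.4f}")
```

Because both metrics are built from the same three counts, they are monotonically related: IoU = F1 / (2 - F1). A method that ranks higher on one therefore ranks higher on the other when both are computed on the same set of pixels.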

Paper Structure (17 sections, 5 equations, 3 figures, 7 tables)

Introduction
Related Work
Deep Learning-Based Change Detection
State Space Models and Mamba in Remote Sensing
Foundation Models and Universal Segmentation
Method
Overview
Siamese DINOv3 Backbone
Spatiotemporal Feature Interaction
Mask2Former Decoder and Hybrid Loss
Experiments
Datasets
Experimental Setup
Results
Conclusion
...and 2 more sections

Figures (3)

Figure 1: Overview of the proposed change detection network. It consists of a weight-sharing DINOv3 [simeoni_dinov3_2025] backbone, Feature Rectify Modules (FRM) and Feature Fusion Modules (FFM) at multiple scales, and a Mask2Former head for final prediction. Query-level class logits and masks are aggregated into dense predictions via log-sum-exp over queries.
Figure 2: Qualitative results on three public datasets. White represents true positives, black represents true negatives, green represents false positives and red represents false negatives.
Figure 3: Validation IoU curves for M-CD [paranjape_mamba-based_2024] and the proposed method. The vertical axis shows IoU for the change class.
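The Figure 1 caption states that query-level class logits and masks are aggregated into dense predictions via log-sum-exp over queries. One plausible reading, sketched below, is the log-domain form of the usual Mask2Former semantic inference: for each class c and pixel p, the dense score is logsumexp over queries q of log_softmax(class_logits[q])[c] + log_sigmoid(mask_logits[q][p]). The exact formulation used in the paper is not given in this excerpt, so the function `aggregate_queries`, the tensor layouts, and the omission of a "no object" class are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -softplus(-x)."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def aggregate_queries(class_logits, mask_logits):
    """Combine per-query outputs into dense per-class log-scores.

    class_logits: Q x C nested list (one class-logit vector per query).
    mask_logits:  Q x P nested list (one flattened mask per query).
    Returns a C x P nested list of log-scores (assumed layout).
    """
    num_queries = len(class_logits)
    num_classes = len(class_logits[0])
    num_pixels = len(mask_logits[0])
    log_cls = [log_softmax(row) for row in class_logits]
    dense = []
    for c in range(num_classes):
        row = []
        for p in range(num_pixels):
            terms = [
                log_cls[q][c] + log_sigmoid(mask_logits[q][p])
                for q in range(num_queries)
            ]
            row.append(logsumexp(terms))
        dense.append(row)
    return dense


if __name__ == "__main__":
    # Two queries, two classes (0 = no change, 1 = change), three pixels.
    cls = [[2.0, -1.0], [-0.5, 1.5]]
    masks = [[3.0, -2.0, 0.0], [-1.0, 4.0, 0.5]]
    dense = aggregate_queries(cls, masks)
    labels = [max(range(2), key=lambda c: dense[c][p]) for p in range(3)]
    print(labels)
```

Working in the log domain avoids underflow when many queries contribute near-zero probability to a pixel, while remaining equivalent to summing softmax(class) times sigmoid(mask) over queries.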
