RSEdit: Text-Guided Image Editing for Remote Sensing

Chen Zhenyuan, Zhang Zechuan, Zhang Feng

Abstract

General-domain text-guided image editors achieve strong photorealism, but on remote sensing (RS) imagery they introduce artifacts, hallucinate objects, and break orthographic constraints. We trace this gap to two high-level causes: (i) limited RS world knowledge in pretrained models, and (ii) conditioning schemes that are misaligned with the bi-temporal structure and spatial priors of Earth observation data. We present RSEdit, a unified framework that adapts pretrained text-to-image diffusion models (both U-Net and DiT) into instruction-following RS editors via channel concatenation and in-context token concatenation, respectively. Trained on over 60,000 semantically rich bi-temporal RS image pairs, RSEdit learns precise, physically coherent edits while preserving geospatial content. Experiments show clear gains over general-domain and commercial baselines and strong generalization across diverse scenarios, including disaster impacts, urban growth, and seasonal shifts, positioning RSEdit as a robust data engine for downstream analysis. We will release code, pretrained models, evaluation protocols, training logs, and generated results for full reproducibility. Code: https://github.com/Bili-Sakura/RSEdit-Preview
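
The two conditioning schemes named in the abstract differ mainly in which axis the source image is attached along. The following is a minimal NumPy sketch of that difference, not the RSEdit implementation: the latent shape `(4, 32, 32)`, the patch size of 2, and the helper names `channel_concat`, `patchify`, and `token_concat` are all illustrative assumptions.

```python
import numpy as np


def channel_concat(noisy_latent, source_latent):
    """U-Net-style conditioning (assumed layout: channels first).

    Stacks the source latent onto the noisy latent along the channel axis:
    (C, H, W) + (C, H, W) -> (2C, H, W). Spatial positions stay aligned.
    """
    assert noisy_latent.shape[1:] == source_latent.shape[1:]
    return np.concatenate([noisy_latent, source_latent], axis=0)


def patchify(latent, patch):
    """Split a (C, H, W) latent into non-overlapping patch tokens.

    Returns an array of shape (N, C * patch * patch), N = (H/patch) * (W/patch).
    """
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)


def token_concat(noisy_latent, source_latent, patch=2):
    """DiT-style in-context conditioning.

    Appends the source tokens to the noisy tokens along the sequence axis:
    (N, D) + (N, D) -> (2N, D). Attention can then relate the two images.
    """
    return np.concatenate(
        [patchify(noisy_latent, patch), patchify(source_latent, patch)], axis=0
    )


# Hypothetical latents standing in for the noisy target and the pre-event image.
rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 32, 32))
source = rng.standard_normal((4, 32, 32))

unet_in = channel_concat(noisy, source)
dit_in = token_concat(noisy, source, patch=2)
print(unet_in.shape, dit_in.shape)
```

Under these assumptions, channel concatenation doubles the channel count while leaving the spatial grid unchanged, so the network consuming it must accept the wider input. Token concatenation doubles the sequence length while leaving the token width unchanged.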

Paper Structure (27 sections, 8 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Overview of the RSEdit framework. We propose a universal adaptation strategy that aligns the conditioning mechanism with the architecture's inductive bias. For U-Net backbones (left), we use channel concatenation to leverage convolutional priors. For DiT backbones (right), we use token concatenation to exploit the in-context learning capabilities of transformers.
  • Figure 2: Proposed change-centric evaluation metric using building damage assessment. We leverage a pre-trained ChangeStar model to produce semantic masks for the edited image and compare them to ground-truth damage masks to quantify editing accuracy.
  • Figure 3: Qualitative comparison of RSEdit against general-domain baselines on disaster simulation scenarios. Columns show, from left to right: Edit Prompt, Input (pre-event), Reference (post-event), RSEdit-UNet (Ours), RSEdit-DiT (Ours), InstructPix2Pix, and UltraEdit. RSEdit variants realistically simulate disaster impacts with high quantitative accuracy (e.g., in Storm: RSEdit-UNet 91.23% $F1_{\text{Dam}}$, RSEdit-DiT 91.47%, vs. InstructPix2Pix 0%), whereas baselines fail to follow instructions or introduce artifacts. Note: $F1_{\text{Dam}}$ denotes $F1_{\text{weighted}}$ here.
  • Figure 4: Qualitative results on SECOND-CC and LEVIR-CC benchmarks for out-of-domain generalization. Columns show, from left to right: Edit Prompt, Input (pre-event), Reference (post-event), RSEdit-UNet (Ours), RSEdit-DiT (Ours), InstructPix2Pix, and UltraEdit. Our model trained on RSCC generalizes well to unseen benchmarks without fine-tuning, accurately following complex change descriptions while maintaining geospatial coherence.
  • Figure 5: Comparison of qualitative results for the Guatemala Volcano scenario. See full prompt in appendix.
  • ...and 5 more figures
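
The change-centric metric described in the Figure 2 caption compares a predicted semantic mask for the edited image against a ground-truth damage mask. One way to read $F1_{\text{weighted}}$ is as a per-class F1 averaged with weights proportional to each class's ground-truth pixel count. The sketch below implements that reading. The weighting scheme, the integer class labels, and the function name `weighted_f1` are assumptions made for illustration; the paper's evaluation protocol may define the metric differently.

```python
import numpy as np


def weighted_f1(pred_mask, true_mask, classes):
    """Support-weighted F1 between two integer label masks.

    `classes` lists the label values to score (hypothetical damage levels);
    each class's F1 is weighted by its pixel count in `true_mask`.
    """
    pred = np.asarray(pred_mask).ravel()
    true = np.asarray(true_mask).ravel()
    assert pred.shape == true.shape

    weighted_sum = 0.0
    total_support = 0
    for c in classes:
        tp = int(np.sum((pred == c) & (true == c)))
        fp = int(np.sum((pred == c) & (true != c)))
        fn = int(np.sum((pred != c) & (true == c)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        support = tp + fn  # ground-truth pixels of class c
        weighted_sum += support * f1
        total_support += support

    return weighted_sum / total_support if total_support > 0 else 0.0


# Toy 2x3 masks with three hypothetical classes (0, 1, 2).
true_mask = [[0, 0, 1], [1, 2, 2]]
pred_mask = [[0, 1, 1], [1, 2, 0]]
score = weighted_f1(pred_mask, true_mask, classes=(0, 1, 2))
print(round(score, 4))
```

Whether a background class belongs in `classes` is left to the caller, since the caption does not say how background pixels are treated.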