CC-Diff: Enhancing Contextual Coherence in Remote Sensing Image Synthesis

Mu Zhang, Yunfan Liu, Yue Liu, Yuzhong Zhao, Qixiang Ye

TL;DR

CC-Diff addresses foreground-background contextual coherence in remote sensing (RS) image synthesis with a diffusion-based framework. A Dual Re-sampler extracts foreground (FG) and background (BG) semantics, with a Context Bridge that feeds foreground context into background feature extraction. A Conditional Generation Module (CGM) then renders FG in parallel and blends the BG in an FG-aware manner to improve plausibility. Empirically, CC-Diff achieves state-of-the-art fidelity and faithfulness on RS datasets and generalizes to natural images. Used for data augmentation, it also improves detection accuracy (+1.83 mAP on DOTA and +2.25 mAP on COCO). It additionally supports LLM-guided layouts, which further boost performance and diversity.

Abstract

Existing image synthesis methods for natural scenes focus primarily on foreground control, often reducing the background to simplistic textures. Consequently, these approaches tend to overlook the intrinsic correlation between foreground and background, which may lead to incoherent and unrealistic synthesis results in remote sensing (RS) scenarios. In this paper, we introduce CC-Diff, a $\underline{\textbf{Diff}}$usion Model-based approach for RS image generation with enhanced $\underline{\textbf{C}}$ontext $\underline{\textbf{C}}$oherence. Specifically, we propose a novel Dual Re-sampler for feature extraction, with a built-in `Context Bridge' to explicitly capture the intricate interdependency between foreground and background. Moreover, we reinforce their connection by employing a foreground-aware attention mechanism during the generation of background features, thereby enhancing the plausibility of the synthesized context. Extensive experiments show that CC-Diff outperforms state-of-the-art methods across critical quality metrics, excelling in the RS domain and effectively generalizing to natural images. Remarkably, CC-Diff also shows high trainability, boosting detection accuracy by 1.83 mAP on DOTA and 2.25 mAP on the COCO benchmark.
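
The Context Bridge idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`cross_attend`, `dual_resample`), the single-head attention without learned projections, and the choice to pass the context to the BG branch as an extra key/value token are all assumptions made for illustration. The sketch only shows how a context query that first reads FG features can make the BG embedding depend on the foreground.

```python
import numpy as np

def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def dual_resample(fg_feats, bg_feats, q_fg, q_bg, q_ctx):
    """Hypothetical dual re-sampler with a context bridge (illustrative only).

    fg_feats: (n_fg_tokens, d) foreground features
    bg_feats: (n_bg_tokens, d) background features
    q_fg, q_bg, q_ctx: query tokens of shape (*, d)
    """
    n_fg = len(q_fg)
    # FG branch: the object queries and the context query both read FG features.
    fg_out = cross_attend(np.vstack([q_fg, q_ctx]), fg_feats)
    r_fg, ctx = fg_out[:n_fg], fg_out[n_fg:]
    # Context bridge (assumed form): the FG-informed context tokens are
    # appended to the BG features, so BG queries can attend to them.
    r_bg = cross_attend(q_bg, np.vstack([bg_feats, ctx]))
    return r_fg, r_bg
```

In this sketch the FG embedding `r_fg` never sees the background, while `r_bg` changes whenever the foreground features change, which is the one-directional FG-to-BG association the abstract describes.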

Paper Structure

This paper contains 27 sections, 7 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Comparison of (a) parallel and (b) conditional generation pipelines. The conditioning mechanism (denoted by red arrows) enhances the contextual coherence of generation results.
  • Figure 2: Illustration of contextual incoherencies in RS images synthesized by MIGC (zhou2024migc). Layouts are shown in the top-right corner, object classes are labeled below in quotation marks, and incoherencies are highlighted with dashed yellow boxes.
  • Figure 3: The framework of CC-Diff. The Dual Re-sampler extracts condition embeddings from user inputs (bounding boxes, class labels, and descriptions), guiding the Conditional Generation Module (CGM) to produce contextually coherent outputs.
  • Figure 4: The architecture of Dual Re-sampler. The context query $\mathbf{q^{ctx}}$ obtains contextual semantics of FG objects from the FG Re-sampler, then incorporates them into BG feature extraction within the BG Re-sampler, thereby establishing the FG-BG association.
  • Figure 5: The architecture of Conditional Generation Module (CGM). The BG feature is rendered using the fused FG representation $\mathbf{R^{fused}}$, ensuring FG-awareness throughout generation.
  • ...and 8 more figures
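
As a rough illustration of the FG-aware background rendering summarized in the Figure 5 caption, the following NumPy sketch composes one generation step. It is a simplification under stated assumptions, not the paper's CGM: the names (`render_fg`, `render_bg`, `cgm_step`), the use of binary box masks over flattened spatial tokens, the averaging of overlapping instances, and the plain attention without learned projections are all hypothetical choices.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def render_fg(inst_embeds, masks):
    """Render each instance inside its own mask in parallel, then fuse.

    inst_embeds: (K, d) one embedding per FG instance
    masks:       (K, N) binary masks over N flattened spatial tokens
    Returns the fused FG map (N, d) and a boolean FG-region indicator (N,).
    """
    per_inst = masks[:, :, None] * inst_embeds[:, None, :]  # (K, N, d)
    cover = masks.sum(axis=0)                               # instances per token
    # Assumed fusion rule: average the instances that overlap on a token.
    fused = per_inst.sum(axis=0) / np.maximum(cover, 1)[:, None]
    return fused, cover > 0

def render_bg(latent, bg_embeds, fg_tokens):
    """BG rendering that attends to both BG embeddings and fused FG tokens."""
    keys = np.vstack([bg_embeds, fg_tokens])
    attn = softmax(latent @ keys.T / np.sqrt(latent.shape[-1]))
    return attn @ keys

def cgm_step(latent, inst_embeds, masks, bg_embeds):
    r_fused, fg_region = render_fg(inst_embeds, masks)
    bg = render_bg(latent, bg_embeds, r_fused[fg_region])
    # Composite: FG features inside the boxes, FG-aware BG elsewhere.
    return np.where(fg_region[:, None], r_fused, bg)
```

Because the fused FG tokens are among the keys in `render_bg`, background tokens outside every box still change when an instance embedding changes, which is the FG-awareness the caption refers to.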