InpaintDPO: Mitigating Spatial Relationship Hallucinations in Foreground-conditioned Inpainting via Diverse Preference Optimization

Qirui Li, Yizhe Tang, Ran Yi, Guangben Lu, Fangyuan Zou, Peng Shu, Huan Yu, Jie Jiang

TL;DR

Foreground-conditioned inpainting often suffers from Spatial Relationship Hallucinations between the foreground and generated background. The authors propose InpaintDPO, a Direct Preference Optimization framework augmented with MaskDPO to localize optimization to the background, CAPO to boost boundary coherence, and SCPO to capture shared spatial patterns, all trained on a dedicated spatial-rationality preference dataset. Empirical results show superior spatial rationality, foreground fidelity, and text alignment compared with strong baselines, along with a strong human-ELO ranking. The approach also demonstrates transferability to a separate editing model (Qwen-Image-Edit), supporting its general applicability to different diffusion-based architectures.

Abstract

Foreground-conditioned inpainting, which aims at generating a harmonious background for a given foreground subject based on the text prompt, is an important subfield in controllable image generation. A common challenge in current methods, however, is the occurrence of Spatial Relationship Hallucinations between the foreground subject and the generated background, including inappropriate scale, positional relationships, and viewpoints. Critically, the subjective nature of spatial rationality makes it challenging to quantify, hindering the use of traditional reward-based RLHF methods. To address this issue, we propose InpaintDPO, the first Direct Preference Optimization (DPO) based framework dedicated to spatial rationality in foreground-conditioned inpainting, ensuring plausible spatial relationships between foreground and background elements. To resolve the gradient conflicts in standard DPO caused by identical foreground in win-lose pairs, we propose MaskDPO, which confines preference optimization exclusively to the background to enhance background spatial relationships, while retaining the inpainting loss in the foreground region for robust foreground preservation. To enhance coherence at the foreground-background boundary, we propose Conditional Asymmetric Preference Optimization, which samples pairs with differentiated cropping operations and applies global preference optimization to promote contextual awareness and enhance boundary coherence. Finally, based on the observation that winning samples share a commonality in plausible spatial relationships, we propose Shared Commonality Preference Optimization to enhance the model's understanding of spatial commonality across high-quality winning samples, further promoting shared spatial rationality.
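
The MaskDPO idea described above (preference optimization restricted to the background, with an ordinary inpainting loss kept on the foreground) can be illustrated with a small NumPy sketch. This is not the paper's equation: it assumes a Diffusion-DPO-style objective built from noise-prediction errors, and the function names `masked_mean` and `mask_dpo_loss`, the weights `beta` and `lam`, and the choice to compute the foreground term on the winning sample are all hypothetical choices made for illustration.

```python
import numpy as np

def masked_mean(sq_err, mask):
    """Mean of per-pixel squared error over the pixels where mask == 1."""
    return float((sq_err * mask).sum() / max(mask.sum(), 1.0))

def mask_dpo_loss(eps_w, eps_l, pred_w, pred_l, ref_w, ref_l,
                  bg_mask, beta=1.0, lam=1.0):
    """Illustrative MaskDPO-style loss (assumed form, not the authors' code).

    eps_*  : true noise added to the winning (w) / losing (l) sample
    pred_* : noise predicted by the model being trained
    ref_*  : noise predicted by the frozen reference model
    bg_mask: 1 on background pixels, 0 on the (shared) foreground
    """
    fg_mask = 1.0 - bg_mask
    # Background-only error gap between trained model and reference.
    win_gap = (masked_mean((eps_w - pred_w) ** 2, bg_mask)
               - masked_mean((eps_w - ref_w) ** 2, bg_mask))
    lose_gap = (masked_mean((eps_l - pred_l) ** 2, bg_mask)
                - masked_mean((eps_l - ref_l) ** 2, bg_mask))
    # Preference term: -log sigmoid(-beta * (win_gap - lose_gap)).
    logit = -beta * (win_gap - lose_gap)
    pref_loss = float(np.logaddexp(0.0, -logit))
    # Plain denoising (inpainting) loss on the foreground region only.
    fg_loss = masked_mean((eps_w - pred_w) ** 2, fg_mask)
    return pref_loss + lam * fg_loss
```

Because the foreground is identical in the winning and losing images, masking it out of the preference term means those pixels contribute no contradictory "prefer" and "reject" signal. The separate foreground term keeps the model anchored to reproducing the subject.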

Paper Structure

This paper contains 24 sections, 21 equations, 20 figures, 6 tables.

Figures (20)

  • Figure 1: Our InpaintDPO generates high-quality images with plausible spatial relationships between foreground subject and background scene, while also achieving outstanding foreground consistency and text alignment.
  • Figure 2: Three categories of "Spatial Relationship Hallucinations" in foreground-conditioned inpainting, where previous methods [flux2024inpaint, alimama_FLUX, lu2025pinco] suffer from inappropriate spatial scale, spatial relationship, and viewpoint. In contrast, our method can generate spatially rational images without such hallucinations.
  • Figure 3: Overview of the InpaintDPO framework for foreground-conditioned inpainting. 1) Data Construction: a four-stage pipeline to build the high-contrast spatial rationality human preference dataset. 2) InpaintDPO training pipeline and three proposed preference optimization strategies: MaskDPO confines preference optimization to the background, while retaining the inpainting loss for foreground preservation; CAPO enhances foreground boundary coherence through global preference optimization via differentiated cropping operations; SCPO captures the shared commonality of plausible spatial relationships among winning samples by narrowing the implicit reward gap.
  • Figure 4: Comparison of Standard DPO and DPO with SCPO.
  • Figure 5: Qualitative comparison of existing inpainting or editing methods and our InpaintDPO. Our method generates plausible spatially rational images without hallucinations like inappropriate scale or position, with great prompt alignment and foreground consistency.
  • ...and 15 more figures