CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting

Chae-Yeon Heo, Yeong-Jun Cho

TL;DR

CSF-Net tackles large-mask inpainting by supplying semantic priors through an amodal completion model and fusing them with contextual features via a dual-encoder Swin Transformer to produce a semantic guidance image. This guidance reduces object hallucination and improves structural and semantic fidelity across diverse masks and datasets, while requiring no changes to existing inpainting architectures. The approach combines structure-aware candidate generation, transformer-based fusion, and hierarchical pixel selection with carefully designed losses to ensure cross-scale consistency. Empirical results on Places365 and COCOA show robust improvements over state-of-the-art baselines, highlighting the method's practicality and scalability for real-world inpainting tasks.

Abstract

In this paper, we propose a semantic-guided framework to address the challenging problem of large-mask image inpainting, where essential visual content is missing and contextual cues are limited. To compensate for the limited context, we leverage a pretrained Amodal Completion (AC) model to generate structure-aware candidates that serve as semantic priors for the missing regions. We introduce Context-Semantic Fusion Network (CSF-Net), a transformer-based fusion framework that fuses these candidates with contextual features to produce a semantic guidance image for image inpainting. This guidance improves inpainting quality by promoting structural accuracy and semantic consistency. CSF-Net can be seamlessly integrated into existing inpainting models without architectural changes and consistently enhances performance across diverse masking conditions. Extensive experiments on the Places365 and COCOA datasets demonstrate that CSF-Net effectively reduces object hallucination while enhancing visual realism and semantic alignment. The code for CSF-Net is available at https://github.com/chaeyeonheo/CSF-Net.

Paper Structure

This paper contains 19 sections, 13 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Large-mask inpainting comparison. Existing methods [Wang_2025_CVPR, Chen_2024_CVPR] often produce structural errors or object hallucinations in challenging cases. Incorporating our semantic guidance ($\mathbf{I}_{\text{guide}}$) yields more accurate and semantically coherent inpainting results.
  • Figure 2: Overview of how CSF-Net generates the semantic guidance image ($\mathbf{I}_{\text{guide}}$). This image incorporates object-level semantic priors and serves as an input to the inpainting model.
  • Figure 3: Overview of the proposed CSF-Net. (a) A pretrained amodal completion model generates multiple object completions, and context-inconsistent candidates are filtered out. (b) Dual Swin-Transformer encoders extract multi-scale features from the masked image and selected candidates, which are fused via a cross-attention fusion decoder. (c) Hierarchical pixel selection is performed using structural and perceptual scores to generate the final semantic guidance image $\mathbf{I}_{\text{guide}}$.
  • Figure 4: Hierarchical pixel selection in CSF-Net. (a) The Structure Score Network (SSN) and Perceptual Score Network (PSN) compute confidence scores at each scale using fused features and the masked input. Multi-scale consistency is enforced via learnable coefficients $\beta$. (b) At the finest scale, the highest-scoring candidate is selected for each pixel to form the semantic guidance image $\mathbf{I}_{\text{guide}}$.
  • Figure 5: Comprehensive qualitative comparison under different mask configurations on the Places365 [zhou2017places] dataset (evaluated on all three mask types) and the COCOA [zhu2017semantic] dataset (evaluated on Center Box 80% and RandomBrush 50–80%). Our CSF-Net consistently generates clearer and more coherent results than baseline methods across diverse scenes and mask types, effectively reducing object hallucination.
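
The per-pixel selection step described in Figure 4(b) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. The function name `select_guidance`, the nested-list image layout, the plain sum used to combine the structure and perceptual scores, and the choice to leave known pixels untouched are all assumptions made for this example. The captions do not specify how the two scores are combined or how the coefficients $\beta$ enter at the finest scale, so those parts are omitted.

```python
def select_guidance(masked_image, mask, candidates,
                    structure_scores, perceptual_scores):
    """Build a guidance image by picking, per masked pixel, the candidate
    with the highest combined score (illustrative sketch only).

    masked_image:      H x W nested list of pixel values (known context)
    mask:              H x W nested list, truthy where the pixel is missing
    candidates:        list of K images, each H x W
    structure_scores:  list of K score maps, each H x W (hypothetical SSN output)
    perceptual_scores: list of K score maps, each H x W (hypothetical PSN output)
    """
    height, width = len(mask), len(mask[0])
    # Start from a copy of the masked input so the caller's data is not mutated.
    guide = [row[:] for row in masked_image]

    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                # Assumption: pixels outside the mask keep their original value.
                continue
            # Assumption: the two confidence scores are combined by a plain sum.
            combined = [s[y][x] + p[y][x]
                        for s, p in zip(structure_scores, perceptual_scores)]
            # Index of the highest-scoring candidate at this pixel.
            best = max(range(len(candidates)), key=combined.__getitem__)
            guide[y][x] = candidates[best][y][x]
    return guide


# Toy usage: a 2 x 2 image with two missing pixels and two candidates.
masked_image = [[5, 0], [0, 7]]
mask = [[0, 1], [1, 0]]
candidates = [[[1, 1], [1, 1]], [[2, 2], [2, 2]]]
structure_scores = [[[0.0, 0.9], [0.1, 0.0]], [[0.0, 0.2], [0.3, 0.0]]]
perceptual_scores = [[[0.0, 0.1], [0.1, 0.0]], [[0.0, 0.3], [0.4, 0.0]]]

guide = select_guidance(masked_image, mask, candidates,
                        structure_scores, perceptual_scores)
```

In this toy case the first candidate scores highest at the top-right pixel and the second at the bottom-left pixel, so the guidance image mixes content from both candidates while the known pixels are unchanged.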