HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

Yichen Liu; Donghao Zhou; Jie Wang; Xin Gao; Guisheng Liu; Jiatong Li; Quanwei Zhang; Qiang Lyu; Lanqing Guo; Shilei Wen; Weiqiang Wang; Pheng-Ann Heng

TL;DR

HiFi-Inpaint is a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. It achieves state-of-the-art performance and delivers detail-preserving results.

Abstract

Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge in generating such images lies in preserving product details with high fidelity. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the difficulty current models have in focusing on product detail preservation, and the inability of coarse supervision to provide precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesized data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.
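The abstract describes DAL only as pixel-level supervision on high-frequency maps, so the following is a minimal illustrative sketch of that general idea and not the paper's formulation. It assumes a plain FFT high-pass filter for the high-frequency map and a masked mean absolute error for the loss; the function names `high_frequency_map` and `detail_aware_loss`, the `cutoff` parameter, and the use of grayscale NumPy arrays are all hypothetical choices made for this example.

```python
import numpy as np


def high_frequency_map(image, cutoff=0.1):
    """Return the high-frequency component of a 2-D grayscale image.

    Assumption: a simple FFT high-pass filter stands in for the paper's
    extraction algorithm, which is not specified in this summary.
    """
    h, w = image.shape
    # Frequency coordinates in cycles per pixel, in the range [-0.5, 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    keep = radius > cutoff  # zero out low frequencies, including the DC term
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * keep))


def detail_aware_loss(pred, target, mask, cutoff=0.1):
    """Masked mean absolute error between high-frequency maps (hypothetical)."""
    diff = np.abs(
        high_frequency_map(pred, cutoff) - high_frequency_map(target, cutoff)
    )
    # Average only over the inpainting region; guard against an empty mask.
    return float((diff * mask).sum() / max(mask.sum(), 1.0))
```

In this sketch a flat image has an all-zero high-frequency map, so the loss only penalizes mismatches in fine structure such as text or logo edges inside the masked region, which is the kind of signal a coarse pixel loss tends to under-weight.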

Paper Structure (24 sections, 5 equations, 12 figures, 4 tables, 1 algorithm)


Introduction
Related Works
Text-to-Image Generation
Image Inpainting
Methodology
Overview
Dataset Construction: HP-Image-40K
High-Frequency Map-Guided DiT
Detail-Aware Training Strategy
Experiments
Setups
Quantitative Comparison
Qualitative Comparison
User Study
Ablation Analysis
...and 9 more sections

Figures (12)

Figure 1: HiFi-Inpaint enables high-fidelity reference-based inpainting. Our HiFi-Inpaint can seamlessly integrate product reference images into masked human images, generating high-quality human-product images with high-fidelity detail preservation. To avoid potential privacy and copyright concerns, we use AI-generated products and humans for presentation purposes in this paper. Zoom in for better view.
Figure 1: High-Frequency Extraction
Figure 2: Overview of HiFi-Inpaint. HiFi-Inpaint is a high-fidelity reference-based inpainting framework tailored for generating human-product images. To support model training, we construct HP-Image-40K, a large-scale dataset of human-product images, collected through a self-synthesis pipeline combined with automated filtering to ensure high-quality samples (see Dataset Construction: HP-Image-40K). Furthermore, we introduce two key techniques: (i) Shared Enhancement Attention (SEA), designed to refine fine-grained product features by leveraging high-frequency map tokens within dual-stream visual DiT blocks (see High-Frequency Map-Guided DiT), and (ii) Detail-Aware Loss (DAL), developed to enforce precise pixel-level supervision by utilizing high-frequency information, enabling the reconstruction of intricate product and human details (see Detail-Aware Training Strategy).
Figure 3: Comparison with the Canny algorithm. While Canny detects all edges, leading to significant background clutter (red frame), the adopted algorithm highlights key elements like text and logos (blue frame), by being responsive to specific frequencies.
Figure 4: Comparison with fixed weighting of SEA. Adopting a learnable weighting factor produces more harmonious and realistic results, whereas using a fixed one often leads to visual artifacts and conflicts across the inpainting region.
...and 7 more figures
