Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All

Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, Ismail B. Tutar

TL;DR

Diffuse to Choose (DTC) advances Virtual Try-All (Vit-All) by embedding fine-grained reference details into a latent diffusion inpainting framework via a secondary U-Net and FiLM-based fusion, enabling zero-shot, high-fidelity insertion of products into in-the-wild scenes. The method constructs a pixel-level hint from the reference image, aligns it with the latent representation through an adapter, and fuses the resulting features with those of the main U-Net, aided by a perceptual loss that helps preserve color and texture. Empirical results show that DTC surpasses image-conditioned Paint by Example (PBE) variants and matches or approaches few-shot methods like DreamPaint in fidelity and integration, while delivering near-real-time inference speeds on standard GPUs. This approach broadens the practical applicability of Vit-All, supporting semantic edits that preserve detail, robust in-the-wild performance, and iterative, user-guided inpainting workflows.

Abstract

As online shopping grows, the ability for buyers to virtually visualize products in their own settings, a phenomenon we define as "Virtual Try-All", has become crucial. Recent diffusion models inherently contain a world model, rendering them suitable for this task within an inpainting context. However, traditional image-conditioned diffusion models often fail to capture the fine-grained details of products. In contrast, personalization-driven models such as DreamPaint are good at preserving the item's details, but they are not optimized for real-time applications. We present "Diffuse to Choose," a novel diffusion-based image-conditioned inpainting model that efficiently balances fast inference with the retention of high-fidelity details in a given reference item while ensuring accurate semantic manipulations in the given scene content. Our approach is based on incorporating fine-grained features from the reference image directly into the latent feature maps of the main diffusion model, along with a perceptual loss to further preserve the reference item's details. We conduct extensive testing on both in-house and publicly available datasets, and show that Diffuse to Choose is superior to existing zero-shot diffusion inpainting methods as well as few-shot diffusion personalization algorithms like DreamPaint.

Paper Structure (19 sections, 1 equation, 18 figures, 3 tables)

This paper contains 19 sections, 1 equation, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Diffuse to Choose (DTC) allows users to virtually place any e-commerce item in any setting, ensuring detailed, semantically coherent blending with realistic lighting and shadows.
  • Figure 2: Pipeline of Diffuse to Choose. The process initiates with masking the source image, followed by inserting the reference image within the masked area. This pixel-level 'hint' is then adapted by a shallow CNN to align with the VAE output dimensions of the source image. Subsequently, a U-Net Encoder processes the adapted hint. Then, at each U-Net scale, a FiLM module affinely aligns the skip-connected features from the main U-Net Encoder and pixel-level features from the hint U-Net Encoder. Finally, these aligned feature maps, in conjunction with the main image conditioning, facilitate the inpainting of the masked region. Red indicates trainable modules and blue indicates frozen modules.
  • Figure 3: Pipeline of Paint by Example (Yang et al., 2023) for the Vit-All case. Red modules are trainable and blue modules are frozen. Orange pathways indicate skip connections between the encoder and the decoder.
  • Figure 4: The hint signal is stitched into a blank image within the masked region, then summed with the latent masked input before being fed into the Auxiliary U-Net.
  • Figure 5: DTC can handle a variety of e-commerce products and can generate images using in-the-wild images and references.
  • ...and 13 more figures
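
The captions of Figures 2 and 4 describe two steps that are easy to misread in prose: stitching the reference into a blank canvas inside the masked region, and FiLM-style affine fusion of hint features with main U-Net features. The following is a minimal NumPy sketch of both, not the authors' implementation. The function names `build_hint` and `film_fuse`, the rectangular bounding-box mask, the nearest-neighbour resize, and the 1x1 linear projections that produce the scale and shift are all assumptions made for illustration.

```python
import numpy as np


def build_hint(reference, mask):
    """Paste `reference` (H_r, W_r, C) into a blank canvas, inside the
    bounding box of the nonzero entries of `mask` (H, W).

    Assumption: the mask is treated as a rectangle and the reference is
    resized with nearest-neighbour sampling.
    """
    ys, xs = np.nonzero(mask)
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    box_h, box_w = bottom - top, right - left

    # Nearest-neighbour index maps from box coordinates to reference pixels.
    rows = np.arange(box_h) * reference.shape[0] // box_h
    cols = np.arange(box_w) * reference.shape[1] // box_w
    resized = reference[rows][:, cols]

    hint = np.zeros(mask.shape + (reference.shape[2],), dtype=reference.dtype)
    hint[top:bottom, left:right] = resized
    return hint


def film_fuse(main_feat, hint_feat, w_gamma, w_beta):
    """FiLM-style fusion: out = gamma * main_feat + beta.

    `main_feat` has shape (D, H, W) and `hint_feat` has shape (C, H, W).
    Assumption: gamma and beta are per-pixel linear projections of the hint
    features (equivalent to 1x1 convolutions with weights of shape (D, C)).
    """
    gamma = np.einsum("chw,dc->dhw", hint_feat, w_gamma)
    beta = np.einsum("chw,dc->dhw", hint_feat, w_beta)
    return gamma * main_feat + beta
```

In this sketch, a projection that yields `gamma = 1` and `beta = 0` leaves the main features unchanged, which is one way to see why an affine fusion can start close to the behaviour of the base inpainting model and then learn how much of the hint to inject at each scale.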