Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, Ismail B. Tutar
TL;DR
Diffuse to Choose (DTC) advances Virtual Try-All (Vit-All) by embedding fine-grained reference details into a latent diffusion inpainting framework via a secondary U-Net and FiLM-based fusion, enabling zero-shot, high-fidelity insertion of products into in-the-wild scenes. The method constructs a pixel-level hint from the reference image, aligns it with the latent representations through an adapter, and fuses it into the main U-Net, aided by a perceptual loss that preserves color and texture. Empirical results show that DTC surpasses image-conditioned Paint by Example (PBE) variants and matches or approaches few-shot methods such as DreamPaint in fidelity and integration, while running at near-real-time inference speeds on standard GPUs. This broadens the practical applicability of Vit-All: semantic edits that preserve item detail, robust performance on in-the-wild images, and iterative, user-guided inpainting workflows.
Abstract
As online shopping grows, the ability for buyers to virtually visualize products in their own settings, a phenomenon we define as "Virtual Try-All", has become crucial. Recent diffusion models inherently contain a world model, which makes them suitable for this task within an inpainting context. However, traditional image-conditioned diffusion models often fail to capture the fine-grained details of products. In contrast, personalization-driven models such as DreamPaint are good at preserving an item's details but are not optimized for real-time applications. We present "Diffuse to Choose," a novel diffusion-based image-conditioned inpainting model that balances fast inference with the retention of high-fidelity details of a given reference item, while ensuring accurate semantic manipulations in the given scene content. Our approach incorporates fine-grained features from the reference image directly into the latent feature maps of the main diffusion model, along with a perceptual loss that further preserves the reference item's details. We conduct extensive testing on both in-house and publicly available datasets, and show that Diffuse to Choose is superior to existing zero-shot diffusion inpainting methods as well as few-shot diffusion personalization algorithms like DreamPaint.
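To make the idea of injecting reference features into the main model's feature maps more concrete, the following is a minimal sketch of FiLM-style (feature-wise linear modulation) fusion, the mechanism named in the TL;DR. It is an illustration under stated assumptions, not the authors' implementation: the function name `film_fuse`, the spatial mean pooling of the hint features, and the projection matrices `w_gamma` and `w_beta` are all hypothetical, and NumPy arrays stand in for U-Net activations.

```python
import numpy as np


def film_fuse(main_feat, hint_feat, w_gamma, w_beta):
    """FiLM-style fusion: scale and shift main features per channel.

    main_feat, hint_feat: arrays of shape (C, H, W).
    w_gamma, w_beta: (C, C) projection matrices (hypothetical stand-ins
    for learned weights; not taken from the paper).
    """
    # Assumption: summarize the hint features with a spatial mean,
    # giving one value per channel.
    pooled = hint_feat.mean(axis=(1, 2))  # shape (C,)

    # Predict a per-channel scale (gamma) and shift (beta) from the hint.
    gamma = w_gamma @ pooled  # shape (C,)
    beta = w_beta @ pooled    # shape (C,)

    # Apply the affine modulation, broadcasting over the spatial dims.
    return gamma[:, None, None] * main_feat + beta[:, None, None]


rng = np.random.default_rng(0)
main = rng.standard_normal((4, 8, 8))  # stand-in for main U-Net features
hint = np.ones((4, 8, 8))              # stand-in for hint-branch features

# With these toy weights, gamma is 2 and beta is 1 for every channel.
fused = film_fuse(main, hint, 2.0 * np.eye(4), np.eye(4))
print(fused.shape)  # (4, 8, 8)
```

In this toy setup the output equals `2 * main + 1`, which shows the essential property of FiLM: the conditioning signal does not replace the main features but modulates them channel by channel. How the actual model computes the scale and shift, and at which layers the fusion is applied, is not specified in this section.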
