Automated Virtual Product Placement and Assessment in Images using Diffusion Models

Mohammad Mahmudul Alam, Negin Sokhandan, Emmett Goodman

TL;DR

The paper addresses automated virtual product placement in images by introducing a three-stage diffusion-based pipeline: (i) language-guided segmentation to select placement regions, (ii) DreamBooth-fine-tuned Stable Diffusion for product inpainting, and (iii) an Alignment Module that filters outputs to ensure the product appears correctly. The Alignment Module combines Content, Quality, and Volume checks, using CLIP-based and captioning signals, with mask-size control via erosion/dilation, and achieves a reported $35\%$ rise in image quality and zero product-missing outputs in experiments on two products. The approach is evaluated with quantitative metrics (Content, Quality, Volume, MAQS, MQS, MASS) and qualitative comparisons against Paint-by-Example, demonstrating stronger product likeness and consistent output quality. A SageMaker-based web app demonstrates end-to-end deployment and highlights practical considerations for scaling DreamBooth-based customization to large product catalogs in virtual advertising contexts.

Abstract

In Virtual Product Placement (VPP) applications, the discreet integration of specific brand products into images or videos has emerged as a challenging yet important task. This paper introduces a novel three-stage fully automated VPP system. In the first stage, a language-guided image segmentation model identifies optimal regions within images for product inpainting. In the second stage, Stable Diffusion (SD), fine-tuned with a few example product images, is used to inpaint the product into the previously identified candidate regions. The final stage introduces an "Alignment Module", which is designed to effectively sieve out low-quality images. Comprehensive experiments demonstrate that the Alignment Module ensures the presence of the intended product in every generated image and enhances the average quality of images by 35%. The results presented in this paper demonstrate the effectiveness of the proposed VPP system, which holds significant potential for transforming the landscape of virtual advertising and marketing strategies.
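
The filtering role of the final stage can be illustrated with a minimal sketch of a cascaded check. This is not the paper's implementation: the `Scores` fields, the thresholds, and the size labels are hypothetical placeholders standing in for the outputs of the Content, Quality, and Volume checks.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical per-image scores; real values would come from the sub-modules."""
    content: float  # assumed: confidence that the intended product is present
    quality: float  # assumed: image-quality score in [0, 1]
    volume: str     # assumed: predicted size label ("small", "medium", "large")


def passes_alignment(scores, content_min=0.5, quality_min=0.5,
                     allowed_sizes=("medium",)):
    """Apply the three checks in sequence; reject on the first failure.

    All thresholds and allowed sizes are illustrative assumptions.
    """
    if scores.content < content_min:   # Content check: product must be present
        return False
    if scores.quality < quality_min:   # Quality check: drop low-quality images
        return False
    return scores.volume in allowed_sizes  # Volume check: product size is acceptable


# Made-up scores for three generated candidates.
candidates = {
    "img_a": Scores(content=0.2, quality=0.9, volume="medium"),  # product missing
    "img_b": Scores(content=0.8, quality=0.7, volume="medium"),  # passes all checks
    "img_c": Scores(content=0.9, quality=0.8, volume="large"),   # product too large
}
kept = [name for name, s in candidates.items() if passes_alignment(s)]
print(kept)
```

Running the checks in sequence means an image that fails an early check is rejected without evaluating the later ones.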

Paper Structure (18 sections, 8 figures, 2 tables)

This paper contains 18 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: An illustration of the proposed VPP system with an Amazon Echo Dot device. The input background image is shown in (a), and the inpainted output image is shown in (b), where an Amazon Echo Dot device is placed on the kitchen countertop by automatically identifying the optimal location.
  • Figure 2: The block diagram of the proposed solution for the VPP system, where the three stages are distinguished by differently colored blocks. In stage 1, a suitable placement region for product inpainting is determined by creating a mask using the CLIPSeg and ViLT models. Next, in stage 2, semantic inpainting is performed in the masked area using the fine-tuned DreamBooth model. Finally, stage 3 contains the cascaded sub-modules of the Alignment Module, which discard low-quality images.
  • Figure 3: Block diagram of each component of the Alignment Module. The Content sub-module, shown in (a), is built using a pre-trained caption generator and a CLIP model. The generated caption is refined by adding the name of the intended product to it. The Quality sub-module, shown in (b), uses the image features of the same CLIP model. Finally, the Volume sub-module, shown in (c), uses the same CLIP model with three text prompts for different sizes.
  • Figure 4: Application of erosion to the mask, using a kernel of size $5\times5$ for 0, 10, 20, and 25 iterations, shown in that order. The resulting output is presented below the corresponding mask to show the size reduction of the generated product in the output image.
  • Figure 5: Inpainted product images from Paint-by-Example (PBE). PBE generates high-quality images, which explains the higher CLIP score in the case of Lupure Vitamin C. However, the inpainted product does not resemble the desired product at all, resulting in very poor mean assigned quality and size scores. Output images for Amazon Echo Dot are shown in (a) and (b), and for Lupure Vitamin C in (c) and (d).
  • ...and 3 more figures
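
The mask erosion described in the Figure 4 caption can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: the helper names (`erode`, `area`, `square_mask`), the toy 40×40 mask, and the small iteration counts are assumptions chosen so the example runs quickly, and pixels outside the image are assumed not to constrain the result.

```python
def erode(mask, kernel_size=5, iterations=1):
    """Binary erosion of a 2D 0/1 mask with a square kernel of ones.

    A pixel stays 1 only if every in-bounds pixel under the kernel is 1.
    Out-of-bounds neighbours are ignored (an assumption of this sketch).
    """
    half = kernel_size // 2
    height, width = len(mask), len(mask[0])
    current = [row[:] for row in mask]
    for _ in range(iterations):
        eroded = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                keep = 1
                for dy in range(-half, half + 1):
                    for dx in range(-half, half + 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < height and 0 <= nx < width:
                            if current[ny][nx] == 0:
                                keep = 0
                eroded[y][x] = keep
        current = eroded
    return current


def area(mask):
    """Number of foreground pixels in the mask."""
    return sum(sum(row) for row in mask)


def square_mask(size, lo, hi):
    """size x size mask with a filled square covering rows/cols lo..hi-1."""
    return [[1 if lo <= y < hi and lo <= x < hi else 0 for x in range(size)]
            for y in range(size)]


mask = square_mask(40, 10, 30)  # toy 20x20 placement region
areas = [area(erode(mask, kernel_size=5, iterations=n)) for n in (0, 1, 2, 3)]
print(areas)  # [400, 256, 144, 64]
```

Each iteration with a $5\times5$ kernel trims two pixels from every side of the region, so more iterations give a smaller mask and therefore a smaller inpainted product.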