Preserving Product Fidelity in Large Scale Image Recontextualization with Diffusion Models

Ishaan Malhi, Praneet Dutta, Ellie Talius, Sally Ma, Brendan Driscoll, Krista Holden, Garima Pruthi, Arunachalam Narayanaswamy

TL;DR

This work tackles the fidelity gap in product recontextualization by introducing a diffusion-based framework augmented with synthetic data pipelines. It combines novel view generation, background disentanglement via outpainting, and negative counterfactuals, followed by captioning, data filtering, and LoRA-based finetuning to preserve product details across diverse contexts. Post-finetuning ranking using multimodal embeddings selects high-quality generations, achieving higher human- and metric-based fidelity than baselines and enabling realistic relighting, occlusions, and novel viewpoints at scale. The approach demonstrates strong performance on ABO and a private dataset, offering practical implications for e-commerce and virtual product showcasing without requiring extensive model surgery. Overall, the paper advances scalable, high-fidelity product recontextualization by tightly integrating data augmentation, perceptual alignment, and efficient finetuning strategies.
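The ranking step mentioned above can be illustrated with a minimal sketch. It assumes each image has already been mapped to an embedding vector by some multimodal encoder (not shown), and that candidates are scored by cosine similarity to the embedding of a reference product image. The function names, the choice of cosine similarity, and the `top_k` parameter are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_generations(
    reference_embedding: Sequence[float],
    candidate_embeddings: Sequence[Sequence[float]],
    top_k: int,
) -> List[int]:
    """Return indices of the top_k candidates most similar to the reference.

    Hypothetical helper: the embeddings are assumed to come from a
    multimodal encoder applied to the reference and generated images.
    """
    scored = [
        (cosine_similarity(reference_embedding, emb), idx)
        for idx, emb in enumerate(candidate_embeddings)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]
```

In this sketch, generations whose embeddings drift away from the reference product are ranked lower and can be dropped before any human review.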

Abstract

We present a framework for high-fidelity product image recontextualization using text-to-image diffusion models and a novel data augmentation pipeline. This pipeline leverages image-to-video diffusion, in/outpainting & negatives to create synthetic training data, addressing limitations of real-world data collection for this task. Our method improves the quality and diversity of generated images by disentangling product representations and enhancing the model's understanding of product characteristics. Evaluation on the ABO dataset and a private product dataset, using automated metrics and human assessment, demonstrates the effectiveness of our framework in generating realistic and compelling product visualizations, with implications for applications such as e-commerce and virtual product showcasing.
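To make the role of an object mask in background replacement concrete, here is a small sketch of mask-based compositing: pixels inside the mask come from the product image, pixels outside come from a new background. This is plain alpha compositing on grayscale images stored as nested lists, chosen only for illustration. The paper's pipeline relies on diffusion-based in/outpainting rather than this operation, and the `composite` function is a hypothetical helper.

```python
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0, 1]


def composite(product: Image, background: Image, mask: Image) -> Image:
    """Blend product over background using a per-pixel mask in [0, 1].

    mask == 1 keeps the product pixel, mask == 0 keeps the background pixel.
    Illustrative only: a diffusion outpainting model would synthesize the
    background instead of copying it from an existing image.
    """
    if not (len(product) == len(background) == len(mask)):
        raise ValueError("all inputs must have the same height")
    result: Image = []
    for p_row, b_row, m_row in zip(product, background, mask):
        if not (len(p_row) == len(b_row) == len(m_row)):
            raise ValueError("all inputs must have the same width")
        result.append(
            [m * p + (1.0 - m) * b for p, b, m in zip(p_row, b_row, m_row)]
        )
    return result
```

Keeping the product pixels fixed under the mask is what lets the background vary while the product itself stays unchanged across training samples.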

Paper Structure

This paper contains 18 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Given a few input images of a real world product, our system can generate images that not only maintain high fidelity to the original product, but also recontextualize it in novel settings beyond background changes: from showcasing it in a new perspective, adding object occlusions, to creating different and realistic lighting conditions.
  • Figure 2: Novel view generated using image-to-video diffusion. Left: Input image. Right: Generated view.
  • Figure 3: Samples of image positives for a given table, along with its object mask used for background replacement/outpainting.
  • Figure 4: