Preserving Product Fidelity in Large Scale Image Recontextualization with Diffusion Models
Ishaan Malhi, Praneet Dutta, Ellie Talius, Sally Ma, Brendan Driscoll, Krista Holden, Garima Pruthi, Arunachalam Narayanaswamy
TL;DR
This work tackles the fidelity gap in product recontextualization with a diffusion-based framework trained on synthetic data. The data pipeline combines novel view generation, background disentanglement via outpainting, and negative counterfactuals, followed by captioning, data filtering, and LoRA-based finetuning, so that product details are preserved across diverse contexts. After finetuning, generations are ranked using multimodal embeddings to select the highest-quality outputs. The approach achieves higher fidelity than baselines under both human and automated evaluation, and enables realistic relighting, occlusions, and novel viewpoints at scale. It performs strongly on ABO and a private dataset without requiring extensive model surgery, which makes it practical for e-commerce and virtual product showcasing.
Abstract
We present a framework for high-fidelity product image recontextualization using text-to-image diffusion models and a novel data augmentation pipeline. This pipeline leverages image-to-video diffusion, inpainting, outpainting, and negative examples to create synthetic training data, addressing the limitations of real-world data collection for this task. Our method improves the quality and diversity of generated images by disentangling product representations and enhancing the model's understanding of product characteristics. Evaluation on the ABO dataset and a private product dataset, using automated metrics and human assessment, demonstrates the effectiveness of our framework in generating realistic and compelling product visualizations, with implications for applications such as e-commerce and virtual product showcasing.
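The TL;DR mentions a post-finetuning step that ranks generations using multimodal embeddings. As a rough illustration only, the sketch below ranks candidates by cosine similarity between a reference product embedding and each generated image's embedding. The embedding model, the similarity measure, and the names `cosine_similarity` and `rank_generations` are assumptions made for this example; the text above does not specify them.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_generations(
    reference_embedding: Sequence[float],
    candidate_embeddings: Sequence[Sequence[float]],
    top_k: int = 1,
) -> List[Tuple[int, float]]:
    """Return (candidate_index, score) pairs for the top_k candidates, best first.

    Hypothetical helper: the embeddings are assumed to come from some
    multimodal encoder, which is not shown or specified here.
    """
    scored = [
        (index, cosine_similarity(reference_embedding, embedding))
        for index, embedding in enumerate(candidate_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy usage with hand-written 2-D vectors standing in for real embeddings.
reference = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranked = rank_generations(reference, candidates, top_k=2)
```

In this toy case the candidate identical to the reference ranks first. A real pipeline would replace the hand-written vectors with encoder outputs and might combine several scores rather than one similarity.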
