Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

Fulvio Sanguigni, Davide Morelli, Marcella Cornia, Rita Cucchiara

TL;DR

Fashion-RAG tackles the challenge of multimodal fashion image editing without requiring a garment input by introducing a retrieval-augmented generation framework. It retrieves multiple garments matching a textual query and integrates their visual attributes into a diffusion-based inpainting pipeline through textual inversion, aligning retrieved content with CLIP embeddings to condition the generation. The method demonstrates superior realism and fidelity on the Dress Code dataset, outperforming state-of-the-art baselines in both paired and unpaired settings, and offers insights into how the number of retrieved items and description richness affect quality. By embedding external garment information directly into the generation process, Fashion-RAG broadens controllability and personalization in fashion AI applications with practical impact for virtual try-on and image editing workflows.

Abstract

In recent years, the fashion industry has increasingly adopted AI technologies to enhance customer experience, driven by the proliferation of e-commerce platforms and virtual applications. Among the various tasks, virtual try-on and multimodal fashion image editing -- which utilizes diverse input modalities such as text, garment sketches, and body poses -- have become a key area of research. Diffusion models have emerged as a leading approach for such generative tasks, offering superior image quality and diversity. However, most existing virtual try-on methods rely on having a specific garment input, which is often impractical in real-world scenarios where users may only provide textual specifications. To address this limitation, in this work we introduce Fashion Retrieval-Augmented Generation (Fashion-RAG), a novel method that enables the customization of fashion items based on user preferences provided in textual form. Our approach retrieves multiple garments that match the input specifications and generates a personalized image by incorporating attributes from the retrieved items. To achieve this, we employ textual inversion techniques, where retrieved garment images are projected into the textual embedding space of the Stable Diffusion text encoder, allowing seamless integration of retrieved elements into the generative process. Experimental results on the Dress Code dataset demonstrate that Fashion-RAG outperforms existing methods both qualitatively and quantitatively, effectively capturing fine-grained visual details from retrieved garments. To the best of our knowledge, this is the first work to introduce a retrieval-augmented generation approach specifically tailored for multimodal fashion image editing.
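
The retrieval step described above (selecting several garments that match a textual query) can be illustrated with a minimal sketch. It assumes, without confirmation from the text, that the query and the garments are embedded in a shared CLIP-like space and ranked by cosine similarity. The function names `cosine` and `retrieve_top_k` and the toy 3-dimensional vectors are hypothetical and stand in for real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_top_k(query_emb, garment_embs, k=3):
    """Return (index, score) pairs for the k garments most similar to the query."""
    scored = [(i, cosine(query_emb, g)) for i, g in enumerate(garment_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings standing in for encoder features (hypothetical values).
garment_embs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_emb = [1.0, 0.2, 0.0]  # assumed embedding of the text query

top = retrieve_top_k(query_emb, garment_embs, k=2)
top_indices = [i for i, _ in top]
```

In a real system the embeddings would come from pretrained text and image encoders and the garment database would be indexed ahead of time; the ranking logic would stay the same.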

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the proposed retrieval-augmented multimodal fashion image editing framework. The model leverages a diffusion-based inpainting pipeline, taking as input a masked reference image, a pose map, a binary mask indicating the editable region, and multimodal conditioning signals, including text descriptions and retrieved garments. Retrieved garments are projected into the CLIP textual space and combined with the textual embeddings to enhance the U-Net cross-attention mechanism. The U-Net iteratively denoises the latent representation over multiple steps, and the VAE decoder generates the final image.
  • Figure 2: Visual comparison between our method (rightmost column) and other multimodal competitors. Previous methods struggle to adhere to some types of textual input, whereas our method follows them: it renders the correct garment length (row 1) without artifacts, reproduces fine-grained patterns and fabric textures (row 2), and generates the correct color and additional objects (such as the belt, row 3).
  • Figure 3: Qualitative comparison between images generated with and without retrieval augmentation, along with the top-3 retrieved garments.
  • Figure 4: Sample failure case results showing the limitations of retrieval-augmented generation.
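
The conditioning path in the Figure 1 caption (retrieved garments projected into the CLIP textual space and combined with the textual embeddings) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the single linear projection, the number of pseudo-tokens per garment, the toy dimensions, and the names `linear`, `garments_to_pseudo_tokens`, and `build_conditioning` are all hypothetical.

```python
import random

def linear(vec, weight, bias):
    """Apply y = W x + b, with W stored as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def garments_to_pseudo_tokens(garment_feats, weight, bias, tokens_per_garment, text_dim):
    """Map each garment feature to `tokens_per_garment` vectors of size `text_dim`."""
    tokens = []
    for feat in garment_feats:
        flat = linear(feat, weight, bias)  # length: tokens_per_garment * text_dim
        for t in range(tokens_per_garment):
            tokens.append(flat[t * text_dim:(t + 1) * text_dim])
    return tokens

def build_conditioning(text_tokens, pseudo_tokens):
    """Concatenate text and garment tokens along the sequence axis."""
    return text_tokens + pseudo_tokens

# Toy sizes, chosen only for illustration (assumptions, not values from the paper).
random.seed(0)
visual_dim, text_dim, tokens_per_garment = 6, 4, 2
num_garments, num_text_tokens = 3, 5
out_dim = tokens_per_garment * text_dim

weight = [[random.gauss(0.0, 0.1) for _ in range(visual_dim)] for _ in range(out_dim)]
bias = [0.0] * out_dim
garment_feats = [[random.gauss(0.0, 1.0) for _ in range(visual_dim)] for _ in range(num_garments)]
text_tokens = [[random.gauss(0.0, 1.0) for _ in range(text_dim)] for _ in range(num_text_tokens)]

pseudo_tokens = garments_to_pseudo_tokens(garment_feats, weight, bias, tokens_per_garment, text_dim)
conditioning = build_conditioning(text_tokens, pseudo_tokens)
```

The resulting sequence is what a cross-attention layer in the denoising U-Net would attend over. Concatenating along the sequence axis leaves the text tokens untouched, so the garment information is added to the prompt and does not replace it.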