UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

Xiangyu Zhao, Yuehan Zhang, Wenlong Zhang, Xiao-Ming Wu

TL;DR

UniFashion is a unified framework that simultaneously tackles multimodal generation and retrieval tasks in the fashion domain, integrating image generation with retrieval and text generation tasks.

Abstract

The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. Rapid advances in AI-generated content, particularly large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models to the fashion domain. However, tasks involving embeddings, such as image-to-text or text-to-image retrieval, have been largely overlooked from this perspective due to the diverse nature of the multimodal fashion domain. Moreover, current research on multi-task single models lacks focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval tasks and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and an LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks, and can be readily adapted to manage complex vision-language tasks. This work demonstrates the potential learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain. The source code is available at https://github.com/xiangyu-mm/UniFashion.

Paper Structure

This paper contains 38 sections, 15 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Illustration of the fashion tasks encompassed in our UniFashion framework: cross-modal retrieval, text-guided image retrieval, fashion image captioning, and fashion image generation. Model inputs are highlighted with a light yellow background, and outputs with a light blue background.
  • Figure 2: Overview of the training framework of our UniFashion model. Phase 1 - Cross-modal Pre-training: UniFashion acquires robust cross-modal fashion representation capabilities through pre-training, leveraging both the language model and the diffusion model. Phase 2 - Composed Multimodal Fine-tuning: The model undergoes fine-tuning to process both image and text inputs, refining its ability to learn composed modal representations. This is achieved by aligning the multimodal encoder with the LLM and the diffusion model for enhanced performance.
  • Figure 3: The architecture of UniFashion for fine-tuning on the image editing task. First, we supply the cloth sketch and text guidance to the multimodal encoder. The diffusion model then receives the output of the multimodal encoder, along with the cloth sketches and human features (i.e., agnostic-mask), and generates the desired images.
  • Figure 4: Vocabulary of the frequent words scaled by frequency for dresses.
  • Figure 5: Illustration of Instruction-Following Data. The top section displays an image alongside its original captions from Fashion-IQ dataset. The bottom section presents detailed captions generated by LLaVA-1.5. The original captions are not prompts for generation but are provided for comparison with the newly generated caption.
  • ...and 3 more figures