FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model

Kaicheng Pang, Xingxing Zou, Waikeung Wong

TL;DR

FashionM3 addresses interactive, personalized fashion recommendations by integrating a unified vision-language model with multimodal inputs and iterative dialogue. It introduces FashionVLM, fine-tuned on the FashionRec dataset to support multitask outputs including recommendations and image generation, governed by the Model Context Protocol (MCP). The paper demonstrates superior semantic alignment and personalization against baselines, supported by quantitative metrics and a user study that highlights practical value and user experience. The work advances practical, end-to-end fashion assistants capable of real-time refinement, visualization, and alternatives, offering significant potential for enhanced online styling workflows.

Abstract

Fashion styling and personalized recommendations are pivotal in modern retail, contributing substantial economic value in the fashion industry. With the advent of vision-language models (VLM), new opportunities have emerged to enhance retailing through natural language and visual interactions. This work proposes FashionM3, a multimodal, multitask, and multiround fashion assistant, built upon a VLM fine-tuned for fashion-specific tasks. It helps users discover satisfying outfits by offering multiple capabilities including personalized recommendation, alternative suggestion, product image generation, and virtual try-on simulation. Fine-tuned on the novel FashionRec dataset, comprising 331,124 multimodal dialogue samples across basic, personalized, and alternative recommendation tasks, FashionM3 delivers contextually personalized suggestions with iterative refinement through multiround interactions. Quantitative and qualitative evaluations, alongside user studies, demonstrate FashionM3's superior performance in recommendation effectiveness and practical value as a fashion assistant.

Paper Structure

This paper contains 21 sections, 5 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Three examples from the FashionRec dataset illustrate its three task types. Blue and green text highlight key elements of the human and assistant dialogue, respectively, generated by a large language model based on provided outfit and recommendation scenarios. Basic recommendation suggests a target item (e.g., blue jeans) to complete a partial outfit (e.g., t-shirt and sneakers), both derived from the same human-curated outfit. Personalized recommendation incorporates the user's interaction history (e.g., user prefers light colors) to deliver personalized recommendations (e.g., light blue straight-leg jeans). Alternative recommendation suggests an alternative item (e.g., denim shorts) to replace a fashion piece of the same category (e.g., pants) in an outfit, using two human-curated outfits sharing at least two common items.
  • Figure 2: Training pipeline of FashionVLM, showcasing the multimodal recommendation and text-to-image generation tasks. In the multimodal recommendation task, a user provides an outfit and a preference for jeans, receiving a recommendation for dark-wash skinny jeans. In the text-to-image generation task, the model generates an image of high-heeled black leather ankle boots based on a provided textual description. Inputs are processed by a text tokenizer and an image tokenizer, which convert queries and images into text and image tokens for FashionVLM. Outputs are then transformed back into textual and visual formats using a text de-tokenizer and an image de-tokenizer.
  • Figure 3: Overview of FashionM3's architecture, orchestrating the flow of information across key components: multimodal query understanding, user data retrieval, FashionVLM for recommendation and image generation, and integration of external tools.
  • Figure 4: Qualitative results of generated images for the personalized recommendation task.
  • Figure 5: Qualitative results of generated images for the alternative recommendation task.
  • ...and 5 more figures
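
The pairing rules in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' data pipeline: the `Item` type, the function names `basic_pairs` and `alternative_pairs`, and the `min_shared` parameter are all hypothetical, chosen only to mirror the caption's wording. The personalized task is left out because it depends on user interaction history, which the caption does not specify.

```python
from typing import NamedTuple


class Item(NamedTuple):
    """Hypothetical item record; the real dataset schema is not shown here."""
    name: str
    category: str


def basic_pairs(outfit):
    """Basic recommendation: hold out each item in turn as the target,
    leaving the rest of the same outfit as the partial outfit."""
    return [(outfit[:i] + outfit[i + 1:], outfit[i]) for i in range(len(outfit))]


def alternative_pairs(outfit_a, outfit_b, min_shared=2):
    """Alternative recommendation: if two outfits share at least `min_shared`
    items, pair each non-shared item in A with a non-shared item in B
    of the same category (original item, suggested alternative)."""
    shared = set(outfit_a) & set(outfit_b)
    if len(shared) < min_shared:
        return []
    return [
        (a, b)
        for a in outfit_a if a not in shared
        for b in outfit_b if b not in shared and b.category == a.category
    ]


# Example items taken from the caption's own examples.
tee = Item("t-shirt", "top")
sneakers = Item("sneakers", "shoes")
jeans = Item("blue jeans", "pants")
shorts = Item("denim shorts", "pants")

outfit_a = [tee, sneakers, jeans]
outfit_b = [tee, sneakers, shorts]

basic = basic_pairs(outfit_a)
alternatives = alternative_pairs(outfit_a, outfit_b)
```

Here `basic` contains one (partial outfit, target) pair per item in `outfit_a`, and `alternatives` pairs the jeans with the denim shorts because the two outfits share the t-shirt and sneakers. Outfits with fewer than two shared items yield no alternative pairs.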