FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model
Kaicheng Pang, Xingxing Zou, Waikeung Wong
TL;DR
FashionM3 addresses interactive, personalized fashion recommendation by combining a unified vision-language model with multimodal inputs and iterative dialogue. It introduces FashionVLM, fine-tuned on the FashionRec dataset, to support multitask outputs including recommendations and image generation, governed by the Model Context Protocol (MCP). The paper reports better semantic alignment and personalization than baselines, supported by quantitative metrics and a user study that highlights practical value and user experience. The work advances practical, end-to-end fashion assistants capable of real-time refinement, visualization, and alternative suggestions, with potential to improve online styling workflows.
Abstract
Fashion styling and personalized recommendations are pivotal in modern retail, contributing substantial economic value to the fashion industry. With the advent of vision-language models (VLMs), new opportunities have emerged to enhance retailing through natural language and visual interactions. This work proposes FashionM3, a multimodal, multitask, and multiround fashion assistant built upon a VLM fine-tuned for fashion-specific tasks. It helps users discover satisfying outfits by offering multiple capabilities, including personalized recommendation, alternative suggestion, product image generation, and virtual try-on simulation. Fine-tuned on the novel FashionRec dataset, which comprises 331,124 multimodal dialogue samples across basic, personalized, and alternative recommendation tasks, FashionM3 delivers contextually personalized suggestions with iterative refinement through multiround interactions. Quantitative and qualitative evaluations, alongside user studies, demonstrate FashionM3's superior performance in recommendation effectiveness and its practical value as a fashion assistant.
