OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning

Zhengwei Yang, Andi Long, Hao Li, Zechao Hu, Kui Jiang, Zheng Wang

TL;DR

The authors construct FashionX, a million-scale dataset that exhaustively annotates the visible fashion items within an outfit and organizes attributes from global to part level, and propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a single fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue.

Abstract

Fashion intelligence spans multiple tasks, namely retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. Therefore, we construct FashionX, a million-scale dataset that exhaustively annotates the visible fashion items within an outfit and organizes attributes from global to part level. Built upon this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a single fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multiple subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, offering a scalable path toward universal, dialogue-oriented fashion intelligence.

Paper Structure

This paper contains 29 sections, 8 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Users interacting with fashion systems require various abilities. Traditional systems address these tasks separately, while general-purpose VLMs offer generic yet shallow responses. OmniFashion unifies multi-task learning with fashion-aware perception and interactive reasoning.
  • Figure 2: Comparison between existing fashion datasets and FashionX. Prior datasets show incomplete and inconsistent annotations (marked in red), while FashionX offers unified head-to-toe coverage with a hierarchical annotation structure spanning descriptions, global-level semantics, and part-level semantics.
  • Figure 3: Overview of the FashionX annotation pipeline.
  • Figure 4: Pipeline of OmniFashion. OmniFashion builds on the FashionX dataset: training data is first constructed by pairing each garment with its corresponding description and attributes. The VLM output is then penalized against the constructed dialogue, which serves as the answer.
  • Figure 5: Illustration of the progressive learning task design in OmniFashion. Different colors denote different tasks.
  • ...and 3 more figures
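
The captions for Figures 2 and 4 describe a hierarchical annotation (description, global-level, part-level) that is turned into dialogue used as the target answer. As a rough illustration only, the idea might be sketched as below. Every class, field name, and question template here is a hypothetical stand-in, not the FashionX schema or the OmniFashion prompt format, neither of which is given in this summary.

```python
from dataclasses import dataclass, field


@dataclass
class ItemAnnotation:
    """One visible fashion item (hypothetical schema, not the FashionX format)."""
    category: str
    global_attrs: dict[str, str] = field(default_factory=dict)
    # part name -> {attribute name -> value}
    part_attrs: dict[str, dict[str, str]] = field(default_factory=dict)


@dataclass
class OutfitAnnotation:
    """Outfit-level description plus per-item annotations (hypothetical)."""
    description: str
    items: list[ItemAnnotation] = field(default_factory=list)


def to_dialogue(outfit: OutfitAnnotation) -> list[tuple[str, str]]:
    """Flatten the hierarchy into (question, answer) turns.

    The question templates are illustrative assumptions; the answers are
    the annotated values, which would act as supervision targets.
    """
    turns = [("What is the person wearing?", outfit.description)]
    for item in outfit.items:
        # Global-level attributes of the item.
        for name, value in item.global_attrs.items():
            turns.append((f"What is the {name} of the {item.category}?", value))
        # Part-level attributes of the item.
        for part, attrs in item.part_attrs.items():
            for name, value in attrs.items():
                question = f"What is the {name} of the {item.category}'s {part}?"
                turns.append((question, value))
    return turns


# Made-up example record, for illustration only.
example = OutfitAnnotation(
    description="A red midi dress with white sneakers.",
    items=[
        ItemAnnotation(
            category="dress",
            global_attrs={"color": "red", "length": "midi"},
            part_attrs={"sleeve": {"length": "short"}},
        ),
        ItemAnnotation(category="sneakers", global_attrs={"color": "white"}),
    ],
)
dialogue = to_dialogue(example)
```

Each resulting pair could serve as one question-answer turn, with the answer text as the training target. How the actual system orders tasks or phrases its prompts is not specified in this summary.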