Muse: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles

Zihan Wang, Xiaocui Yang, Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang

TL;DR

Muse addresses the gap between text-only conversational recommendation and real-world multimodal shopping by introducing the first multimodal conversational recommendation (MCR) dataset, built with scenario-grounded user profiles. A multi-agent framework powered by multimodal large language models (MLLMs) automatically synthesizes 7,000 conversations (83,148 utterances) in the clothing domain, supported by a local multimodal product database and a two-step user profiling process. Human and LLM-based evaluations indicate high fluency, diversity, and multimodal coherence, and fine-tuning open-source MLLMs on Muse shows that its recommendation and response patterns are learnable, supporting its use as a scalable benchmark for MCR. Because user profiles are derived from scenarios rather than manual design or historical data, the approach may extend to domains beyond clothing.

Abstract

Current conversational recommendation systems focus predominantly on text. However, real-world recommendation settings are generally multimodal, which leaves a significant gap between existing research and practical applications. To address this issue, we propose Muse, the first multimodal conversational recommendation dataset. Muse comprises 83,148 utterances from 7,000 conversations centered on the clothing domain. Each conversation contains comprehensive multimodal interactions, rich elements, and natural dialogues. Data in Muse are automatically synthesized by a multi-agent framework powered by multimodal large language models (MLLMs). For better scalability, the framework derives user profiles from real-world scenarios rather than relying on manual design or historical data, and then simulates and optimizes the conversations. Both human and LLM evaluations demonstrate the high quality of conversations in Muse. Additionally, fine-tuning experiments on three MLLMs show that Muse contains learnable patterns for recommendations and responses, confirming its value for multimodal conversational recommendation. Our dataset and code are available at https://anonymous.4open.science/r/Muse-0086.

Paper Structure

This paper contains 46 sections, 21 figures, and 9 tables.

Figures (21)

  • Figure 1: Comparison of data cases from Redial and Muse. Red denotes interactions about visual features, and green shows scenario-related content.
  • Figure 2: The multi-agent framework for synthesizing MCR data in Muse.
  • Figure 3: Workflow of the scenario-grounded user profile generator.
  • Figure 4: Distribution of dialogue elements in Muse.
  • Figure 5: Utterance-level comparison: the quality between human responses and Muse's utterances,
  • ...and 16 more figures