Multi-Turn Multi-Modal Question Clarification for Enhanced Conversational Understanding

Kimia Ramezan, Alireza Amiri Bavandpour, Yifei Yuan, Clemencia Siro, Mohammad Aliannejadi

TL;DR

This work defines the Multi-turn Multi-modal Clarifying Questions (MMCQ) task, which refines user intent through progressive dialogue that incorporates visual context. It introduces the ClariMM dataset and the Mario retrieval framework, a two-phase system that uses BM25 for initial retrieval followed by a multi-modal generative re-ranking model operating on text and images, trained with a constrained decoding objective. Experiments show that incorporating images and multi-turn interaction improves MRR by up to 12.88% and that Mario outperforms uni-modal and single-turn baselines, especially for longer conversations and unseen topics. The results highlight the practical impact of progressive, visually grounded clarification for open-domain conversational search, and the work releases a public dataset to spur further research.

Abstract

Conversational query clarification enables users to refine their search queries through interactive dialogue, improving search effectiveness. Traditional approaches rely on text-based clarifying questions, which often fail to capture complex user preferences, particularly those involving visual attributes. While recent work has explored single-turn multi-modal clarification with images alongside text, such methods do not fully support the progressive nature of user intent refinement over multiple turns. Motivated by this, we introduce the Multi-turn Multi-modal Clarifying Questions (MMCQ) task, which combines text and visual modalities to refine user queries in a multi-turn conversation. To facilitate this task, we create a large-scale dataset named ClariMM comprising over 13k multi-turn interactions and 33k question-answer pairs containing multi-modal clarifying questions. We propose Mario, a retrieval framework that employs a two-phase ranking strategy: initial retrieval with BM25, followed by a multi-modal generative re-ranking model that integrates textual and visual information from conversational history. Our experiments show that multi-turn multi-modal clarification outperforms uni-modal and single-turn approaches, improving MRR by 12.88%. The gains are most significant in longer interactions, demonstrating the value of progressive refinement for complex queries.
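The two-phase strategy described in the abstract (lexical retrieval, then re-ranking with conversational context) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `bm25_scores`, `two_phase_rank`, and `overlap_reranker` are hypothetical, the BM25 parameters `k1` and `b` are common defaults rather than values from the paper, and the second phase uses a simple text-overlap placeholder where Mario uses a multi-modal generative model over text and images.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; an assumption made for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in tokenized) / n or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def two_phase_rank(query, history, docs, rerank_fn, top_k=3):
    """Phase 1: keep the top_k BM25 candidates. Phase 2: re-order them."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: rerank_fn(query, history, docs[i]), reverse=True)


def overlap_reranker(query, history, doc):
    # Placeholder for the re-ranking model: counts tokens shared between the
    # document and the query plus clarification history. It ignores images.
    context = set(tokenize(query + " " + " ".join(history)))
    return len(context & set(tokenize(doc)))


docs = ["red running shoes for men", "blue dress shoes", "history of rome"]
history = ["do you prefer red or blue?", "red"]
print(two_phase_rank("shoes", history, docs, overlap_reranker, top_k=2))
```

In this toy run, BM25 alone favors the shorter document about blue shoes, while the clarification answer "red" in the history moves the red-shoes document to the top after re-ranking, which is the effect that multi-turn clarification is meant to provide.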

Paper Structure

This paper contains 25 sections, 6 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: An example conversation comparing the multi-modal query clarification under single-turn and multi-turn scenarios.
  • Figure 2: Overview of the Mario two-phase retrieval framework.
  • Figure 3: P@5 scores under different turn counts in ClariMM.
  • Figure 4: Dataset creation pipeline.
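The summary reports results in MRR, and Figure 3 reports P@5. As a reminder of what these metrics measure, here is a minimal sketch using their standard definitions; the function names and the toy inputs are hypothetical and are not taken from the paper's evaluation code.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


runs = [(["a", "b", "c"], {"b"}), (["x", "y"], {"x"})]
print(mean_reciprocal_rank(runs))
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c"}, k=5))
```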