Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search

Yifei Yuan, Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke, Wai Lam

TL;DR

This paper investigates whether open-domain, mixed-initiative conversational search can be improved by adding multimodal content, specifically images, to clarifying questions. It introduces the Melon dataset (over 4k multimodal clarifying questions with over 14k images) and Marto, a prompt-based multimodal generative retrieval model built on VLT5 that decides when to attach images and generates document identifiers. Experiments show that adding images yields large gains in retrieval effectiveness (up to 90% on certain metrics) and that Marto outperforms discriminative baselines in both effectiveness and efficiency, with faster training and inference. The work also finds that multimodal content leads to more contextualized and informative user responses, and it outlines directions for future research, including multi-turn multimodal query clarification (MQC) and the use of multimodal LLM backbones such as BLIP-2 or GPT-4V for further gains.

Abstract

In mixed-initiative conversational search systems, clarifying questions are used to help users who struggle to express their intentions in a single query. These questions aim to uncover users' information needs and resolve query ambiguities. We hypothesize that in scenarios where multimodal information is pertinent, the clarification process can be improved by using non-textual information. Therefore, we propose to add images to clarifying questions and formulate the novel task of asking multimodal clarifying questions in open-domain, mixed-initiative conversational search systems. To facilitate research into this task, we collect a dataset named Melon that contains over 4k multimodal clarifying questions, enriched with over 14k images. We also propose a multimodal query clarification model named Marto and adopt a prompt-based, generative fine-tuning strategy to perform the training of different stages with different prompts. Several analyses are conducted to understand the importance of multimodal contents during the query clarification phase. Experimental results indicate that the addition of images leads to significant improvements of up to 90% in retrieval performance when selecting the relevant images. Extensive analyses are also performed to show the superiority of Marto compared with discriminative baselines in terms of effectiveness and efficiency.

Paper Structure (17 sections, 3 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: An example of incorporating multimodal information into the query clarification phase.
  • Figure 2: Workflow for adding an MQC phase to a conversational search system. Hashed modules remain the same as in the unimodal clarification system presented by Aliannejadi et al. (SIGIR 2019).
  • Figure 3: Distribution of answer length, measured in terms, in (a) unimodal and (b) multimodal datasets. Density represents the proportion of each type of answer in the answer set.
  • Figure 4: The three modules in the Marto model.
  • Figure 5: The detailed structure of Encoder $\Phi$ and Decoder $\Psi$.
  • ...and 2 more figures