Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models

Chang-Sheng Kao, Yun-Nung Chen

TL;DR

This work tackles the challenge of image retrieval in dialogue by using large language models to generate dialogue-associated visual descriptors that connect conversations to candidate images. The approach employs visually-focused queries to elicit descriptors, then computes scene-aligned (text-only) and vision-aligned (multimodal) retrieval scores, enabling zero-shot and contrastive-learning-based image retrieval. Key findings show state-of-the-art performance on PhotoChat, with strong generalization to VisDial and MMDialog, and robust gains from descriptor ensembles and carefully chosen queries. The study demonstrates the practical potential of LLM-guided visual descriptor generation for bridging complex dialogues and images in real-world multimodal systems.

Abstract

Recent advancements in dialogue systems have highlighted the significance of integrating multimodal responses, which enable conveying ideas through diverse modalities rather than solely relying on text-based interactions. This enrichment not only improves overall communicative efficacy but also enhances the quality of conversational experiences. However, existing methods for dialogue-to-image retrieval face limitations due to the constraints of pre-trained vision language models (VLMs) in comprehending complex dialogues accurately. To address this, we present a novel approach leveraging the robust reasoning capabilities of large language models (LLMs) to generate precise dialogue-associated visual descriptors, facilitating seamless connection with images. Extensive experiments conducted on benchmark data validate the effectiveness of our proposed approach in deriving concise and accurate visual descriptors, leading to significant enhancements in dialogue-to-image retrieval performance. Furthermore, our findings demonstrate the method's generalizability across diverse visual cues, various LLMs, and different datasets, underscoring its practicality and potential impact in real-world applications.

Paper Structure (28 sections, 4 equations, 2 figures, 10 tables)

This paper contains 28 sections, 4 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: The framework of our proposed method. We employ the text encoder from a pre-trained VLM to encode both the descriptor and the object list. This yields two distinctive features, namely the descriptor embedding ($e_\text{desc}$) and the object list feature ($e_\text{obj}$). Additionally, we utilize the VLM's image encoder to process and encode the image, resulting in the image embedding ($e_\text{img}$). The final retrieval score is then computed by aggregating a scene-aligned score and a vision-aligned score.
  • Figure 2: Results of different $\lambda$ in zero-shot scenarios. A smaller $\lambda$ indicates greater reliance on the scene-aligned score, while a larger $\lambda$ indicates greater reliance on the vision-aligned score.
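
The two captions above suggest how the final score might be assembled. The following is a minimal sketch, not the authors' implementation: it assumes cosine similarity for both scores and a convex combination weighted by $\lambda$, which is consistent with the Figure 2 caption (small $\lambda$ favors the scene-aligned score, large $\lambda$ favors the vision-aligned score) but is not confirmed by the text shown here. The function names `cosine`, `retrieval_score`, and `rank_images` are hypothetical, and the embeddings are assumed to come from a pre-trained VLM's text and image encoders.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieval_score(e_desc, e_obj, e_img, lam=0.5):
    """Combine a scene-aligned and a vision-aligned score.

    ASSUMPTION: both scores are cosine similarities and are mixed as
    (1 - lam) * scene + lam * vision. The paper's exact formula may differ.
    """
    scene = cosine(e_desc, e_obj)   # text-only: descriptor vs. object list
    vision = cosine(e_desc, e_img)  # multimodal: descriptor vs. image
    return (1.0 - lam) * scene + lam * vision


def rank_images(e_desc, candidates, lam=0.5):
    """Return candidate indices sorted from best to worst score.

    `candidates` is a list of (e_obj, e_img) pairs, one per image.
    """
    scores = [retrieval_score(e_desc, e_obj, e_img, lam)
              for e_obj, e_img in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

With toy two-dimensional embeddings, setting `lam=0.0` ranks candidates purely by the descriptor-to-object-list similarity, while `lam=1.0` ranks them purely by the descriptor-to-image similarity, mirroring the trade-off described for Figure 2.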