Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models
Chang-Sheng Kao, Yun-Nung Chen
TL;DR
This work tackles the challenge of image retrieval in dialogue by using large language models to generate dialogue-associated visual descriptors that connect conversations to candidate images. The approach employs visually-focused queries to elicit descriptors, then computes scene-aligned (text-only) and vision-aligned (multimodal) retrieval scores, enabling zero-shot and contrastive-learning-based image retrieval. Key findings show state-of-the-art performance on PhotoChat, with strong generalization to VisDial and MMDialog, and robust gains from descriptor ensembles and carefully chosen queries. The study demonstrates the practical potential of LLM-guided visual descriptor generation for bridging complex dialogues and images in real-world multimodal systems.
Abstract
Recent advancements in dialogue systems have highlighted the significance of integrating multimodal responses, which enable conveying ideas through diverse modalities rather than solely relying on text-based interactions. This enrichment not only improves overall communicative efficacy but also enhances the quality of conversational experiences. However, existing methods for dialogue-to-image retrieval face limitations due to the constraints of pre-trained vision language models (VLMs) in comprehending complex dialogues accurately. To address this, we present a novel approach leveraging the robust reasoning capabilities of large language models (LLMs) to generate precise dialogue-associated visual descriptors, facilitating seamless connection with images. Extensive experiments conducted on benchmark data validate the effectiveness of our proposed approach in deriving concise and accurate visual descriptors, leading to significant enhancements in dialogue-to-image retrieval performance. Furthermore, our findings demonstrate the method's generalizability across diverse visual cues, various LLMs, and different datasets, underscoring its practicality and potential impact in real-world applications.
