BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation

Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Kang Zhang, Yu-Jung Heo, Du-Seong Chang, Chang D. Yoo

TL;DR

This work addresses Multimodal Dialogue Response Generation (MDRG) by tackling the information loss that arises when image history is represented only as text. The proposed BI-MDRG framework bridges image history to both text and image outputs via two mechanisms: (i) a bridging architecture with a multimodal causal attention mask that grounds textual responses in actual image features, and (ii) a Citation Module that augments textual image descriptions with object-citation tags to track object consistency across turns, supported by an inference pipeline that uses a customized text-to-image model. A new Multimodal Dialogue Image Consistency (MDIC) dataset enables explicit evaluation of object consistency across conversations. Experiments on ImageChat, PhotoChat, and MMDialog show that BI-MDRG achieves stronger text generation and image grounding, and significantly improved image consistency (e.g., a DINOv2 score of $0.53$ versus $0.32$ for baselines). The work thus contributes both a framework and a benchmark for measuring image consistency in dialogues, with broader implications for reliable vision-language agents, while acknowledging its reliance on specialized image-generation models as a current limitation.
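
To make mechanism (i) more concrete, the following is a minimal sketch of one way a causal attention mask could be modulated so that ordinary text tokens do not read the textual image descriptions and must instead rely on image features supplied elsewhere (for example through cross-attention). The token labels `"text"` and `"desc"`, the function name `modulated_causal_mask`, and the exact masking rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def modulated_causal_mask(token_kinds):
    """Build a boolean [n, n] mask; True means query i may attend to key j.

    token_kinds holds one hypothetical label per token:
      "text" for ordinary dialogue tokens,
      "desc" for tokens of a textual image description.
    """
    kinds = np.asarray(token_kinds)
    n = len(kinds)

    # Standard causal mask: each token sees itself and earlier tokens.
    causal = np.tril(np.ones((n, n), dtype=bool))

    # Assumed modulation: a "text" query is blocked from "desc" keys,
    # while "desc" queries keep plain causal attention.
    is_desc = kinds == "desc"
    text_query_desc_key = (~is_desc)[:, None] & is_desc[None, :]

    return causal & ~text_query_desc_key


if __name__ == "__main__":
    mask = modulated_causal_mask(["text", "desc", "desc", "text"])
    print(mask.astype(int))
```

Under this assumed rule, the last text token attends to the first text token and to itself but skips both description tokens, whereas description tokens still see the full preceding context.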

Abstract

Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both based on the dialogue context. Due to the lack of a large-scale dataset specifically for this task and the benefits of leveraging powerful pre-trained models, previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach. However, this approach can overlook crucial information about the image, hindering 1) image-grounded text response and 2) consistency of objects in the image response. In this paper, we propose BI-MDRG that bridges the response generation path such that the image history information is utilized for enhanced relevance of text responses to the image content and the consistency of objects in sequential image responses. Through extensive experiments on the multimodal dialogue benchmark dataset, we show that BI-MDRG can effectively increase the quality of multimodal dialogue. Additionally, recognizing the gap in benchmark datasets for evaluating the image consistency in multimodal dialogue, we have created a curated set of 300 dialogues annotated to track object consistency across conversations.

Paper Structure (47 sections, 2 equations, 17 figures, 5 tables, 1 algorithm)

Figures (17)

  • Figure 1: (a) Outlines the framework of previous Multimodal Dialogue Response Generation (MDRG) systems, which use the textual descriptions of images ($u_t$) as an intermediary step toward generating image responses ($r_{t}^{\text{Image}}$). (b) Highlights the limitations of these systems, particularly their failure to fully leverage image history ($r_{1:t-1}^{\text{Image}}$) in crafting both the textual response ($r_{t}^{\text{Text}}$) and the image response ($r_{t}^{\text{Image}}$). (c) Illustrates the consequences of this oversight, including text responses that are not grounded in the image context and image responses that lack consistency.
  • Figure 2: Bridging Image History to the Text Response.
  • Figure 3: Training of BI-MDRG. (a) The Textual Dialogue Response Generator $\mathcal{G}$ cross-attends to the image features from the Visual Encoder $\mathcal{V}$. (b) Attention Mask Modulation alters the causal attention to prioritize image features over textual image descriptions. (c) The Citation Module $\mathcal{C}$ generates citation-augmented textual image descriptions, enabling the tracking of objects within image history for consistency maintenance.
  • Figure 4: Illustration of the Citation Module. The Citation Module recognizes identical objects within image history and injects this information into the textual image description with citation tags (e.g., [cite]0[/cite]).
  • Figure 5: Bridging Image History to the Textual Image Description.
  • ...and 12 more figures
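
The citation tags mentioned in the Figure 4 caption can be illustrated with a small parsing sketch. It assumes that a tag of the form `[cite]N[/cite]` directly follows the object phrase it marks and that the same integer `N` denotes the same object across turns; the tag placement, the example descriptions, and the helper names (`extract_citations`, `strip_citations`, `shared_object_ids`) are hypothetical and not taken from the paper.

```python
import re

# Matches tags such as [cite]0[/cite] and captures the integer object id.
CITE_PATTERN = re.compile(r"\[cite\](\d+)\[/cite\]")


def extract_citations(description):
    """Return the object ids cited in one textual image description."""
    return [int(obj_id) for obj_id in CITE_PATTERN.findall(description)]


def strip_citations(description):
    """Remove citation tags, leaving a plain description."""
    text = CITE_PATTERN.sub("", description)
    return re.sub(r"\s+", " ", text).strip()


def shared_object_ids(descriptions):
    """Map each object id cited in more than one turn to its turn indices."""
    seen = {}
    for turn, desc in enumerate(descriptions):
        for obj_id in sorted(set(extract_citations(desc))):
            seen.setdefault(obj_id, []).append(turn)
    return {obj_id: turns for obj_id, turns in seen.items() if len(turns) > 1}


if __name__ == "__main__":
    # Hypothetical descriptions for two image turns in one dialogue.
    history = [
        "a brown dog[cite]0[/cite] on a couch",
        "a dog[cite]0[/cite] next to a cat[cite]1[/cite]",
    ]
    print(shared_object_ids(history))
    print([strip_citations(d) for d in history])
```

In this toy history, object id 0 appears in both turns, so a downstream image generator would need to keep that object visually consistent, while object id 1 is new in the second turn.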