Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu

TL;DR

This paper addresses the challenge of maintaining explicit attention to relevant visual regions across multi-turn multimodal dialogues. It introduces MMDiag, a benchmark with Everyday, Tabular, and Minigrid scenarios, generated via rule-based methods and GPT-4o-mini to create correlated, multi-turn QA with grounded regions. The proposed DiagNote architecture employs two interacting modules, Deliberate and Gaze, to enable stepwise reasoning and targeted grounding, trained on MMDiag and auxiliary grounding data. Across experiments, DiagNote demonstrates improved grounding and multi-turn reasoning over baselines, especially in complex, multi-region tasks, while revealing challenges in tiny-region grounding and high-resolution inputs. The work advances MLLMs toward human-like visual reasoning in extended dialogues and provides a robust benchmark for future development.

Abstract

Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions; thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.

Paper Structure

This paper contains 30 sections, 3 equations, 27 figures, 9 tables.

Figures (27)

  • Figure 1: Multi-turn multimodal dialogue: (a) Saliency tracking. The MLLM needs to focus on both the red triangle agent and the purple key, which are scattered across the image, to answer the question correctly. (b) Saliency recall. The MLLM needs to retain focus on the region where the agent will stop after the last question.
  • Figure 2: Model architecture of DiagNote. Regions with blue backgrounds represent a deliberation step and the interaction between the Deliberate and Gaze modules. At each turn, the Deliberate module processes the original image, dialogue context, and buffers from both modules. It produces two outputs: (1) a Deliberate step, stored in the Deliberate buffer, and (2) a Gaze query, which is processed by the Gaze module. The resulting bounding boxes are then stored in the Gaze buffer.
  • Figure 3: Comparison for an example of the Minigrid scenario, one of the subsets in MMDiag. We give DiagNote (green) and GPT-4o (orange) the same environmental description and question. DiagNote focuses on the key regions and gives the correct reasoning process and the final answer. In contrast, GPT-4o fails to locate the object and thus gives the wrong answer. Examples for the MMDiag subsets of everyday scenarios and tabular scenes can be found in the appendix.
  • Figure 4: A grounding comparison between Grounding DINO and DiagNote's Gaze module, with the Gaze query "pink and white sign". In (a), the red bounding box represents the ground-truth answer, while the blue one indicates the output generated by the Gaze module in DiagNote. In (b), the red bounding boxes show the outputs produced by Grounding DINO.
  • Figure 5: The first example prompt for generating data samples in everyday scenes.
  • ...and 22 more figures
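
The interaction described in the Figure 2 caption can be sketched as a small control loop. This is only an illustration of the data flow (Deliberate step and Gaze query out, bounding boxes back into a buffer), not the authors' implementation. The names `NoteBuffers`, `run_turn`, `toy_deliberate`, and `toy_gaze` are hypothetical, both modules are stand-in callables rather than real models, and the stopping rule (a `None` Gaze query or a step cap) and the per-call reset of the buffers are assumptions that the caption does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A bounding box as (x1, y1, x2, y2); the coordinate convention is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class NoteBuffers:
    """Hypothetical container for the two buffers named in the caption."""
    deliberate: List[str] = field(default_factory=list)  # reasoning steps so far
    gaze: List[List[Box]] = field(default_factory=list)  # boxes per Gaze query


# Stand-in signatures: a real system would call model inference here.
DeliberateFn = Callable[[object, List[str], NoteBuffers], Tuple[str, Optional[str]]]
GazeFn = Callable[[object, str], List[Box]]


def run_turn(image: object, dialogue: List[str],
             deliberate_fn: DeliberateFn, gaze_fn: GazeFn,
             max_steps: int = 4) -> NoteBuffers:
    """Alternate Deliberate and Gaze until no further Gaze query is issued."""
    buffers = NoteBuffers()  # assumption: buffers start empty for each call
    for _ in range(max_steps):  # assumption: cap on the number of steps
        # Deliberate sees the image, the dialogue context, and both buffers.
        step, gaze_query = deliberate_fn(image, dialogue, buffers)
        buffers.deliberate.append(step)
        if gaze_query is None:  # assumption: None signals "ready to answer"
            break
        # Gaze grounds the query; its boxes go into the Gaze buffer.
        buffers.gaze.append(gaze_fn(image, gaze_query))
    return buffers


def toy_deliberate(image, dialogue, buffers):
    """Toy module: ask for one region, then stop."""
    if not buffers.gaze:
        return "Locate the purple key first.", "purple key"
    return "The key is located; answer the question.", None


def toy_gaze(image, query):
    """Toy module: always return one fixed box."""
    return [(0.1, 0.2, 0.3, 0.4)]


notes = run_turn(None, ["Where is the purple key?"], toy_deliberate, toy_gaze)
print(notes.deliberate)
print(notes.gaze)
```

With the toy modules, the loop records two Deliberate steps and one set of boxes. Swapping in real modules would only require callables that match the two stand-in signatures.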