Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu
TL;DR
This paper addresses the challenge of maintaining explicit attention to relevant visual regions across multi-turn multimodal dialogues. It introduces MMDiag, a benchmark with Everyday, Tabular, and Minigrid scenarios, generated via rule-based methods and GPT-4o-mini to create correlated, multi-turn QA with grounded regions. The proposed DiagNote architecture employs two interacting modules, Deliberate and Gaze, to enable stepwise reasoning and targeted grounding, trained on MMDiag and auxiliary grounding data. Across experiments, DiagNote demonstrates improved grounding and multi-turn reasoning over baselines, especially in complex, multi-region tasks, while revealing challenges in tiny-region grounding and high-resolution inputs. The work advances MLLMs toward human-like visual reasoning in extended dialogues and provides a robust benchmark for future development.
Abstract
Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown strong capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn visual question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. The dataset is generated jointly through deliberately designed rules and GPT assistance, and it features strong correlations between questions, between questions and images, and among different image regions, thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and poses additional challenges for the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules, Deliberate and Gaze, which interact with each other throughout multi-turn dialogues to perform Chain-of-Thought reasoning and annotation, respectively. We empirically demonstrate the advantages of DiagNote over existing MLLMs both in grounding and in jointly processing and reasoning over vision and language information.
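To make the interaction between the two modules more concrete, the following is a minimal sketch of how a Deliberate step and a Gaze step might alternate within one dialogue turn while sharing notes across turns. It is an illustration under stated assumptions, not the authors' implementation: the names `Note`, `DialogueState`, and `answer_turn`, the `(text, done)` return convention, the bounding-box format, and the `max_steps` cap are all hypothetical, and the two modules are passed in as plain callables rather than real models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical box format (x1, y1, x2, y2); the abstract does not specify one.
Box = Tuple[float, float, float, float]


@dataclass
class Note:
    """One reasoning step paired with the image region it refers to."""
    thought: str
    box: Box


@dataclass
class DialogueState:
    """Notes accumulated over the whole dialogue, shared across turns."""
    notes: List[Note] = field(default_factory=list)


# Assumed interfaces: deliberate returns (text, done); gaze returns a box.
Deliberate = Callable[[str, List[Note]], Tuple[str, bool]]
Gaze = Callable[[str, List[Note]], Box]


def answer_turn(question: str, state: DialogueState,
                deliberate: Deliberate, gaze: Gaze,
                max_steps: int = 4) -> str:
    """Alternate reasoning and grounding until deliberate signals it is done."""
    for _ in range(max_steps):
        text, done = deliberate(question, state.notes)
        if done:
            return text  # text is the final answer for this turn
        # Otherwise text is an intermediate thought; ground it in the image.
        box = gaze(text, state.notes)
        state.notes.append(Note(thought=text, box=box))
    # Step budget exhausted: return whatever deliberate produces next.
    return deliberate(question, state.notes)[0]
```

Because `DialogueState` persists between calls, regions noted in an earlier turn remain available when a later, correlated question is asked, which is the property the multi-turn setting is meant to exercise.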
