Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations
Wei Pang, Ruixue Duan, Jinfu Yang, Ning Li
TL;DR
This work addresses a gap in Visual Dialog: prior methods neglect the round-level information flow in the dialog history. It introduces Multi-round Dialogue State Tracking (MDST), which maintains a $2$-tuple vision-language dialogue state. The language state is updated across rounds while the vision state stays fixed, and the resulting state grounds each current question so that answers are accurate and coherent. Grounding relies on object-entity alignment and a switching probability that fuses history cues with current cues; a Transformer-based encoder-decoder generates the answer, and a postdiction step updates the language state. On VisDial v1.0, MDST achieves state-of-the-art generative results (including a validation JACC of $79.8\%$). Human studies further show that it produces longer, more human-like answers that remain correct across rounds, all without large-scale pretraining or extra datasets.
Abstract
Visual Dialog (VD) is a task in which an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a simple text input, disregarding the inherent conversational information flows at the round level. In this paper, we introduce the Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging the dialogue state learned from the dialog history to answer questions. MDST captures each round of dialog history and constructs internal dialogue state representations defined as 2-tuples of vision-language representations. These representations ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves new state-of-the-art performance in the generative setting. Furthermore, through a series of human studies, we validate the effectiveness of MDST in generating long, consistent, and human-like answers while consistently answering a series of questions correctly.
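
To make the round-level state tracking summarized above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the dialogue state is a 2-tuple of feature lists, that the switching probability is a scalar sigmoid gate over the concatenated history and current cues, and that postdiction can be approximated by appending a question-answer feature to the language state. The names `DialogueState`, `switch_fuse`, `postdict`, `gate_weights`, and `gate_bias` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class DialogueState:
    """Assumed 2-tuple state: (vision, language)."""
    vision: List[Vector]    # object features, kept fixed across rounds
    language: List[Vector]  # entity features, updated after every round


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def switch_fuse(history_cue: Vector, current_cue: Vector,
                gate_weights: Vector, gate_bias: float = 0.0
                ) -> Tuple[float, Vector]:
    """Fuse two cues with a scalar switching probability p (an assumption).

    p is computed from the concatenated cues; the fused cue is
    p * history_cue + (1 - p) * current_cue.
    """
    if len(history_cue) != len(current_cue):
        raise ValueError("cues must have the same dimension")
    concat = history_cue + current_cue
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated cue size")
    logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    p = sigmoid(logit)
    fused = [p * h + (1.0 - p) * c for h, c in zip(history_cue, current_cue)]
    return p, fused


def postdict(state: DialogueState, qa_feature: Vector) -> DialogueState:
    """Hypothetical postdiction: extend the language state, reuse vision as-is."""
    return DialogueState(vision=state.vision,
                         language=state.language + [qa_feature])


# One round: fuse the cues, then update only the language state.
state = DialogueState(vision=[[1.0, 0.0]], language=[])
p, fused = switch_fuse([1.0, 0.0], [0.0, 1.0], gate_weights=[0.0] * 4)
state = postdict(state, fused)
```

With all-zero gate weights the gate is indifferent, so the fused cue is the plain average of the two inputs. In a trained model the gate parameters would be learned, letting the model lean on history for context-dependent questions and on the current question otherwise.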
