Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog

Zhe Gan, Yu Cheng, Ahmed El Kholy, Linjie Li, Jingjing Liu, Jianfeng Gao

TL;DR

We address visual dialog by introducing ReDAN, a Recurrent Dual Attention Network that performs multi-step reasoning over image and dialog history. The model builds visual and textual memories, iteratively attends to them, and progressively updates a question representation to refine understanding before decoding answers. Key contributions include a memory-based multimodal reasoning framework, a Multimodal Factorized Bilinear fusion for integration, and a rank-aggregation strategy that combines discriminative and generative decoders, achieving state-of-the-art 64.47% NDCG on VisDial v1.0 and up to 67.12% with ensembles. The approach demonstrates how iterative reasoning yields sharper attention and more accurate answers, with implications for robust multimodal QA and dialog systems.

Abstract

This paper presents a new model for visual dialog, the Recurrent Dual Attention Network (ReDAN), which uses multi-step reasoning to answer a series of questions about an image. In each question-answering turn of a dialog, ReDAN infers the answer progressively through multiple reasoning steps. In each step of the reasoning process, the semantic representation of the question is updated based on the image and the previous dialog history, and the recurrently refined representation is used for further reasoning in the subsequent step. On the VisDial v1.0 dataset, the proposed ReDAN model achieves a new state-of-the-art NDCG score of 64.47%. Visualization of the reasoning process further demonstrates that ReDAN can locate context-relevant visual and textual clues via iterative refinement, which can lead to the correct answer step by step.
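The reasoning loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: plain dot-product attention and a simple averaging update stand in for ReDAN's learned attention and recurrent update, and all names (`attend`, `multi_step_reasoning`, `visual_memory`, `textual_memory`) and the toy vectors are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Dot-product attention (an assumption, not the paper's learned attention).

    Returns the attention weights and the weighted sum of the memory rows.
    """
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    weights = softmax(scores)
    context = [
        sum(w * row[d] for w, row in zip(weights, memory))
        for d in range(len(query))
    ]
    return weights, context


def multi_step_reasoning(question, visual_memory, textual_memory, num_steps=2):
    """Refine the question vector over several steps of dual attention."""
    query = list(question)
    trace = []
    for _ in range(num_steps):
        v_weights, v_ctx = attend(query, visual_memory)
        t_weights, t_ctx = attend(query, textual_memory)
        # Hypothetical update rule: average the query with both contexts.
        # The paper uses a learned recurrent update instead.
        query = [(q + v + t) / 3.0 for q, v, t in zip(query, v_ctx, t_ctx)]
        trace.append((v_weights, t_weights))
    return query, trace


# Toy 2-dimensional example with made-up feature vectors.
question = [1.0, 0.0]
visual_memory = [[1.0, 0.0], [0.0, 1.0]]    # e.g. two image regions
textual_memory = [[0.5, 0.5], [1.0, 0.0]]   # e.g. two dialog-history snippets
final_query, trace = multi_step_reasoning(
    question, visual_memory, textual_memory, num_steps=2
)
```

Each entry of `trace` holds the visual and textual attention weights for one reasoning step, the same quantities the paper's figures visualize as numbers on bounding boxes and history snippets.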

Paper Structure

This paper contains 36 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Model architecture and visualization of the learned multi-step reasoning strategies. In the first step, ReDAN focuses on all relevant objects in the image (e.g., "boy", "shorts") and all relevant facts in the dialog history (e.g., "young boy", "playing tennis", "black hair"). In the second step, the model narrows down to more context-relevant regions and dialog context (i.e., the attention maps become sharper), which leads to the final correct answer ("yes"). The numbers in the bounding boxes and in the histograms are the attention weights of the corresponding objects or dialog history snippets.
  • Figure 2: Model architecture of the Recurrent Dual Attention Network for visual dialog. See the method section for details.
  • Figure 3: Visualization of learned attention maps in multiple reasoning steps.
  • Figure 4: Visualization of learned attention maps using 2 reasoning steps.
  • Figure 5: Visualization of learned attention maps using 3 reasoning steps.