Dynamic Memory Networks for Visual and Textual Question Answering

Caiming Xiong, Stephen Merity, Richard Socher

TL;DR

This work extends Dynamic Memory Networks (DMN) to question answering over both text and images, without requiring supporting facts to be labeled during training. The resulting model, DMN+, makes four changes: an input fusion layer for text, a dedicated input module for images, an attention-based GRU in the episodic memory, and an untied memory update with a ReLU. Together these let the model learn to select relevant facts without explicit annotation. DMN+ reports state-of-the-art results on both the bAbI-10k text QA tasks and the Visual Question Answering (VQA) dataset, showing that the architecture carries over from language to vision and can reason over facts scattered across the input. The results suggest that strong input representations, attention over facts, and flexible memory updates are sufficient for cross-modal QA without fact-level supervision, and may extend to other multi-modal reasoning tasks.

Abstract

Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering. One such architecture, the dynamic memory network (DMN), obtained high accuracy on a variety of language tasks. However, it was not shown whether the architecture achieves strong results for question answering when supporting facts are not marked during training or whether it could be applied to other modalities such as images. Based on an analysis of the DMN, we propose several improvements to its memory and input modules. Together with these changes we introduce a novel input module for images in order to be able to answer visual questions. Our new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset without supporting fact supervision.

Paper Structure

This paper contains 16 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Question Answering over text and images using a Dynamic Memory Network.
  • Figure 2: The input module with a "fusion layer", where the sentence reader encodes the sentence and the bi-directional GRU allows information to flow between sentences.
  • Figure 3: VQA input module to represent images for the DMN.
  • Figure 4: The episodic memory module of the DMN+ when using two passes. The $\overleftrightarrow{F}$ is the output of the input module.
  • Figure 5: (a) The traditional GRU model, and (b) the proposed attention-based GRU model.
  • ...and 1 more figure
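
The summary above names an attention-based GRU and an untied ReLU memory update but gives no equations, so the following is only a minimal sketch of how such an episodic memory pass might look. It assumes a common formulation: one softmax-normalised scalar gate per fact, a GRU whose update gate is replaced by that scalar, and a fresh weight matrix for the memory update on every pass. All function names, dimensions, and the random weights are hypothetical, and sharing the attention and GRU weights across passes is an assumption made for brevity, not something stated here.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_gates(facts, q, m, W1, b1, w2, b2):
    """One scalar gate per fact, normalised with a softmax over all facts."""
    scores = []
    for f in facts:
        # Interaction features between the fact, the question and the memory.
        z = ([a * b for a, b in zip(f, q)] + [a * b for a, b in zip(f, m)]
             + [abs(a - b) for a, b in zip(f, q)]
             + [abs(a - b) for a, b in zip(f, m)])
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, z), b1)]
        scores.append(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return softmax(scores)


def attention_gru(facts, gates, Wr, Ur, br, W, U, bh):
    """GRU whose update gate is replaced by the scalar attention gate."""
    h = [0.0] * len(bh)
    for f, g in zip(facts, gates):
        r = [sigmoid(a + b + c)
             for a, b, c in zip(matvec(Wr, f), matvec(Ur, h), br)]
        cand = [math.tanh(a + ri * b + c)
                for a, ri, b, c in zip(matvec(W, f), r, matvec(U, h), bh)]
        # g == 0 leaves h untouched; g == 1 overwrites it with the candidate.
        h = [g * c + (1.0 - g) * hp for c, hp in zip(cand, h)]
    return h


def memory_update(m_prev, context, q, Wt, bt):
    """ReLU over the concatenation [m_prev; context; q]."""
    concat = m_prev + context + q
    return [max(0.0, a + b) for a, b in zip(matvec(Wt, concat), bt)]


def episodic_memory(facts, q, shared, per_pass):
    m = list(q)  # assumption: memory starts as the question vector
    for Wt, bt in per_pass:  # a different (Wt, bt) on each pass: "untied"
        gates = attention_gates(facts, q, m, *shared["att"])
        context = attention_gru(facts, gates, *shared["gru"])
        m = memory_update(m, context, q, Wt, bt)
    return m, gates


# Toy run with hypothetical sizes and random weights (illustration only).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


d, k = 4, 5  # fact/question size and attention hidden size
shared = {
    "att": (rand_matrix(k, 4 * d), [0.0] * k, rand_matrix(1, k)[0], 0.0),
    "gru": (rand_matrix(d, d), rand_matrix(d, d), [0.0] * d,
            rand_matrix(d, d), rand_matrix(d, d), [0.0] * d),
}
per_pass = [(rand_matrix(d, 3 * d), [0.0] * d) for _ in range(2)]  # two passes
facts = [rand_matrix(1, d)[0] for _ in range(3)]
question = rand_matrix(1, d)[0]

memory, gates = episodic_memory(facts, question, shared, per_pass)
print("final-pass gates:", gates)
print("memory:", memory)
```

Because the gate is a single scalar per fact, a fact with a gate of zero passes the previous hidden state through unchanged. That is how a model of this kind can ignore irrelevant facts without being told which ones are supporting.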