Ask Your Neurons: A Deep Learning Approach to Visual Question Answering

Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

TL;DR

This work tackles visual question answering by proposing Ask Your Neurons, an end-to-end encoder–decoder framework that conditions on both image and question to generate answer sequences. The authors explore a modular, jointly trained architecture with several question encoders (LSTM, GRU, BOW, CNN), multiple visual encoders, and several multimodal fusion and decoding strategies. They extend the DAQUAR dataset to DAQUAR-Consensus to study inter-human agreement, introduce two consensus metrics, and provide a human baseline that measures how much information is contained in the language part alone. The approach is further evaluated on the large-scale VQA dataset, where it performs strongly while still using a global image representation, and the analysis covers design choices such as question encoding and fusion strategies. Overall, the paper highlights the importance of stronger visual models and language–vision integration, and contributes the DAQUAR-Consensus answers and metrics for studying human agreement.

Abstract

We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation of this problem. In contrast to previous efforts, we face a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We provide additional insights into the problem by analyzing how much information is contained in the language part alone, for which we provide a new human baseline. To study human consensus, which is related to the ambiguities inherent in this challenging task, we propose two novel metrics and collect additional answers that extend the original DAQUAR dataset to DAQUAR-Consensus. Moreover, we extend our analysis to VQA, a large-scale dataset for question answering about images, where we investigate particular design choices and show the importance of stronger visual models. At the same time, our model achieves strong performance while still using a global image representation. Finally, based on this analysis, we refine Ask Your Neurons on DAQUAR, which also leads to better performance on this challenging task.

Paper Structure

This paper contains 67 sections, 9 equations, 9 figures, 33 tables.

Figures (9)

  • Figure 1: Our approach, Ask Your Neurons, to question answering with a Recurrent Neural Network using Long Short-Term Memory (LSTM). To answer a question about an image, we feed both the image (CNN features) and the question (green boxes) into the LSTM. After the (variable-length) question is encoded, we generate the answers (multiple words, orange boxes). During the answer generation phase, the previously predicted answers are fed into the LSTM until the $\langle$END$\rangle$ symbol is predicted. See \ref{sec:iccvArch} for more details.
  • Figure 2: Our approach, Ask Your Neurons. See \ref{sec:method} for details.
  • Figure 3: LSTM unit. See \ref{sec:LSTM}, Equations (\ref{eq:i})-(\ref{eq:h}) for details.
  • Figure 4: Our Refined Ask Your Neurons architecture for answering questions about images that includes the following modules: visual and question encoders, and answer decoder. A multimodal embedding $C$ combines both encodings into a joint space that the decoder decodes from. See \ref{sec:alternative_approaches} for details.
  • Figure 5: CNN for encoding the question that convolves word embeddings (learnt or pre-trained) with different kernels; the second and third views are shown. See \ref{sec:question_cnn} and \cite{yang2015stacked} for details.
  • ...and 4 more figures
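
As a rough illustration of the encode-then-decode loop described in the Figure 1 caption, the following is a minimal pure-Python sketch. It is not the authors' implementation: the toy vocabulary, the dimensions, the random untrained weights, the greedy argmax decoding, and the choice to concatenate the image features with each word's one-hot vector are all assumptions made for the example.

```python
import math
import random

# Hypothetical toy vocabulary; index 0 plays the role of the <END> symbol.
VOCAB = ["<END>", "what", "is", "on", "the", "table", "chair", "lamp"]
END = 0
HIDDEN = 8                         # size of the LSTM state (assumed)
IMG_DIM = 4                        # size of the stand-in CNN feature vector (assumed)
INPUT_DIM = IMG_DIM + len(VOCAB)   # image features concatenated with a one-hot word

rng = random.Random(0)             # fixed seed so the sketch is deterministic


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Untrained random weights: one matrix per gate, acting on [x_t ; h_{t-1}].
GATES = {name: rand_matrix(HIDDEN, INPUT_DIM + HIDDEN) for name in "ifog"}
W_OUT = rand_matrix(len(VOCAB), HIDDEN)   # maps the hidden state to word scores


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lstm_step(x, h, c):
    """One standard LSTM update (biases omitted for brevity)."""
    z = x + h  # list concatenation: [x_t ; h_{t-1}]
    i = [sigmoid(a) for a in matvec(GATES["i"], z)]    # input gate
    f = [sigmoid(a) for a in matvec(GATES["f"], z)]    # forget gate
    o = [sigmoid(a) for a in matvec(GATES["o"], z)]    # output gate
    g = [math.tanh(a) for a in matvec(GATES["g"], z)]  # candidate cell input
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def one_hot(idx):
    return [1.0 if k == idx else 0.0 for k in range(len(VOCAB))]


def answer(image_feat, question_ids, max_len=5):
    """Encode the question, then greedily decode words until <END> or max_len."""
    h = [0.0] * HIDDEN
    c = [0.0] * HIDDEN
    # Encoding phase: feed the image features together with each question word.
    for w in question_ids:
        h, c = lstm_step(image_feat + one_hot(w), h, c)
    # Decoding phase: feed each predicted word back in as the next input.
    out = []
    for _ in range(max_len):
        scores = matvec(W_OUT, h)
        w = max(range(len(VOCAB)), key=scores.__getitem__)
        if w == END:
            break
        out.append(w)
        h, c = lstm_step(image_feat + one_hot(w), h, c)
    return [VOCAB[w] for w in out]


if __name__ == "__main__":
    image = [0.1, 0.2, 0.3, 0.4]  # stand-in for CNN features
    question = [VOCAB.index(w) for w in "what is on the table".split()]
    print(answer(image, question))
```

Because the weights are random, the printed words are meaningless; the sketch only shows the control flow, in which the same recurrent unit first consumes the question and then generates a variable-length answer that stops at the end symbol.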