Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources

Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, Anton van den Hengel

TL;DR

This work tackles open-ended visual question answering by integrating an image-derived textual representation with external knowledge from a large knowledge base (KB). It introduces a three-stream representation (attribute-based, caption-based, and KB-derived) that is merged via a multi-input LSTM encoder-decoder to generate natural-language answers. The approach achieves state-of-the-art results on Toronto COCO-QA and strong performance on the VQA dataset, demonstrating that external knowledge substantially improves answers to questions that require information beyond the image. The method emphasizes generality and the potential for deeper scene understanding with larger, more informative KBs, and suggests future work on generating KB queries tailored to the image-question content.

Abstract

We propose a method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. This allows more complex questions to be answered using the predominant neural network-based approach than has previously been possible. It particularly allows questions to be asked about the contents of an image, even when the image itself does not contain the whole answer. The method constructs a textual representation of the semantic content of an image, and merges it with textual information sourced from a knowledge base, to develop a deeper understanding of the scene viewed. Priming a recurrent neural network with this combined information, and the submitted question, leads to a very flexible visual question answering approach. We are specifically able to answer questions posed in natural language that refer to information not contained in the image. We demonstrate the effectiveness of our model on two publicly available datasets, Toronto COCO-QA and MS COCO-VQA, and show that it produces the best reported results in both cases.

Paper Structure

This paper contains 15 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: A real case of question answering based on an internal textual representation and external knowledge. All of the attributes, textual representation, knowledge and answer are produced by our VQA model. Underlined words indicate the information required to answer the question.
  • Figure 2: Our proposed framework: given an image, a CNN is first applied to produce the attribute-based representation ${V_{att}}(I)$. The internal textual representation is made up of image captions generated based on the image attributes. The hidden state of the caption-LSTM after it has generated the last word in each caption is used as that caption's vector representation. These vectors are then aggregated as ${V_{cap}}(I)$ with average-pooling. The external knowledge is mined from the KB (in this case DBpedia) and the responses are encoded by Doc2Vec, which produces a vector ${V_{know}}(I)$. The 3 vectors $\mathbf{V}$ are combined into a single representation of scene content, which is input to the VQA LSTM model that interprets the question and generates an answer.
  • Figure 3: Examples of predicted attributes and generated captions.
  • Figure 4: An example SPARQL query for the attribute 'dog'. The mined text-based knowledge is shown below the query.
  • Figure 5: Performance on five question categories for different models. The 'Object' category is the average accuracy of question types starting with 'what kind/type/sport/animal/brand...'.
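
The data flow described in the captions of Figures 2 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the vectors are plain Python lists standing in for CNN, LSTM, and Doc2Vec outputs, concatenation is assumed as the way the three vectors are "combined" (the caption does not say how), and the SPARQL text is an assumed example of a label-to-comment lookup rather than the query shown in Figure 4.

```python
from typing import List

Vector = List[float]


def average_pool(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors (e.g. one per caption)."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def combine_scene_vectors(v_att: Vector, v_cap: Vector, v_know: Vector) -> Vector:
    """Assumption: combine the three representations by concatenation."""
    return list(v_att) + list(v_cap) + list(v_know)


def build_comment_query(attribute: str) -> str:
    """Hypothetical SPARQL query asking a KB for descriptive text on one attribute."""
    label = attribute.capitalize()
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        "SELECT DISTINCT ?comment WHERE { "
        f'?entry rdfs:label "{label}"@en . '
        "?entry rdfs:comment ?comment . }"
    )


# Toy stand-ins: two caption hidden states, three attribute scores, one KB feature.
caption_states = [[0.0, 2.0], [2.0, 4.0]]
v_cap = average_pool(caption_states)
scene = combine_scene_vectors([0.5, 0.25, 0.0], v_cap, [0.5])
query = build_comment_query("dog")
```

In this sketch `scene` plays the role of the single scene representation that would be passed, together with the question, to the answer-generating LSTM; a real system would replace the toy lists with learned embeddings.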