Where To Look: Focus Regions for Visual Question Answering

Kevin J. Shih, Saurabh Singh, Derek Hoiem

TL;DR

This work tackles visual question answering by learning where to look in an image. It introduces a region-selection mechanism that jointly embeds language and region features into a shared latent space to score QA pairs, using a margin-based objective and 100 region candidates (including a whole-image region). A 4-bin word2vec-based language representation and region-weighted CNN features yield strong improvements on the MS COCO VQA 18-way multiple-choice task, especially for questions requiring precise localization such as color and room identification. The results demonstrate explicit region grounding can outperform full-image and language-only baselines and point to future directions in counting, reading, and integrating detectors or external knowledge.

Abstract

We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method exhibits significant improvements in answering questions such as "what color," where it is necessary to evaluate a specific location, and "what room," where it selectively identifies informative image regions. Our model is tested on the VQA dataset, which is, to our knowledge, the largest human-annotated visual question answering dataset.
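
The region-selection idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the projection matrices `w_lang` and `w_region`, the dot-product relevance, the softmax normalization, and the pairwise hinge form of the margin-based objective are all assumptions made for illustration, and every function name here is hypothetical. The sketch projects a text vector and each region feature into a shared space, turns their similarities into per-region weights, and pools region features by those weights.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [dot(row, vec) for row in matrix]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def region_weights(lang_vec, region_feats, w_lang, w_region):
    """Relevance weight of each candidate region for one question-answer pair.

    Assumption: relevance is the dot product of the projected text vector
    and the projected region feature, normalized with a softmax.
    """
    q = matvec(w_lang, lang_vec)
    logits = [dot(q, matvec(w_region, r)) for r in region_feats]
    return softmax(logits)


def attended_feature(weights, region_feats):
    """Weighted average of region features under the given weights."""
    dim = len(region_feats[0])
    return [sum(w * r[i] for w, r in zip(weights, region_feats))
            for i in range(dim)]


def hinge_loss(score_correct, score_wrong, margin=1.0):
    """Pairwise margin loss (assumed form): zero once the correct answer
    outscores the wrong one by at least `margin`."""
    return max(0.0, margin + score_wrong - score_correct)


# Toy data: 3 regions with 2-d features and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
regions = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
weights = region_weights([1.0, 0.0], regions, identity, identity)
pooled = attended_feature(weights, regions)
```

In this toy setup the first region aligns with the text vector, so it receives most of the weight and dominates the pooled feature. The setting summarized above would use 100 candidate regions, one of which covers the whole image.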

Paper Structure

This paper contains 13 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Our goal is to identify the correct answer for a natural language question, such as "What color is the walk light?" or "Is it raining?" We particularly focus on the problem of learning where to look. This is a challenging problem as it requires grounding language with vision and learning to recognize objects, use relations, and determine relevance. For example, whether it is raining may be determined by detecting the presence of puddles, gray skies, or umbrellas in the scene, whereas the color of the walk light requires focused attention on the light alone. The above figure shows example attention regions produced by our proposed model.
  • Figure 2: Examples from VQA. From left to right, the above examples require focused region information to pinpoint the dots, whole image information to determine the weather, and abstract knowledge regarding relationships between children and stuffed animals.
  • Figure 3: Overview of our network for the example question-answer pairing: "What color is the fire hydrant? Yellow." Question and answer representations are concatenated, fed through the network, then combined with selectively weighted image region features to produce a score.
  • Figure 4: Example parse-based binning of questions. Each bin is represented with the average of the word2vec vectors of its members. Empty bins are represented with a zero-vector.
  • Figure 5: Comparison of attention regions generated by various question-answer pairings for the same question. Each visualization is labeled with its corresponding answer choice and returned confidence. We show the highlighted regions for the top multiple choice answers and some unrelated ones. Notice that in the first example, while the model clearly identified a green region within the image to match the "green" option, the corresponding confidence was significantly lower than that of the correct options, showing that the model does more than just match answer choices with image regions.
  • ...and 3 more figures
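
The binned language representation described in the Figure 4 caption (each bin is the average of its members' word vectors, and an empty bin is a zero vector) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the assignment of words to bins is taken as given (the paper derives it from a parse), the 2-d embeddings are toy stand-ins for word2vec vectors, skipping out-of-vocabulary words is an assumption, and the names `bin_average` and `encode_bins` are hypothetical.

```python
def bin_average(words, embeddings, dim):
    """Average the vectors of the words in one bin.

    An empty bin (or one with no known words) yields a zero vector.
    Skipping out-of-vocabulary words is an assumption of this sketch.
    """
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def encode_bins(bins, embeddings, dim):
    """Concatenate the per-bin averages into one fixed-length vector."""
    encoded = []
    for words in bins:
        encoded.extend(bin_average(words, embeddings, dim))
    return encoded


# Toy 2-d embeddings standing in for word2vec vectors (hypothetical values).
toy_embeddings = {
    "what": [1.0, 0.0],
    "color": [0.0, 1.0],
    "hydrant": [2.0, 2.0],
    "fire": [0.0, 2.0],
}

# Hypothetical assignment of question words to 4 bins; the last bin is empty.
question_bins = [["what", "color"], ["hydrant"], ["fire"], []]
question_vec = encode_bins(question_bins, toy_embeddings, dim=2)
```

Because every bin contributes a vector of the same size, the encoding has a fixed length (here 4 bins × 2 dimensions) regardless of how many words the question contains.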