Where To Look: Focus Regions for Visual Question Answering

Kevin J. Shih; Saurabh Singh; Derek Hoiem

TL;DR

This work tackles visual question answering by learning where to look in an image. It introduces a region-selection mechanism that jointly embeds language and region features into a shared latent space to score question-answer pairs, trained with a margin-based objective over 100 candidate regions per image (including one region covering the whole image). A 4-bin word2vec-based language representation combined with region-weighted CNN features improves accuracy on the MS COCO VQA 18-way multiple-choice task, especially for questions that require precise localization, such as color and room identification. The results demonstrate that explicit region grounding can outperform full-image and language-only baselines, and they point to future directions in counting, reading, and integrating detectors or external knowledge.

Abstract

We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method exhibits significant improvements in answering questions such as "what color," where it is necessary to evaluate a specific location, and "what room," where it selectively identifies informative image regions. Our model is tested on the VQA dataset, which is, to our knowledge, the largest human-annotated visual question answering dataset.
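
The region-selection idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the region and question-answer embeddings have already been projected into the shared space, uses a plain dot product for both relevance and scoring, and uses a fixed-margin hinge loss. All function names (region_weights, attended_feature, qa_score, margin_loss) and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def region_weights(region_embs, text_emb):
    """Relevance of each region to the QA pair, normalized over regions.

    Assumption: relevance is the dot product of already-projected embeddings.
    """
    return softmax([dot(r, text_emb) for r in region_embs])

def attended_feature(region_embs, weights):
    """Weighted average of region embeddings using the relevance weights."""
    dim = len(region_embs[0])
    return [sum(w * r[i] for w, r in zip(weights, region_embs)) for i in range(dim)]

def qa_score(region_embs, text_emb):
    """Score one question-answer pair against the region-weighted image feature.

    Assumption: a dot product stands in for whatever scoring network is used.
    """
    weights = region_weights(region_embs, text_emb)
    return dot(attended_feature(region_embs, weights), text_emb)

def margin_loss(scores, correct_idx, margin=1.0):
    """Hinge loss: each wrong answer should score at least `margin` below the correct one."""
    s_pos = scores[correct_idx]
    return sum(max(0.0, margin + s - s_pos)
               for i, s in enumerate(scores) if i != correct_idx)

# Toy example: three regions (the last standing in for the whole image).
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
qa_embedding = [2.0, 0.0]
weights = region_weights(regions, qa_embedding)
score = qa_score(regions, qa_embedding)
loss = margin_loss([3.0, 1.0, 2.5], correct_idx=0)
```

In the toy example the first region aligns best with the question-answer embedding, so it receives the largest weight. In a multiple-choice setting, qa_score would be evaluated once per candidate answer and the highest-scoring answer selected.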
