TallyQA: Answering Complex Counting Questions

Manoj Acharya, Kushal Kafle, Christopher Kanan

TL;DR

This work splits open-ended counting in VQA into simple vs complex questions and introduces TallyQA as a large dataset designed to probe both capabilities. It presents the Relational Counting Network (RCN), a two-branch relation-network architecture that reasons over region proposals and background patches to count objects under complex relations. RCN achieves state-of-the-art performance on HowMany-QA and on both Test-Simple and Test-Complex splits of TallyQA, with ablations highlighting the value of incorporating background context and spatial relations. The paper also details dataset construction, annotation procedures, and visualization analyses to illuminate how relational reasoning improves counting in natural scenes, setting a path for future improvements in open-ended counting.

Abstract

Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do this, we created TallyQA, the world's largest dataset for open-ended counting. We propose a new algorithm for counting that uses relation networks with region proposals. Our method lets relation networks be efficiently used with high-resolution imagery. It yields state-of-the-art results compared to baseline and recent systems on both TallyQA and the HowMany-QA benchmark.
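The idea of running relation networks over region proposals can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature sizes, the random weights, the two-layer MLPs, the merging of the two branches by addition, and the treatment of counting as classification over 0 to 15 are all assumptions made for illustration. Only the overall structure (pairwise relations among foreground proposals, plus relations between foreground proposals and background patches, conditioned on the question) follows the description above.

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer perceptron with a ReLU in between (hypothetical stand-in)."""
    return np.maximum(x @ w1, 0.0) @ w2

def pairwise_relations(a, b, q, w1, w2):
    """Sum g([a_i, b_j, q]) over every pair (i, j)."""
    n, m = a.shape[0], b.shape[0]
    left = np.repeat(a, m, axis=0)    # each a_i repeated m times: (n*m, d)
    right = np.tile(b, (n, 1))        # b cycled n times: (n*m, d)
    ques = np.tile(q, (n * m, 1))     # question vector per pair: (n*m, dq)
    pairs = np.concatenate([left, right, ques], axis=1)
    return mlp(pairs, w1, w2).sum(axis=0)

def relational_count_scores(fg, bg, q, params):
    """Score each candidate count from two relation branches."""
    fg_fg = pairwise_relations(fg, fg, q, *params["g_fg"])  # proposal-proposal
    fg_bg = pairwise_relations(fg, bg, q, *params["g_bg"])  # proposal-background
    # Assumption: the two branch outputs are merged by simple addition.
    return mlp((fg_fg + fg_bg)[None, :], *params["f"])[0]

def init_params(d, dq, hidden, num_classes, seed=0):
    """Random, untrained weights; sizes are illustrative only."""
    rng = np.random.default_rng(seed)

    def two_layers(n_in, n_out):
        return (rng.normal(0.0, 0.1, (n_in, hidden)),
                rng.normal(0.0, 0.1, (hidden, n_out)))

    return {
        "g_fg": two_layers(2 * d + dq, hidden),
        "g_bg": two_layers(2 * d + dq, hidden),
        "f": two_layers(hidden, num_classes),
    }

# Toy inputs: 5 foreground proposals, 4 background patches, one question vector.
rng = np.random.default_rng(1)
fg = rng.normal(size=(5, 8))
bg = rng.normal(size=(4, 8))
q = rng.normal(size=(6,))

params = init_params(d=8, dq=6, hidden=16, num_classes=16)
scores = relational_count_scores(fg, bg, q, params)
predicted_count = int(np.argmax(scores))  # meaningless until weights are trained
```

Because the pair outputs are summed, the scores do not depend on the order in which proposals are listed. The pair count grows as N squared in the number of foreground proposals, and as N times M for the background branch, so the cost is governed by how many regions are kept rather than by the image resolution.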

Paper Structure

This paper contains 24 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Counting datasets consist mostly of simple questions (top) that can be answered solely using object detection. We study complex counting questions (bottom) that require more than object detection using our new TallyQA dataset.
  • Figure 2: Histogram of answer counts for each of the three splits of TallyQA.
  • Figure 3: Our RCN model computes the relationships between foreground regions as well as the relationships between these regions and the background to efficiently answer complex counting questions. In this example, the system needs to look at the relationship of each giraffe to the other giraffes and to the water (background).
  • Figure 4: Example model outputs on TallyQA. While other models fail at positional reasoning questions (e.g., Fig. 4c), RCN can infer an object's position relative to other objects. Since RCN is based on region proposals, it struggles when proposals do not align with question-relevant objects (Fig. 4f).
  • Figure 5: Modified Grad-CAM visualizations show where RCN is looking to make predictions. The importance of each object proposal is proportional to the color intensity of its bounding box.