TallyQA: Answering Complex Counting Questions
Manoj Acharya, Kushal Kafle, Christopher Kanan
TL;DR
This work divides open-ended counting questions in VQA into simple and complex types and introduces TallyQA, a large dataset designed to probe both. It presents the Relational Counting Network (RCN), a two-branch relation-network architecture that reasons over region proposals and background patches to count objects in questions involving complex relations. RCN achieves state-of-the-art performance on HowMany-QA and on both the Test-Simple and Test-Complex splits of TallyQA, and ablations show the value of incorporating background context and spatial relations. The paper also describes dataset construction and annotation procedures, along with visualizations that illustrate how relational reasoning improves counting in natural scenes.
Abstract
Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do this, we created TallyQA, the world's largest dataset for open-ended counting. We propose a new counting algorithm that uses relation networks with region proposals. Our method enables relation networks to be used efficiently with high-resolution imagery. It yields state-of-the-art results compared to baselines and recent systems on both TallyQA and the HowMany-QA benchmark.
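
As a rough illustration of why pairing relation networks with region proposals can be cheaper than pairing every cell of a feature grid, the sketch below sums a pairwise relation function over object pairs and counts how many pairs each setup evaluates. It is a minimal sketch under stated assumptions, not the authors' implementation: the names relation_sum, num_pairs_grid, num_pairs_proposals, and toy_g are hypothetical, the split into foreground-foreground and foreground-background pairs is an assumed reading of the two-branch design, and all sizes are illustrative rather than taken from the paper.

```python
from itertools import product
from typing import Callable, Sequence

Vector = Sequence[float]


def relation_sum(
    objects_a: Sequence[Vector],
    objects_b: Sequence[Vector],
    question: Vector,
    g: Callable[[Vector, Vector, Vector], float],
) -> float:
    """Sum a pairwise relation function g over every (a, b) pair."""
    return sum(g(a, b, question) for a, b in product(objects_a, objects_b))


def num_pairs_grid(side: int) -> int:
    """Pairs evaluated when every cell of a side x side grid is an object."""
    cells = side * side
    return cells * cells


def num_pairs_proposals(n_foreground: int, n_background: int) -> int:
    """Pairs for an assumed foreground-foreground branch plus an assumed
    foreground-background branch."""
    return n_foreground * n_foreground + n_foreground * n_background


def toy_g(a: Vector, b: Vector, q: Vector) -> float:
    # Hypothetical stand-in for a learned network: the dot product of the
    # two object vectors, scaled by the first question feature.
    return q[0] * sum(x * y for x, y in zip(a, b))


# Illustrative inputs only: two foreground objects, one background patch.
foreground = [[1.0, 0.0], [0.0, 1.0]]
background = [[1.0, 1.0]]
question = [2.0]

fg_fg = relation_sum(foreground, foreground, question, toy_g)
fg_bg = relation_sum(foreground, background, question, toy_g)
print(fg_fg, fg_bg)

# Pair counts for a hypothetical 14 x 14 grid versus 32 foreground
# proposals and 16 background patches.
print(num_pairs_grid(14), num_pairs_proposals(32, 16))
```

With these hypothetical sizes, the grid requires 38416 pairwise evaluations while the proposal-based setup requires 1536. The pair count of a grid grows with the fourth power of its side length, which is why image resolution becomes a bottleneck when every cell is treated as an object.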
