VQA-Levels: A Hierarchical Approach for Classifying Questions in VQA

Madhuri Latha Madaka, Chakravarthy Bhagvati

TL;DR

The paper addresses the lack of a structured VQA benchmark by introducing VQA-Levels, a seven-level hierarchy that ranges from questions answerable from low-level visual features to questions requiring higher-level semantic reasoning. It defines the levels, presents a pilot dataset of 210 images and 751 questions generated with a UI and human annotators, and grounds the design in Marr's theory and CBIR concepts. Experiments with leading VQA models show strong performance on Level 1–2 questions but much weaker results on scene text (Level 3) and the abstract levels (6–7), suggesting that language priors dominate and that scene-text and visual understanding remain bottlenecks. The work offers a practical benchmarking framework that enables level-wise analysis and can guide future data collection and model development toward robust multimodal reasoning.

Abstract

Designing datasets for Visual Question Answering (VQA) is difficult because the task requires NLP for parsing the question and computer vision for analysing the aspects of the image relevant to answering it. Researchers have developed several benchmark datasets, but there are many issues with using them for methodical performance tests. This paper proposes a new benchmark dataset, of which a pilot version called VQA-Levels is ready now, for testing VQA systems systematically and assisting researchers in advancing the field. The questions are classified into seven levels, ranging from those answerable directly from low-level image features (without needing even a classifier) to those requiring high-level abstraction of the entire image content. Each question exhibits one or more of ten properties and is categorised into a specific level from 1 to 7. Levels 1-3 concern the visual content directly, while the remaining levels require extra knowledge about the objects in the image. Each question generally has a unique one- or two-word answer. The questions are 'natural' in the sense that a human is likely to ask them when seeing the images. An example question at Level 1 is "What is the shape of the red colored region in the image?", while one at Level 7 is "Why is the man cutting the paper?". Initial testing of the proposed dataset on some existing VQA systems reveals that their success is high on Level 1 (low-level features) and Level 2 (object classification) questions, and lowest on Level 3 (scene text), followed by Level 6 (extrapolation) and Level 7 (whole scene analysis) questions. The work in this paper will go a long way towards the systematic analysis of VQA systems.
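
The level-wise analysis described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the record layout (`level`, `answer`, `prediction`), the helper names `normalize` and `accuracy_by_level`, the use of case-insensitive exact match as the scoring rule, and the sample answers.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so short answers compare cleanly."""
    return " ".join(text.lower().split())


def accuracy_by_level(records):
    """Return {level: accuracy} for records tagged with a level from 1 to 7.

    Each record is a dict with hypothetical keys 'level', 'answer' and
    'prediction'. Exact match after normalisation is an assumed scoring
    rule, chosen because answers are generally one or two words.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        level = record["level"]
        if not 1 <= level <= 7:
            raise ValueError(f"level must be in 1..7, got {level}")
        total[level] += 1
        if normalize(record["prediction"]) == normalize(record["answer"]):
            correct[level] += 1
    return {level: correct[level] / total[level] for level in sorted(total)}


# Made-up records for illustration only; these are not dataset entries.
records = [
    {"level": 1, "answer": "circle", "prediction": "Circle "},
    {"level": 1, "answer": "square", "prediction": "circle"},
    {"level": 7, "answer": "craft", "prediction": "art"},
]
result = accuracy_by_level(records)
print(result)
```

Reporting one accuracy figure per level, rather than a single aggregate score, is what lets a weakness on a particular level (for example, scene text) show up separately from strong results on other levels.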

Paper Structure

This paper contains 10 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 5: Questions at Level 7 on whole scene analysis and abstraction
  • Figure 6: Q, A, and L represent question, answer, and level, respectively. (a) shows a question on an invisible object; (b) requires analysis of the whole scene.