BloomVQA: Assessing Hierarchical Multi-modal Comprehension

Yunye Gong, Robik Shrestha, Jared Claypoole, Michael Cogswell, Arijit Ray, Christopher Kanan, Ajay Divakaran

TL;DR

BloomVQA introduces a theory-grounded framework for evaluating multi-modal comprehension by linking VQA tasks to Bloom's Taxonomy and organizing knowledge via a Story Graph. The core dataset comprises 1200 samples from 20 picture stories, labeled across six cognitive levels, with templates and 4-option answers; graph traversal augments it to about 12k samples. The work defines consistency metrics, including $P_{m,n}$ and $AP$, to assess alignment with human comprehension and the impact of context augmentation, and evaluates CLIP, BLIP, BLIP2, and GPT-4V under both text-only and cross-modal conditions. Findings show that higher-level tasks remain challenging for current systems. GPT-4V achieves the strongest accuracy but tends to bypass visual grounding and reasons inconsistently, which highlights the need for theoretically grounded evaluation and scalable augmentation to guide progress toward reliable multi-modal understanding.

Abstract

We propose a novel VQA dataset, BloomVQA, to facilitate comprehensive evaluation of large vision-language models on comprehension tasks. Unlike current benchmarks that often focus on fact-based memorization and simple reasoning tasks without theoretical grounding, we collect multiple-choice samples based on picture stories that reflect different levels of comprehension, as laid out in Bloom's Taxonomy, a classic framework for learning assessment widely adopted in education research. Our data maps to a novel hierarchical graph representation which enables automatic data augmentation and novel measures characterizing model consistency. We perform graded evaluation and reliability analysis on recent multi-modal models. In comparison to low-level tasks, we observe decreased performance on tasks requiring advanced comprehension and cognitive skills, with up to a 38.0% drop in VQA accuracy. In comparison to earlier models, GPT-4V demonstrates improved accuracy over all comprehension levels but shows a tendency to bypass visual inputs, especially for higher-level tasks. Current models also show consistency patterns misaligned with human comprehension in various scenarios, demonstrating the need for improvement based on theoretically grounded criteria.
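
The graded evaluation described above amounts to scoring multiple-choice predictions separately for each Bloom's level and comparing the levels. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The sample fields `level` and `answer`, the lowercase level names (taken from the revised Bloom's Taxonomy), and the helper names are all assumptions made for illustration. It also assumes the drop is measured in percentage points relative to the lowest level present, which the abstract does not specify.

```python
from collections import defaultdict

# Six levels of the revised Bloom's Taxonomy, lowest to highest.
# The lowercase labels are an assumption, not the dataset's actual schema.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def accuracy_by_level(samples, predictions):
    """Return {level: accuracy} for multiple-choice samples.

    samples: list of dicts with hypothetical keys "level" (a Bloom's level)
             and "answer" (index of the correct option among the choices).
    predictions: list of predicted option indices, aligned with samples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        total[sample["level"]] += 1
        correct[sample["level"]] += int(pred == sample["answer"])
    # Only report levels that actually occur in the data.
    return {lvl: correct[lvl] / total[lvl] for lvl in BLOOM_LEVELS if total[lvl]}


def max_drop_from_lowest(acc):
    """Largest accuracy drop, in percentage points, relative to the lowest level present."""
    levels = [lvl for lvl in BLOOM_LEVELS if lvl in acc]
    base = acc[levels[0]]
    return max((base - acc[lvl]) * 100 for lvl in levels)


# Toy data, invented purely to exercise the functions.
toy_samples = [
    {"level": "remember", "answer": 0},
    {"level": "remember", "answer": 1},
    {"level": "create", "answer": 2},
    {"level": "create", "answer": 3},
]
toy_predictions = [0, 1, 2, 0]

toy_acc = accuracy_by_level(toy_samples, toy_predictions)
toy_drop = max_drop_from_lowest(toy_acc)
```

Running the same function on predictions obtained without images would give the text-only (QA) counterpart, so the two per-level dictionaries can be compared side by side.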


Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Story graph: a hierarchical graph representation based on Bloom's Taxonomy
  • Figure 2: Graded evaluation on BloomVQA data following Bloom's Taxonomy. For VLP models, the VQA accuracy decreases as the task level increases, while the QA accuracy using no visual inputs remains low. For GPT-4V, the VQA accuracy greatly improves over all levels, while the comparison to QA accuracy suggests that the model tends to either bypass or even get confused by visual contents, especially at high levels.
  • Figure A1: We designed a Web-based UI that provides general instructions and detailed instructions for each Bloom's level to annotators without background expertise in the domain.
  • Figure A2: Example picture story: "Foxy joxy plays a trick"
  • Figure A3: Example questions and answers over different Bloom's levels on the picture story "Foxy joxy plays a trick"