CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding

Minjung Shin, Seongho Choi, Yu-Jung Heo, Minsu Lee, Byoung-Tak Zhang, Jeh-Kwang Ryu

TL;DR

CogME addresses the insufficiency of single-number metrics for evaluating video story understanding by introducing a cognition-inspired, multi-dimensional framework. It decomposes understanding into TARGET, CONTENT, and THINKING, grounded in human cognitive processes and Bloom's taxonomy, and applies this scheme to DramaQA with manual annotation to produce rich diagnostic profiles. The approach reveals model strengths and dataset biases that conventional QA accuracy misses, enabling targeted model development and more balanced data curation. This framework has potential to guide evaluation and design across multimodal storytelling tasks, advancing AI toward higher-order cognitive capabilities.

Abstract

We introduce CogME, a cognition-inspired, multi-dimensional evaluation metric designed for AI models focusing on story understanding. CogME is a framework grounded in the human thinking strategies and story elements involved in story understanding. By breaking each question down along these dimensions, the approach provides a nuanced assessment that reveals not only an AI model's particular strengths and weaknesses but also the characteristics of the benchmark dataset. Our case study with the DramaQA dataset demonstrates a refined analysis of both the model and the benchmark dataset. We argue for metrics that are based on an understanding of the nature of the task and designed to align closely with human cognitive processes. This approach provides insights beyond traditional overall scores and paves the way for more sophisticated AI development targeting higher cognitive functions.

Paper Structure

This paper contains 19 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: An illustration of the CogME framework using an example from the DramaQA dataset. An Agent predicts the Answer from the given Video clip and Question. Orange arrows indicate that the process involves three story-understanding components: TARGET, CONTENT, and THINKING.
  • Figure 2: Examples of tags applied to questions. (a) Tags applied to questions about shot-level video, which require simple recall. (b) Tags applied to questions about scene-level video, which require comprehensive reasoning.
  • Figure 3: Performance profiles of two models. Each polygon vertex represents the ratio (%) of correct predictions on the DramaQA dataset. The radar plots show the TARGET (left), CONTENT (middle), and THINKING (right) components. Light blue areas indicate the performance profiles of Agent I (MCM model), and pink areas indicate the performance profiles of Agent II (MemN2N model).
  • Figure 4: Frequencies of sub-components tagged in the questions of the DramaQA dataset. Each bar shows the number of times a sub-component was labeled out of 4,385 questions.
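
The quantities plotted in Figures 3 and 4 (per-sub-component accuracy and per-sub-component tag frequency) can be illustrated with a minimal Python sketch. This is not the authors' code: the record layout, the function name `profile`, and the sub-component names (`character`, `emotion`, `identity`, `causality`, `recall`, `reasoning`) are hypothetical placeholders, and the sketch assumes a question may carry more than one tag per component.

```python
from collections import defaultdict

# The three top-level components named in the paper.
COMPONENTS = ("TARGET", "CONTENT", "THINKING")


def profile(questions):
    """Return (frequency, accuracy_percent), both keyed by (component, sub_component).

    Each question is a dict with (hypothetical layout):
      "tags":    {component: [sub_component, ...]}
      "correct": bool, whether the model answered the question correctly
    """
    freq = defaultdict(int)      # how often each sub-component was tagged
    correct = defaultdict(int)   # how many of those questions were answered correctly
    for q in questions:
        for comp in COMPONENTS:
            for tag in q["tags"].get(comp, ()):
                key = (comp, tag)
                freq[key] += 1
                correct[key] += int(q["correct"])
    # Ratio (%) of correct predictions per sub-component.
    accuracy = {key: 100.0 * correct[key] / freq[key] for key in freq}
    return dict(freq), accuracy


# Toy data for illustration only; the tags are placeholders, not DramaQA annotations.
questions = [
    {"tags": {"TARGET": ["character"], "CONTENT": ["identity"],
              "THINKING": ["recall"]}, "correct": True},
    {"tags": {"TARGET": ["character", "emotion"], "CONTENT": ["causality"],
              "THINKING": ["reasoning"]}, "correct": False},
    {"tags": {"TARGET": ["emotion"], "CONTENT": ["identity"],
              "THINKING": ["recall"]}, "correct": True},
]

freq, acc = profile(questions)
for key in sorted(freq):
    print(key, freq[key], f"{acc[key]:.1f}%")
```

The `freq` dictionary corresponds to the kind of counts a bar chart like Figure 4 would display, and `acc` to the vertices of a radar plot like Figure 3. Reporting both together helps distinguish a sub-component on which a model is weak from one that is simply rare in the dataset.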