Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models

Bin Fu, Qiyang Wan, Jialin Li, Ruiping Wang, Xilin Chen

TL;DR

This work asks whether Large Multimodal Models (LMMs) possess genuine categorization ability. It introduces ComBo, a synthetic benchmark that disentangles category learning (perception) from category use (classification). ComBo is built from Composite Blocks with fully controllable attributes; each of the 9,504 objects is rendered from 20 viewpoints, giving 190,080 images. The benchmark supports three tasks: Pattern Perception, Abstraction Alignment, and Category Building. Evaluations of GPT-4V, Gemini, and open-source LMMs show that, although LMMs outperform traditional computer vision baselines in some respects, they still lag behind humans in perceiving spatial detail and in abstract reasoning; in-context learning and Chain-of-Thought prompting improve performance but do not close the gap. The benchmark offers a principled framework for diagnosing fundamental categorization capabilities and for guiding improvements toward more interpretable and generalizable multimodal cognition.
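The dataset figures above can be checked with a few lines of Python. This is a minimal sketch and not the authors' data format: the four field names follow the attributes listed in the Figure 4 caption (shape, color, material, contact point), while the class name `ComboObject`, the helper functions, the example attribute values, and the image-ID naming scheme are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

# Counts taken from the TL;DR.
NUM_OBJECTS = 9_504
VIEWPOINTS_PER_OBJECT = 20


@dataclass(frozen=True)
class ComboObject:
    """Hypothetical record for one composite block (field names are illustrative)."""
    shape: str
    color: str
    material: str
    contact_point: str


def total_images(num_objects: int, viewpoints: int) -> int:
    """Each object is rendered once per viewpoint."""
    return num_objects * viewpoints


def image_ids(object_index: int, viewpoints: int = VIEWPOINTS_PER_OBJECT) -> list[str]:
    """Hypothetical naming scheme: one image ID per (object, viewpoint) pair."""
    return [f"obj{object_index:05d}_view{v:02d}" for v in range(viewpoints)]


if __name__ == "__main__":
    # Attribute values below are made up for the example.
    example = ComboObject(shape="cube", color="red", material="metal", contact_point="top")
    print(example)
    print(total_images(NUM_OBJECTS, VIEWPOINTS_PER_OBJECT))
    print(image_ids(0)[:3])
```

The product of objects and viewpoints reproduces the 190,080 images stated in the summary, which confirms that the three numbers are mutually consistent.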

Abstract

Categorization, a core human cognitive ability that organizes objects based on common features, is essential to cognitive science as well as computer vision. To evaluate the categorization ability of visual AI models, various proxy recognition tasks have been proposed, ranging from fixed datasets to open-world scenarios. Recent Large Multimodal Models (LMMs) have demonstrated impressive results in high-level visual tasks, such as visual question answering and video temporal reasoning, by using advanced architectures and large-scale multimodal instruction tuning. Previous researchers have developed holistic benchmarks to measure the high-level visual capability of LMMs, but a pure, in-depth quantitative evaluation of the most fundamental categorization ability is still lacking. According to research on human cognitive processes, categorization can be seen as comprising two parts: category learning and category use. Inspired by this, we propose a novel, challenging, and efficient benchmark based on composite blocks, called ComBo, which provides a disentangled evaluation framework and covers the entire categorization process from learning to use. By analyzing the results of multiple evaluation tasks, we find that although LMMs exhibit acceptable generalization ability in learning new categories, gaps remain compared to humans in many respects, such as fine-grained perception of spatial relationships and abstract category understanding. Through the study of categorization, we can provide inspiration for the further development of LMMs in terms of interpretability and generalization.

Paper Structure (31 sections, 25 figures, 3 tables)

Figures (25)

  • Figure 1: Human behavior in categorization. People can group objects together based on common patterns, form mental representations of categories, and classify novel items.
  • Figure 2: The cognitive processes of humans and LMMs in categorization. Categorization can be modeled as a process of category learning and category use between concrete and abstract spaces. The proposed evaluation tasks are shown in green blocks.
  • Figure 3: Three progressive tasks for evaluating categorization. (a) Pattern Perception: Evaluating LMMs' low-level pattern recognition ability. (b) Abstraction Alignment: Comparing the abstract category representations of humans and LMMs. (c) Category Building: Examining LMMs' categorization ability on abstract unseen categories.
  • Figure 4: Overview of the Composite Blocks (ComBo) dataset: exemplar images and attributes. Each object is represented by four fully disentangled attributes: shape, color, material, and the contact point between the primary primitive and the secondary primitive.
  • Figure 5: Examples of QA pairs for the three evaluation tasks. Due to space constraints, prompts and answers are abbreviated. Refer to the supplementary materials for details.
  • ...and 20 more figures