Cognitive Science-Inspired Evaluation of Core Capabilities for Object Understanding in AI

Danaja Rutar, Alva Markelius, Konstantinos Voudouris, José Hernández-Orallo, Lucy Cheke

TL;DR

The paper tackles the problem of building AI with a robust, unified object understanding by integrating cognitive-science theories of objecthood (Gestalt, enactive, and developmental perspectives) with a critical evaluation of AI paradigms. It argues that current AI benchmarks typically assess isolated object capabilities and fail to capture the functional integration across perceptual grouping, behavioural prediction, and affordance reasoning that underpins human object understanding. By mapping AI approaches onto a two-dimensional space, with knowledge about the world on one axis and interaction with the world on the other, the authors identify four paradigms and analyse their strengths and limitations, highlighting the need for cohesive architectures in which the outputs of one capability inform the others. A promising direction is the LLM-AAI framework, which combines large language models with embodied simulation to leverage knowledge and interaction jointly, though practical challenges remain in embodiment and continual learning. Overall, the paper proposes a roadmap for moving from fragmented object capabilities toward generalised, context-aware object understanding in AI, with implications for more robust world models in real-world tasks.

Abstract

One of the core components of our world models is 'intuitive physics' - an understanding of objects, space, and causality. This capability enables us to predict events, plan actions, and navigate environments, all of which rely on a composite sense of objecthood. Despite its importance, there is no single, unified account of objecthood, though multiple theoretical frameworks provide insights. In the first part of this paper, we present a comprehensive overview of the main theoretical frameworks in objecthood research - Gestalt psychology, enactive cognition, and developmental psychology - and identify the core capabilities each framework attributes to object understanding, as well as the functional roles these capabilities play in shaping world models in biological agents. Given the foundational role of objecthood in world modelling, understanding objecthood is also essential in AI. In the second part of the paper, we evaluate how current AI paradigms approach and test objecthood capabilities compared to those in cognitive science. We define an AI paradigm as a combination of how objecthood is conceptualised, the methods used for studying objecthood, the data utilised, and the evaluation techniques. We find that, whilst benchmarks can detect that AI systems model isolated aspects of objecthood, they cannot detect when AI systems lack functional integration across these capabilities and therefore do not fully solve the objecthood challenge. Finally, we explore novel evaluation approaches that align with the integrated vision of objecthood outlined in this paper. These methods are promising candidates for advancing from isolated object capabilities toward general-purpose AI with genuine object understanding in real-world contexts.

Paper Structure

This paper contains 19 sections and 1 figure.

Figures (1)

  • Figure 1: The four conceptual quadrants of paradigms of objecthood in AI and the example paradigms discussed in this section. Top-left quadrant (1): No Interaction, High Knowledge about the World Object Understanding. Top-right quadrant (2): Combining Knowledge of the World with Interaction with the World. Bottom-left quadrant (3): No Interaction, Low Knowledge about the World Object Understanding. Bottom-right quadrant (4): High Interaction, No Knowledge about the World Object Understanding.
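
The quadrant layout in Figure 1 can be read as a lookup over two axes. The following is a minimal sketch of that reading, not the authors' formalisation: it assumes each axis is reduced to a binary low/high level (the caption itself mixes "No", "Low", and "High"), and the names `Level` and `quadrant` are hypothetical.

```python
from enum import Enum


class Level(Enum):
    """Assumed binary simplification of each axis in Figure 1."""
    LOW = 0   # covers the caption's "No" and "Low"
    HIGH = 1


def quadrant(knowledge: Level, interaction: Level) -> int:
    """Return the Figure 1 quadrant number (1-4) for a paradigm.

    knowledge:   level of knowledge about the world
    interaction: level of interaction with the world
    """
    table = {
        (Level.HIGH, Level.LOW): 1,   # top-left: knowledge, no interaction
        (Level.HIGH, Level.HIGH): 2,  # top-right: knowledge plus interaction
        (Level.LOW, Level.LOW): 3,    # bottom-left: neither
        (Level.LOW, Level.HIGH): 4,   # bottom-right: interaction, no knowledge
    }
    return table[(knowledge, interaction)]
```

For example, a paradigm that combines world knowledge with interaction, as the TL;DR describes for LLM-AAI, would fall in quadrant 2 under this simplification: `quadrant(Level.HIGH, Level.HIGH)`.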