Visual cognition in multimodal large language models

Luca M. Schulze Buschoff, Elif Akata, Matthias Bethge, Eric Schulz

TL;DR

An evaluation of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology shows that, while some models process visual data proficiently, they still fall short of human performance in these cognitive domains.

Abstract

A chief goal of artificial intelligence is to build machines that think like people. Yet it has been argued that deep neural network architectures fail to accomplish this. Researchers have asserted these models' limitations in the domains of causal reasoning, intuitive physics, and intuitive psychology. Yet recent advancements, namely the rise of large language models, particularly those designed for visual processing, have rekindled interest in the potential to emulate human-like cognitive abilities. This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology. Through a series of controlled experiments, we investigate the extent to which these modern models grasp complex physical interactions, causal relationships, and intuitive understanding of others' preferences. Our findings reveal that, while some of these models demonstrate a notable proficiency in processing and interpreting visual data, they still fall short of human capabilities in these areas. Our results emphasize the need for integrating more robust mechanisms for understanding causality, physical dynamics, and social cognition into modern-day, vision-based language models, and point out the importance of cognitively-inspired benchmarks.

Paper Structure (11 sections, 12 figures)

Figures (12)

  • Figure 1: Overview of domains, tasks, approach, and models. A: Example images for the different experiments. Each experiment was taken from one of three cognitive domains: intuitive physics, causal reasoning, and intuitive psychology. B: General approach. For every query, an image was submitted to the model, and different questions were asked about the image, i.e. we performed visual question answering. C: The multimodal large language models used and their sizes.
  • Figure 2: Results for five vision large language models on tasks of increasing complexity, given images of real block towers from Lerer et al. (2016). We first ask for the background color in the image (A), then the color of the blocks from top to bottom (B), and finally a binary stability rating for the block towers (C). The last plot shows the square root of the $R^2$ value for the Bayesian logistic mixed effects regression between models and human subjects (D). Error bars in plots A-C are given by the standard deviation of a binomial distribution. Error bars in plot D are given by the square root of the 95% percentiles for the Bayesian $R^2$ value.
  • Figure 3: Results for the causal reasoning experiment from Zhou et al. (2023). We first ask for the number of blocks in the image (A), then the number of blocks that would fall if a specific block were removed (B and C), and finally a rating between 0 and 100 for how responsible a specific block is for the stability of the tower (D). For the responsibility ratings, all LLMs except for GPT-4V give constant ratings: Fuyu and Claude-3 always respond with 100, while Otter and LLaMA-Adapter V2 always respond with 50. Error bars in plots A-C are given by the standard error of the mean, while the error bars in plot D are given by the square root of the 95% percentiles for the Bayesian $R^2$ value.
  • Figure 4: Results for the causal reasoning experiment taken from Gerstenberg et al. (2017). We first ask for the background color in the image (A), then the direction of ball movement (B), a judgement between 0 and 100 on whether ball "B" goes through the gate (C), and finally a counterfactual judgement between 0 and 100 on whether ball "B" would have gone through the gate, had ball "A" not been present in the scene (D). The error bars in plots C and D are given by the square root of the 95% percentiles for the Bayesian $R^2$ value.
  • Figure 5: Results on tasks for intuitive psychology taken from Jara-Ettinger et al. (2020). Again, we first ask for the background color (A) and the number of boxes in the scene (B). Models are then asked to make inferences about the costs and rewards in an environment depending on the path an agent has taken (C and D). Regression coefficients for Fuyu and LLaMA-Adapter V2 are missing as they always responded with constant ratings for either cost or reward questions. Error bars in plot A are given by the standard deviation of a binomial distribution, while the error bars in plots C and D are given by the square root of the 95% percentiles for the Bayesian $R^2$ value.
  • ...and 7 more figures
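
Several captions above describe error bars as "the standard deviation of a binomial distribution" or as the square root of Bayesian $R^2$ percentiles, without giving formulas. The sketch below shows one plausible reading and is not the authors' analysis code. It assumes the binomial error bar for an accuracy is the standard deviation of the sample proportion, sqrt(p(1 - p) / n), and that the $R^2$ point estimate and interval bounds are simply square-rooted. The function names and the example numbers are hypothetical.

```python
import math


def binomial_error_bar(num_correct: int, num_trials: int) -> tuple[float, float]:
    """Return (accuracy, error bar) for a set of binary-scored answers.

    Assumption: the error bar is the standard deviation of the sample
    proportion, sqrt(p * (1 - p) / n). The captions do not state this formula.
    """
    if num_trials <= 0:
        raise ValueError("num_trials must be positive")
    if not 0 <= num_correct <= num_trials:
        raise ValueError("num_correct must lie between 0 and num_trials")
    p = num_correct / num_trials
    return p, math.sqrt(p * (1.0 - p) / num_trials)


def sqrt_r2_with_interval(r2: float, lower: float, upper: float) -> tuple[float, float, float]:
    """Square-root an R^2 estimate together with its interval bounds.

    Assumption: the plotted quantity is sqrt(R^2) and the error bars are the
    square roots of the lower and upper percentiles of R^2.
    """
    for value in (r2, lower, upper):
        if value < 0:
            raise ValueError("R^2 values must be non-negative")
    return math.sqrt(r2), math.sqrt(lower), math.sqrt(upper)


# Hypothetical numbers, not taken from the paper.
accuracy, err = binomial_error_bar(num_correct=50, num_trials=100)
point, low, high = sqrt_r2_with_interval(0.25, 0.16, 0.36)
```

Taking the square root puts the regression fit on a correlation-like scale, which may be why the captions report it that way; this interpretation is also an assumption.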