Machine Psychophysics: Cognitive Control in Vision-Language Models

Dezhi Luo, Maijunxian Wang, Bingyang Wang, Tianwei Zhao, Yijiang Li, Hokin Deng

TL;DR

The paper investigates whether cognitive control, which is central to flexible, goal-directed behavior, emerges in vision-language models (VLMs). It administers classic conflict tasks (Stroop, Flanker) and their more demanding squared variants to 108 VLMs across 2,220 trials. It develops a controlled benchmarking framework with synthetic, standardized stimuli and a unified evaluation toolkit to compare models spanning 1B–110B parameters. Results show human-like congruency effects, pronounced inter-model differences under higher cognitive demand, and scaling patterns indicating that larger models better resist interference, especially in squared tasks. The findings suggest cognitive control can arise from large-scale associative learning in multimodal systems, with implications for artificial general intelligence and the design of more flexible AI systems.

Abstract

Cognitive control refers to the ability to flexibly coordinate thought and action in pursuit of internal goals. A standard method for assessing cognitive control involves conflict tasks that contrast congruent and incongruent trials, measuring the ability to prioritize relevant information while suppressing interference. We evaluate 108 vision-language models on three classic conflict tasks and their more demanding "squared" variants across 2,220 trials. Model performance corresponds closely to human behavior under resource constraints and reveals individual differences. These results indicate that some form of human-like executive function has emerged in current multi-modal foundational models.
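The congruent-versus-incongruent contrast described in the abstract can be illustrated with a minimal sketch. It assumes the congruency effect is scored as the accuracy difference between congruent and incongruent trials; the paper's actual scoring procedure is not given here, and the `Trial` record and function names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One conflict-task trial (hypothetical record, not the paper's schema)."""
    congruent: bool  # True if the distractor agrees with the target
    correct: bool    # True if the model's response was correct


def accuracy(trials):
    """Fraction of correct responses; raises on an empty condition."""
    if not trials:
        raise ValueError("no trials in this condition")
    return sum(t.correct for t in trials) / len(trials)


def congruency_effect(trials):
    """Accuracy on congruent trials minus accuracy on incongruent trials.

    A positive value means the model is hurt by conflicting distractors.
    This difference score is an assumed metric, not confirmed by the source.
    """
    congruent = [t for t in trials if t.congruent]
    incongruent = [t for t in trials if not t.congruent]
    return accuracy(congruent) - accuracy(incongruent)


# Toy data: 4 congruent trials (all correct), 4 incongruent trials (2 correct).
toy_trials = (
    [Trial(congruent=True, correct=True)] * 4
    + [Trial(congruent=False, correct=True)] * 2
    + [Trial(congruent=False, correct=False)] * 2
)
effect = congruency_effect(toy_trials)
```

On this toy data, congruent accuracy is 1.0 and incongruent accuracy is 0.5, so the difference score is 0.5. A model with perfect interference resistance would score 0.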

Paper Structure

This paper contains 22 sections, 5 figures.

Figures (5)

  • Figure 1: Standard Tasks. In the Stroop task, models were asked to indicate the color a word is printed in while disregarding the word’s meaning. In the Flanker tasks, models were asked to identify either the central letter or number while ignoring the surrounding distractors ("flankers").
  • Figure 2: Squared Tasks. In Stroop Squared, models were asked to select the response option whose word meaning matches the display color of the target word. In Flanker Squared, they choose the option where the central letter or number matches the identity of the surrounding distractors in the target stimulus. The correct response for all example trials shown is the option on the right.
  • Figure 3: Control Tasks. Top row (left to right): Color recognition, word recognition, color-word binding (color recognition with a word or word recognition with color), and spatial recognition with combined word and color cues. Bottom row (left to right): Arrow-based control (reporting directions as opposed to characters) with fewer flankers, central character identification, surrounding character detection, and spatial recognition with characters. For the two Squared-like tasks, models were asked to directly report the content of the left or right option.
  • Figure 4: Model Performances on Standard and Squared Tasks Compared Between Conditions.
  • Figure 5: Model Performance in Relation to Scaling.