Machine Psychophysics: Cognitive Control in Vision-Language Models
Dezhi Luo, Maijunxian Wang, Bingyang Wang, Tianwei Zhao, Yijiang Li, Hokin Deng
TL;DR
The paper investigates whether cognitive control, which is central to flexible, goal-directed behavior, emerges in vision-language models (VLMs). It adapts classic conflict tasks (including Stroop and Flanker) and their more demanding squared variants, and evaluates 108 VLMs across 2,220 trials. It develops a controlled benchmarking framework with synthetic, standardized stimuli and a unified evaluation toolkit to compare models spanning 1B–110B parameters. Results show human-like congruency effects, pronounced inter-model differences under higher cognitive demand, and scaling patterns indicating that larger models resist interference better, especially in squared tasks. The findings suggest that cognitive control can arise from large-scale associative learning in multimodal systems, with implications for artificial general intelligence and the design of more flexible AI systems.
Abstract
Cognitive control refers to the ability to flexibly coordinate thought and action in pursuit of internal goals. A standard method for assessing cognitive control involves conflict tasks that contrast congruent and incongruent trials, measuring the ability to prioritize relevant information while suppressing interference. We evaluate 108 vision-language models on three classic conflict tasks and their more demanding "squared" variants across 2,220 trials. Model performance corresponds closely to human behavior under resource constraints and reveals individual differences. These results indicate that some form of human-like executive function has emerged in current multi-modal foundational models.
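To make the congruent-versus-incongruent contrast concrete, the following is a minimal sketch of how a congruency effect could be computed from per-trial accuracy. It is an illustration under stated assumptions, not the paper's evaluation toolkit: the `Trial` record, the `accuracy` and `congruency_effect` helpers, and the toy data are all hypothetical, and accuracy difference is just one common way to score interference.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trial:
    """Hypothetical record of one conflict-task trial (not the paper's schema)."""
    congruent: bool  # True if relevant and irrelevant stimulus features agree
    correct: bool    # True if the model's response matched the target


def accuracy(trials: List[Trial]) -> float:
    """Fraction of correct responses; raises if there are no trials."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.correct for t in trials) / len(trials)


def congruency_effect(trials: List[Trial]) -> float:
    """Accuracy on congruent trials minus accuracy on incongruent trials.

    A larger positive value means more interference from the irrelevant feature.
    """
    congruent = [t for t in trials if t.congruent]
    incongruent = [t for t in trials if not t.congruent]
    return accuracy(congruent) - accuracy(incongruent)


# Toy data (made up for illustration): perfect on congruent trials,
# half correct on incongruent trials.
demo = (
    [Trial(congruent=True, correct=True)] * 4
    + [Trial(congruent=False, correct=True)] * 2
    + [Trial(congruent=False, correct=False)] * 2
)
print(congruency_effect(demo))  # 0.5
```

Under this scoring, a model that fully suppresses interference would show an effect near zero, while a model that is misled by the irrelevant feature would show a large positive effect.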
