Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
Zabir Al Nazi, G M Shahariar, Abrar Hossain, Wei Peng
TL;DR
This work introduces CulturalToM-VQA, a cross-cultural vision-language Theory of Mind (ToM) benchmark of 5,095 question–answer pairs across three data splits, designed to evaluate how culture shapes mental-state inference in multimodal models. The dataset is built with a human-in-the-loop (HITL) annotation pipeline that generates structured ToM scene descriptions, organized under a six-task taxonomy with four complexity levels, and uses automated and human validation to ensure cultural fidelity. A broad evaluation of open-source VLMs shows that newer architectures achieve strong zero-shot performance on explicit ToM tasks but falter on deeper, culture-grounded reasoning such as false belief, social norms, and multi-agent perspective coordination, highlighting the need for culturally informed reasoning in model alignment. The dataset enables systematic analysis of culture-driven ToM abilities and provides a framework for moving toward culture-aware social intelligence in multimodal AI, while acknowledging the ethical considerations and biases inherent in synthetic cultural cues and model-generated annotations.
Abstract
Theory of Mind (ToM), the ability to attribute beliefs, desires, and emotions to others, is fundamental to human social intelligence, yet remains a major challenge for artificial agents. Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5,095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline: human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions; and these descriptions are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse Theory of Mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
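As an illustration of how the six-task, four-level taxonomy could be used to break results down per task, the following is a minimal Python sketch. It is not the authors' released format or evaluation code: the class names, the field names (image_id, question, answer, task, complexity), the integer encoding of complexity levels as 1 to 4, and the case-insensitive exact-match scoring are all assumptions made for this example. Only the six task names and the count of four levels come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ToMTask(Enum):
    """The six ToM facets named in the abstract."""
    MENTAL_STATE_ATTRIBUTION = "mental state attribution"
    FALSE_BELIEF = "false belief reasoning"
    NON_LITERAL_COMMUNICATION = "non-literal communication"
    SOCIAL_NORM_VIOLATION = "social norm violations"
    PERSPECTIVE_COORDINATION = "perspective coordination"
    MULTI_AGENT = "multi-agent reasoning"


@dataclass(frozen=True)
class QAItem:
    """One question-answer pair (hypothetical schema, not the released one)."""
    image_id: str
    question: str
    answer: str
    task: ToMTask
    complexity: int  # assumed encoding: 1 (simplest) to 4 (hardest)

    def __post_init__(self):
        # The abstract states four graded complexity levels.
        if not 1 <= self.complexity <= 4:
            raise ValueError("complexity must be between 1 and 4")


def accuracy_by_task(items, predictions):
    """Return per-task accuracy using case-insensitive exact match (an assumption)."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.task] += 1
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.task] += 1
    return {task: correct[task] / total[task] for task in total}
```

Grouping by `item.complexity` instead of `item.task` would give the analogous breakdown across the four complexity levels.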
