
Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models

Jayanta Sadhu, Ayan Antik Khan, Noshin Nawal, Sanju Basak, Abhik Bhattacharjee, Rifat Shahriyar

TL;DR

This work conducts extensive evaluations of six state-of-the-art LLMs to measure their ToM performance across both the translated and culturally adapted datasets, and highlights the influence of linguistic and cultural diversity on the models' ability to exhibit ToM.

Abstract

Theory of Mind (ToM) refers to the cognitive ability to infer and attribute mental states to oneself and others. As large language models (LLMs) are increasingly evaluated for social and cognitive capabilities, it remains unclear to what extent these models demonstrate ToM across diverse languages and cultural contexts. In this paper, we introduce a comprehensive study of multilingual ToM capabilities aimed at addressing this gap. Our approach includes two key components: (1) we translate existing ToM datasets into multiple languages, effectively creating a multilingual ToM dataset, and (2) we enrich these translations with culturally specific elements to reflect the social and cognitive scenarios relevant to diverse populations. We conduct extensive evaluations of six state-of-the-art LLMs to measure their ToM performance across both the translated and culturally adapted datasets. The results highlight the influence of linguistic and cultural diversity on the models' ability to exhibit ToM and call their social reasoning capabilities into question. This work lays the groundwork for future research into enhancing LLMs' cross-cultural social cognition and contributes to the development of more culturally aware and socially intelligent AI systems. All our data and code are publicly available.

Paper Structure

This paper contains 27 sections, 9 figures, 10 tables.

Figures (9)

  • Figure 1: A ToM sample data point consisting of (a) the type of task, (b) a story capturing a specific scenario, (c) the type of ability being assessed, (d) a question assessing the model's ability to infer the emotions or underlying intentions of a character, and (e) multiple answer options providing plausible explanations of the character's actions, only one of which is the correct interpretation based on ToM principles.
  • Figure 2: Process of cultural element induction in a discrepant emotions task. The generic story is culturally adapted to reflect a Western context. Despite the cultural modifications, the core narrative remains unchanged.
  • Figure 3: Comparative analysis of LLMs' performance across ToM tasks and abilities under different setups. (Note that the per-task and per-ability scores in the English and Russian task-comparison subfigures are averaged over the top three performing LLMs: Claude-3.5-Sonnet, GPT-4o, and Llama-3.1-8b.)
  • Figure 4: Multi-step translation process for MultiToM. The Data-Handler provides each data point along with necessary metadata. Agent-1 (GPT-4) translates the data point into the specified language. Agent-2 (GPT-3.5) reviews the translation, comparing it with the original text and suggesting possible modifications. Finally, Agent-3 (GPT-3.5) refines the translation based on the feedback, and the Data-Handler saves the final version.
  • Figure 5: Comparison of generic and culturally adapted tasks, illustrating the models' tendency to change answer choices when cultural nuances are added.
  • ...and 4 more figures
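
The data point layout in Figure 1 and the translate, review, refine loop in Figure 4 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name, field names, and the three agent callables are hypothetical, the agents are passed in as plain functions rather than real GPT-4 or GPT-3.5 calls, and leaving the task and ability labels untranslated is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class ToMDataPoint:
    """Hypothetical container mirroring fields (a)-(e) of Figure 1."""
    task: str           # (a) type of task
    story: str          # (b) story capturing a specific scenario
    ability: str        # (c) type of ability being assessed
    question: str       # (d) question about emotions or intentions
    options: List[str]  # (e) answer options, exactly one correct
    answer_index: int   # position of the correct option (assumed field)


# Assumed agent signatures; a real setup would wrap LLM API calls.
Translator = Callable[[str, str], str]         # (text, language) -> draft
Reviewer = Callable[[str, str, str], str]      # (original, draft, language) -> feedback
Refiner = Callable[[str, str, str, str], str]  # (original, draft, feedback, language) -> final


def translate_text(text: str, language: str, translate: Translator,
                   review: Reviewer, refine: Refiner) -> str:
    """Run one text field through the three-agent loop."""
    draft = translate(text, language)                 # Agent-1: initial translation
    feedback = review(text, draft, language)          # Agent-2: compare with original
    return refine(text, draft, feedback, language)    # Agent-3: apply the feedback


def translate_data_point(point: ToMDataPoint, language: str,
                         translate: Translator, review: Reviewer,
                         refine: Refiner) -> ToMDataPoint:
    """Translate story, question, and options; metadata stays unchanged (assumption)."""
    def run(text: str) -> str:
        return translate_text(text, language, translate, review, refine)

    return replace(
        point,
        story=run(point.story),
        question=run(point.question),
        options=[run(option) for option in point.options],
    )
```

Passing the agents in as callables keeps the sketch testable with simple stand-in functions and leaves the choice of model and prompt outside the pipeline logic.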