LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations

Viet-Thanh Pham; Lizhen Qu; Thuy-Trang Vu; Gholamreza Haffari; Dinh Phung

TL;DR

LiveCultureBench is a multi-cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and adherence to socio-cultural norms.

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and adherence to socio-cultural norms. The simulation models a small city as a location graph with synthetic residents having diverse demographic and cultural profiles. Each episode assigns one resident a daily goal while others provide social context. An LLM-based verifier generates structured judgments on norm violations and task progress, which we aggregate into metrics capturing task-norm trade-offs and verifier uncertainty. Using LiveCultureBench across models and cultural profiles, we study (i) cross-cultural robustness of LLM agents, (ii) how they balance effectiveness against norm sensitivity, and (iii) when LLM-as-a-judge evaluation is reliable for automated benchmarking versus when human oversight is needed.
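
To make the setup described in the abstract concrete, the following is a minimal sketch of a town modeled as a location graph, plus one possible way to aggregate per-step verifier judgments into episode-level scores. It is not the authors' implementation: the location names, the `Judgment` fields, and the helpers `can_move` and `aggregate` are all hypothetical, and the two scores shown (share of steps without a norm violation, best task progress reached) are assumed stand-ins for the paper's actual metrics.

```python
from dataclasses import dataclass

# Hypothetical location graph: keys are places, values are directly reachable places.
TOWN = {
    "home": ["cafe", "market"],
    "cafe": ["home", "temple"],
    "market": ["home", "temple"],
    "temple": ["cafe", "market"],
}


@dataclass
class Judgment:
    """One structured verifier judgment for a single step (illustrative fields)."""
    norm_violated: bool
    task_progress: float  # assumed to lie in [0, 1]


def can_move(graph, src, dst):
    """Return True if dst is directly reachable from src in the location graph."""
    return dst in graph.get(src, [])


def aggregate(judgments):
    """Collapse per-step judgments into two assumed episode-level scores."""
    if not judgments:
        raise ValueError("need at least one judgment")
    violations = sum(j.norm_violated for j in judgments)
    return {
        # Fraction of steps on which the verifier flagged no norm violation.
        "norm_adherence": 1 - violations / len(judgments),
        # Highest task progress the verifier reported during the episode.
        "task_completion": max(j.task_progress for j in judgments),
    }


episode = [
    Judgment(norm_violated=False, task_progress=0.25),
    Judgment(norm_violated=True, task_progress=0.5),
    Judgment(norm_violated=False, task_progress=1.0),
    Judgment(norm_violated=False, task_progress=1.0),
]
scores = aggregate(episode)
```

Reporting the two scores separately, rather than as one combined number, keeps the task-norm trade-off visible: an agent can finish its goal while still violating norms along the way.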

Paper Structure (61 sections, 7 equations, 5 figures, 15 tables)

Introduction
LiveCultureBench
Environment and Time Model
Time Representation.
Spatial Representation.
Agents and Profiles
Profile sampling.
Internal memory.
State and Action Spaces
State space.
Action space.
State transition.
Goals and Subtasks for the Target Agent
Cultural Norms and Supporting Agents
Location norms.
...and 46 more sections

Figures (5)

Figure 1: Illustration of our proposed social simulation framework, LiveCultureBench. LLM-based agents are spawned in a dynamic town environment, and a dedicated Verifier Agent living outside of the simulation is used to evaluate the Target Agent's performance and behaviors on task completion and cultural norm adherence.
Figure 2: Target Agent performance from different LLM backbones.
Figure 3: Analysis of performance of different LLMs when (i) interacting in multicultural scenarios, and (ii) interacting in different locations.
Figure 4: Conformal sampling results for different LLMs as our Verifier Agent.
Figure 5: Target Agent performance from different LLM backbones.
