Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare
Zhengliang Shi, Ruotian Ma, Jen-tse Huang, Xinbei Ma, Xingyu Chen, Mengru Wang, Qu Yang, Yue Wang, Fanghua Ye, Ziyang Chen, Shanyi Wang, Cixing Li, Wenxuan Wang, Zhaopeng Tu, Xiaolong Li, Zhaochun Ren, Linus
TL;DR
The paper introduces the Social Welfare Function (SWF) Benchmark, a dynamic, long-horizon simulation in which an LLM acts as a sovereign allocator that distributes tasks while balancing efficiency against fairness. Efficiency is measured by Return on Investment (ROI) and fairness by the Gini coefficient; the two are combined into a single score, SWF = (1 − Gini) × ROI. Using a 63-task-flow framework with 12 heterogeneous recipient agents, the authors evaluate 20 SOTA LLMs and show that general conversational ability poorly predicts allocation skill, that a utilitarian bias favoring efficiency is prevalent, and that allocation strategies are highly sensitive to output length and social-influence framing. The results underscore the risks of deploying LLMs as societal decision-makers without targeted alignment and specialized benchmarks. They point to directions for governance-friendly AI, including normative prompts and explicit ethical constraints, and carry practical implications for AI governance, content-policy design, and robust evaluation tools tailored to welfare-distribution tasks.
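To make the score concrete, the following minimal Python sketch computes SWF = (1 − Gini) × ROI for a single allocation. It rests on assumptions not stated in this summary: the Gini coefficient is taken over a per-recipient quantity (here a hypothetical list called `allocations`), ROI is supplied as an already-computed number, and an empty or all-zero allocation is treated as perfectly equal. The function names `gini` and `swf_score` are illustrative and are not taken from the paper.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0.0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        # Assumed convention: nothing to distribute counts as equal.
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # where x is sorted ascending and i runs from 1 to n.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def swf_score(roi: float, allocations: Sequence[float]) -> float:
    """Combine efficiency (ROI) and fairness (1 - Gini) into one score."""
    return (1.0 - gini(allocations)) * roi


# Same ROI, different distributions across four hypothetical recipients.
even = swf_score(2.0, [1, 1, 1, 1])      # Gini = 0.0  -> score 2.0
skewed = swf_score(2.0, [0, 0, 0, 4])    # Gini = 0.75 -> score 0.5
print(even, skewed)
```

Under this formulation the fairness term rescales ROI, so an allocator that concentrates everything on one recipient must earn a much higher ROI to match the score of an even split.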
Abstract
Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) a model's general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill; (ii) most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the cost of severe inequality; and (iii) allocation strategies are fragile, easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.
