What Would an LLM Do? Evaluating Large Language Models for Policymaking to Alleviate Homelessness

Pierre Le Coz, Jia An Liu, Debarun Bhattacharjya, Georgina Curto, Serge Stinckwich

TL;DR

This work examines homelessness policymaking by evaluating whether large language models (LLMs) align with domain experts, using a benchmark grounded in the Capability Approach (CA) that spans four cities and a universal context. It operationalizes the CA in an agent-based modeling (ABM) framework by mapping policy proposals to a SAT matrix with $14$ needs and $11$ actions, so that the impact of a policy on people experiencing homelessness (PEH) can be tested prospectively in simulation. The contributions include a novel cross-city benchmark, expert baselines, and an automated LLM-ABM pipeline that translates narrative policy proposals into agent behavior to compare social impacts. Findings show substantial variation in LLM policy preferences across models and contexts. Even so, LLM-recommended policies can achieve comparable or better aggregate PEH needs satisfaction in ABM simulations when guided by CA framing and local calibration, which underscores the need for guardrails and local expertise in deployment. The work advances scalable, dignity-centered policy-testing methods for homelessness and informs responsible, context-aware use of LLMs in civic decision-making.

Abstract

Large language models (LLMs) are increasingly being adopted in high-stakes domains. Their potential to encode evolving social contexts and to generate plausible scenarios positions them as promising tools in social policymaking. This article evaluates whether LLMs are aligned with domain experts (and among themselves) on policy recommendations to alleviate homelessness, a challenge affecting over 150 million people worldwide. We develop a novel benchmark composed of decision scenarios across four cities, with policy choices that are grounded in the conceptual framework of the Capability Approach for human development. We also present an automated pipeline that connects the policies to an agent-based model in one location, and compare the social impact of the policies recommended by LLMs to those recommended by experts. Our exploratory analysis reveals variation across LLMs in their policy recommendations compared to local experts, yet suggests potential benefits of using LLMs to provide insights for policymaking, if paired with responsible guardrails, contextual calibrations, and local domain expertise. Our work operationalizes the Capability Approach in a computational framework and provides new insights on homelessness alleviation policymaking with a focus on human dignity.

Paper Structure

This paper contains 45 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Methodology overview: We construct a benchmark via a structured prompting strategy grounded in the Capability Approach for human development (CA), prompt various LLMs to act as policymakers, analyze their choices through comparison with human expert recommendations, and evaluate their projected societal impact using a modular agent-based modeling pipeline.
  • Figure 2: Pairwise comparison of the choices and judgments of various LLMs using heat maps of the following: a) the fraction of common top choices, b) similarity between justifications for top choices (as measured using Sentence-BERT embeddings), and c) comparison of rankings of choices (as measured by the normalized Kendall tau distance).
  • Figure 3: Comparison of capabilities prioritized in the top policy choices of human experts and GPT-4.1 across all scenarios.
  • Figure 4: Comparing top choice overlap between four LLMs and the primary domain expert in Johannesburg, with and without prompting LLMs to consider the local context when selecting policies.
  • Figure 5: Pairwise comparison of the top choices of various LLMs (with each other as well as the local primary expert) using heat maps for the three geographic regions that included expert assessments. Ten contextualized scenarios are used when comparing experts with LLMs, whereas all 40 contextualized scenarios are considered when comparing LLMs with each other.
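
Two of the agreement measures named in the Figure 2 caption, the fraction of common top choices and the normalized Kendall tau distance, can be illustrated with a minimal Python sketch. It assumes the standard definition of the normalized distance (discordant pairs divided by the total number of pairs); this page does not state the exact normalization the paper uses, and the function names and the policy labels "A" to "D" are hypothetical.

```python
from itertools import combinations


def normalized_kendall_tau_distance(rank_a, rank_b):
    """Fraction of item pairs that the two rankings order differently.

    Assumes the standard normalization: 0.0 means identical rankings,
    1.0 means one ranking is the exact reverse of the other.
    """
    if set(rank_a) != set(rank_b) or len(rank_a) != len(rank_b):
        raise ValueError("rankings must cover the same items exactly once")
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    if not pairs:
        return 0.0  # fewer than two items: nothing can be discordant
    discordant = sum(
        1
        for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)


def top_choice_overlap(tops_a, tops_b):
    """Fraction of scenarios in which two raters picked the same top policy."""
    if len(tops_a) != len(tops_b) or not tops_a:
        raise ValueError("need one top choice per scenario from each rater")
    return sum(a == b for a, b in zip(tops_a, tops_b)) / len(tops_a)


# Hypothetical rankings of four policies by two raters (best first).
model_ranking = ["A", "B", "C", "D"]
expert_ranking = ["B", "A", "C", "D"]
distance = normalized_kendall_tau_distance(model_ranking, expert_ranking)

# Hypothetical top choices over four scenarios.
overlap = top_choice_overlap(["A", "B", "C", "D"], ["A", "C", "C", "B"])
```

In this toy input the two rankings disagree on one of six pairs, and the raters share a top choice in two of four scenarios. Computing either measure for every pair of raters would give the kind of matrix that a heat map displays.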