InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems

Shaojie Shi, Zhengyu Shi, Lingran Zheng, Xinyu Su, Anna Xie, Bohao Lv, Rui Xu, Zijian Chen, Zhichao Chen, Guolei Liu, Naifu Zhang, Mingjian Dong, Zhuo Quan, Bohao Chen, Teqi Hao, Yuan Qi, Yinghui Xu, Libo Wu

Abstract

Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer-reviewed studies across diverse policy domains. Experimental results show that state-of-the-art LLMs struggle under this setting. To address this limitation, we further propose a multi-agent framework, STRIDES. It achieves significant performance improvements over state-of-the-art reasoning models. Our code and data are available at https://github.com/Sii-yuning/STRIDES.

Paper Structure (57 sections, 10 equations, 3 figures, 18 tables)

Figures (3)

  • Figure 1: Comparison between closed-form mathematical reasoning (Panel A) and open-ended social-science causal inference (Panel B).
  • Figure 2: Overview of our proposed system. The system has three stages. (1) Benchmark construction uses a Human-in-the-Loop MAS: a coordinator schedules a Paper Interpreter and Causal Designer to produce a draft causal design, a Verifier checks and routes low-quality designs for human review, and a Formatter converts approved designs into a standardized JSON schema as ground-truth designs. (2) STRIDES mirrors expert research workflows through three sequential modules. (i) The Strategic Research Design module maps unstructured metadata to statistical models using two specialized agents. (ii) The Data Environment Instantiation module employs a data retrieval agent to derive measurable indicators from model specifications and a simulation agent to generate mock data. (iii) The Code-Based Analysis and Verification module uses a code agent to produce executable statistical code and a critic agent to provide iterative feedback for design refinement, followed by a summary agent that standardizes qualified predicted designs. (3) Evaluation is performed by a separate grader module (not part of the generation pipeline), which compares model predictions against expert-verified ground truth and outputs evaluation scores.
  • Figure 3: Case study of generating InterveneBench. It compares the raw output that our original LLM-based multi-agent system generated from the paper hammarlund2025impact with the final output refined through human expert validation.
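
The three sequential STRIDES modules described in the Figure 2 caption can be sketched as a small control loop. This is only an illustrative sketch, not the authors' implementation: every name here (`Design`, `design_stage`, `data_stage`, `code_stage`, `critic`, `summarize`, `run_pipeline`) is hypothetical, the agents are replaced by deterministic stubs instead of LLM calls, and the critic's acceptance rule (requiring a `unit_fe` term) is a toy assumption chosen so that the feedback loop runs once.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Design:
    """Hypothetical container for one predicted causal design."""
    model_spec: str
    indicators: List[str] = field(default_factory=list)
    mock_rows: List[Dict[str, float]] = field(default_factory=list)
    code: str = ""
    revisions: int = 0


def design_stage(metadata: str) -> Design:
    # Stand-in for the Strategic Research Design module:
    # maps unstructured metadata to a statistical model (fixed stub here).
    return Design(model_spec="outcome ~ treated * post")


def data_stage(design: Design) -> Design:
    # Stand-in for Data Environment Instantiation: derive indicator names
    # from the model specification, then generate a few rows of mock data.
    names = re.findall(r"[a-z_]+", design.model_spec)
    design.indicators = list(dict.fromkeys(names))  # dedupe, keep order
    design.mock_rows = [
        {name: float(i) for name in design.indicators} for i in range(4)
    ]
    return design


def code_stage(design: Design) -> Design:
    # Stand-in for the code agent: emit a (non-executed) analysis snippet.
    design.code = f"fit({design.model_spec!r}, data=mock_rows)"
    return design


def critic(design: Design) -> Tuple[bool, str]:
    # Toy acceptance rule (an assumption): require a unit fixed-effect term.
    if "unit_fe" in design.model_spec:
        return True, ""
    return False, "unit_fe"


def summarize(design: Design) -> Dict[str, object]:
    # Stand-in for the summary agent: standardize the accepted design.
    return {
        "model_spec": design.model_spec,
        "indicators": design.indicators,
        "n_mock_rows": len(design.mock_rows),
        "revisions": design.revisions,
    }


def run_pipeline(metadata: str, max_rounds: int = 3) -> Dict[str, object]:
    design = design_stage(metadata)
    for _ in range(max_rounds):
        design = code_stage(data_stage(design))
        accepted, missing_term = critic(design)
        if accepted:
            break
        # Critic feedback refines the design, then the later stages rerun.
        design.model_spec += f" + {missing_term}"
        design.revisions += 1
    return summarize(design)


result = run_pipeline("hypothetical policy study metadata")
```

In this sketch the critic rejects the first draft, the specification gains the missing term, and the data and code stages rerun before the summary step. Whether the real system feeds critic feedback back to the design module or only to the code agent is not specified in the caption; the loop above assumes the former. The separate grader module is deliberately left out, since the caption states it is not part of the generation pipeline.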