Scalable Delphi: Large Language Models for Structured Risk Estimation

Tobias Lorenz; Mario Fritz

TL;DR

This work introduces Scalable Delphi, an LLM-driven approach to structured expert elicitation that preserves the multi-round, feedback-driven Delphi protocol using diverse expert personas. By evaluating calibration against verifiable proxies, sensitivity to evidence, and alignment with human judgments in AI-augmented cybersecurity risk, the authors demonstrate that LLM panels achieve high correlations with ground-truth benchmarks ($r$ in the range $0.87$ to $0.95$) and improve as evidence accumulates. The method also aligns with independent human expert panels, sometimes more closely than human panels align with each other, while dramatically reducing elicitation time from months to minutes. This suggests LLM-based elicitation can extend structured risk assessment to domains where traditional methods are infeasible, enabling scalable, auditable, and rapidly updatable risk models with broad applicability beyond cybersecurity.

Abstract

Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard, the Delphi method, produces calibrated, auditable judgments but requires months of coordination and specialist time, placing rigorous risk assessment out of reach for most applications. We investigate whether Large Language Models (LLMs) can serve as scalable proxies for structured expert elicitation. We propose Scalable Delphi, which adapts the classical protocol for LLMs with diverse expert personas, iterative refinement, and rationale sharing. Because target quantities are typically unobservable, we develop an evaluation framework based on necessary conditions: calibration against verifiable proxies, sensitivity to evidence, and alignment with human expert judgment. We evaluate in the domain of AI-augmented cybersecurity risk, using three capability benchmarks and independent human elicitation studies. LLM panels achieve strong correlations with benchmark ground truth (Pearson $r = 0.87$ to $0.95$), improve systematically as evidence is added, and align with human expert panels; in one comparison, the LLM panel is closer to a human panel than the two human panels are to each other. This demonstrates that LLM-based elicitation can extend structured expert judgment to settings where traditional methods are infeasible, reducing elicitation time from months to minutes.
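The calibration condition named in the abstract can be illustrated with a minimal sketch: average the persona estimates for each task into a panel mean, then compute the Pearson correlation between those means and the benchmark success rates. The function names (`panel_means`, `pearson_r`) and all numbers below are hypothetical placeholders chosen for illustration; they are not the authors' implementation or results.

```python
from math import sqrt
from statistics import mean


def panel_means(estimates_by_task):
    """Average the per-persona estimates for each task (hypothetical aggregation rule)."""
    return [mean(estimates) for estimates in estimates_by_task.values()]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Toy data only: three tasks, three persona estimates each (made-up values).
estimates = {
    "task_a": [0.80, 0.70, 0.90],
    "task_b": [0.40, 0.50, 0.30],
    "task_c": [0.10, 0.20, 0.15],
}
# Made-up benchmark success rates for the same three tasks, in the same order.
ground_truth = [0.75, 0.45, 0.05]

r = pearson_r(panel_means(estimates), ground_truth)
print(f"calibration (Pearson r) on toy data: {r:.2f}")
```

A high value of `r` here only shows that the panel ranks and spaces tasks similarly to the proxy; as the abstract notes, this is a necessary condition rather than proof that the unobservable target quantity is estimated correctly.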

Paper Structure (25 sections, 2 equations, 3 figures, 4 tables)

Introduction
Scalable Delphi Method
The Estimation Task
Elicitation Protocol
Design Rationale
Evaluation Framework
Necessary Conditions
Corroborating Evidence
Experiments
Experimental Setup
Calibration
Evidence Sensitivity
Qualitative Analysis
Expert Alignment
Discussion and Limitations
...and 10 more sections

Figures (3)

Figure 1: Calibration: predicted vs. actual success rates. Top: scatter plots with mean estimates. Bottom: summary statistics. Dashed line indicates perfect calibration.
Figure 2: Evidence sensitivity: Pearson correlation with ground truth across information conditions. Performance increases as decision-relevant information is added, confirming estimates reflect reasoning about provided evidence.
Figure 3: Expert alignment: LLM estimates compared to human expert panels from murray2025mapping. Tasks ordered by difficulty (easy to hard) based on human first-solve time. Bars show panel means; error bars indicate 95% confidence intervals. All estimates after the final Delphi round.
