Agent-Supported Foresight for AI Systemic Risks: AI Agents for Breadth, Experts for Judgment

Leon Fröhling; Alessandro Giaconia; Edyta Paulina Bogucka; Daniele Quercia

TL;DR

Facing the Collingridge dilemma, the paper proposes agent-supported foresight to surface long-term AI systemic risks. It combines the Futures Wheel with Plurals-based in-silico agents to generate cascading consequences across four AI uses representing a range of TRLs, then benchmarks agent outputs against domain experts and laypeople. The results show agents broaden risk coverage and surface many systemic risks, while humans provide grounding, context, and prioritization, supporting a hybrid foresight workflow. The work advances scalable, inclusive foresight methodologies and provides a public dataset and pipeline to inform AI governance.

Abstract

AI impact assessments often stress near-term risks because human judgment degrades over longer horizons, exemplifying the Collingridge dilemma: foresight is most needed when knowledge is scarcest. To address long-term systemic risks, we introduce a scalable approach that simulates in-silico agents using the strategic foresight method of the Futures Wheel. We applied it to four AI uses spanning Technology Readiness Levels (TRLs): Chatbot Companion (TRL 9, mature), AI Toy (TRL 7, medium), Griefbot (TRL 5, low), and Death App (TRL 2, conceptual). Across 30 agent runs per use, agents produced 86-110 consequences, condensed into 27-47 unique risks. To benchmark the agent outputs against human perspectives, we collected evaluations from 290 domain experts and 7 domain leaders, and conducted Futures Wheel sessions with 42 experts and 42 laypeople. Agents generated many systemic consequences across runs. Compared with these outputs, experts identified fewer risks, typically less systemic but judged more likely, whereas laypeople surfaced more emotionally salient concerns that were generally less systemic. We propose a hybrid foresight workflow, wherein agents broaden systemic coverage, and humans provide contextual grounding. Our dataset is available at: https://social-dynamics.net/ai-risks/foresight.

Paper Structure (39 sections, 14 figures, 31 tables)

Introduction
Related Work
Limits of Human Foresight
Foresight Methods for AI
AI Risk Research
Author Positionality Statement
Methodology
Selecting AI Use Cases
Selecting the Futures Wheel as the Strategic Foresight Method to Identify Risks
Selecting Plurals as the LLM-Based Simulation Framework to Generate Risks
Developing a Pipeline to Combine Foresight and Generate Risks
Developing a Rubric to Evaluate the Generated Risks
Using the Rubric to Evaluate the Generated Risks
Results
RQ1: Can in-silico agents generate systemic risks of sufficient quality to support foresight?
...and 24 more sections

Figures (14)

Figure 1: Overview of our five-step approach for generating and evaluating systemic risks of novel AI uses. The process combines strategic foresight (via the Futures Wheel), LLM-based agentic simulation (via Plurals), and a structured evaluation rubric. Together, these steps allow us to generate systemic risks across AI use cases of varying technological maturity, and to compare agent-generated risks against human-identified ones.
Figure 2: Simulation of the Futures Wheel within our systemic risk pipeline. We adapt the Futures Wheel for use with six in-silico agents to move from initial AI use cases to consolidated lists of systemic risks. Agents first generate multi-order systemic consequences (Step 1), which are then classified into risks or benefits (Step 2), and finally deduplicated into non-redundant sets (Step 3). This process mirrors structured human foresight while ensuring diversity of outputs across agents and producing consistent, machine-generated inputs for later evaluation.
Figure 3: Futures Wheel interface used to collect human-ideated risks in both the human-only and human–AI collaboration conditions. The left panel (A) provides a structured description of the studied AI use case, following ISO 42005 guidelines by outlining its intended function and users, context of use, known limitations, and deployment environment. At the center, the focal use case is displayed together with round-specific prompts shown in pop-ups (B). Participants brainstorm first-order consequences, which then branch into second- and third-order consequences (C), allowing cascading impacts to be visualized across multiple Futures Wheel rounds. The right panel (D) provides optional support through a chat window for exploring potentially missing consequences, and a button to generate some of them automatically with AI. Interface elements A–C were used in both conditions, while D appeared only in the human–AI collaboration condition.
Figure 4: Example annotation card shown to domain experts for evaluating the risks generated by in-silico agents. The left panel illustrates a sample risk statement (i.e., "Overreliance on AI support reduces human-centered care quality", A), its potential impact (B), the AI use case it belongs to (i.e., "Chatbot companion", C), and a brief definition of systemic risk (D). The right panel (E) presents the evaluation metric from our rubric. It includes questions on likelihood, severity, and systemic classification, as well as ten risk characteristics organized across four dimensions: specificity, novelty, usability, and applicability.
Figure 5: Ratings of systemic risks generated by in-silico agents and humans across AI use cases and quality subdimensions. In-silico agents produced substantially more systemic risks than humans, who generated comparatively few. Boxplots with mean scores are shown for risks associated with four AI use cases: Chatbot Companion (C, TRL 9), AI Toy (T, TRL 7), Griefbot (G, TRL 5), and Death App (D, TRL 2). Each colored bar reflects the average domain expert rating (pink bar) or domain leader rating (blue bar) on a 5-point Likert scale across ten evaluation subdimensions from our rubric (see the section "Developing a Rubric to Evaluate the Generated Risks"). Domain experts generally rated the risks generated by in-silico agents as connected to the AI use, plausible, and moderately usable. Dimensions such as specificity, novelty, and originality received more varied ratings across use cases, especially for less mature concepts such as the Death App and Griefbot. The appendix on quantitative risk evaluation and its tables report the statistical tests and quantitative comparisons.
...and 9 more figures
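
The three-step process described in the Figure 2 caption (generate multi-order consequences, classify them into risks or benefits, deduplicate) can be illustrated with a minimal Python sketch. This is not the authors' pipeline and does not use the Plurals API: the names `Consequence`, `futures_wheel`, `classify`, `deduplicate`, and `stub_generate` are hypothetical, the two agent personas are placeholders for the six in-silico agents, and the stub generator stands in for an LLM call. Deduplication here is plain text normalization, which is an assumed simplification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Consequence:
    text: str
    order: int   # 1 = first-order, 2 = second-order, ...
    parent: str  # statement this consequence branches from


# Hypothetical signature: (agent persona, parent statement) -> child statements
Generator = Callable[[str, str], List[str]]


def futures_wheel(use_case: str, agents: List[str], generate: Generator,
                  max_order: int = 3) -> List[Consequence]:
    """Step 1: expand the use case into cascading, multi-order consequences."""
    frontier = [use_case]
    collected: List[Consequence] = []
    for order in range(1, max_order + 1):
        next_frontier: List[str] = []
        for parent in frontier:
            for agent in agents:
                for text in generate(agent, parent):
                    collected.append(Consequence(text, order, parent))
                    next_frontier.append(text)
        frontier = next_frontier
    return collected


def classify(consequences: List[Consequence],
             is_risk: Callable[[str], bool]) -> List[Consequence]:
    """Step 2: keep only the consequences labeled as risks."""
    return [c for c in consequences if is_risk(c.text)]


def deduplicate(risks: List[Consequence]) -> List[Consequence]:
    """Step 3: drop repeats, keeping the first occurrence of each statement.

    Assumption: case- and whitespace-insensitive matching stands in for
    whatever semantic deduplication a real pipeline would use.
    """
    seen = set()
    unique: List[Consequence] = []
    for risk in risks:
        key = " ".join(risk.text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(risk)
    return unique


def stub_generate(agent: str, parent: str) -> List[str]:
    """Deterministic placeholder for an LLM-backed agent (hypothetical)."""
    if agent == "ethicist":
        return ["Risk: erosion of human contact"]
    return ["Risk: erosion of  human contact", "Benefit: wider access to support"]


consequences = futures_wheel("Chatbot Companion", ["ethicist", "economist"],
                             stub_generate, max_order=2)
risks = classify(consequences, lambda t: t.lower().startswith("risk"))
unique = deduplicate(risks)
print(len(consequences), len(risks), len(unique))
```

With the stub generator, the two-round wheel yields many overlapping statements that collapse to a single unique risk, which mirrors in miniature how repeated agent runs produce many consequences that are then condensed into a smaller set of unique risks.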
