LLM-Generated Counterfactual Stress Scenarios for Portfolio Risk Simulation via Hybrid Prompt-RAG Pipeline

Masoud Soleimani

TL;DR

This work presents a transparent, auditable pipeline that combines retrieval-augmented LLMs with structured macro grounding to generate G7 macro scenarios and translate them into portfolio tail risk via a three-channel PCA-based framework. It demonstrates that prompt design and portfolio composition largely drive tail-risk variation, while retrieval and news provide only modest adjustments, yielding moderate but material tail amplification relative to historical baselines. The study introduces extensive plausibility checks, regime tagging, dispersion diagnostics, and snapshot-based reproducibility to support supervisory use. Overall, LLM-generated macro scenarios can scale and diversify stress narratives in a governance-friendly manner when paired with explicit structure, validation, and human oversight.

Abstract

We develop a transparent and fully auditable LLM-based pipeline for macro-financial stress testing, combining structured prompting with optional retrieval of country fundamentals and news. The system generates machine-readable macroeconomic scenarios for the G7, which cover GDP growth, inflation, and policy rates, and are translated into portfolio losses through a factor-based mapping that enables Value-at-Risk and Expected Shortfall assessment relative to classical econometric baselines. Across models, countries, and retrieval settings, the LLMs produce coherent and country-specific stress narratives, yielding stable tail-risk amplification with limited sensitivity to retrieval choices. Comprehensive plausibility checks, scenario diagnostics, and ANOVA-based variance decomposition show that risk variation is driven primarily by portfolio composition and prompt design rather than by the retrieval mechanism. The pipeline incorporates snapshotting, deterministic modes, and hash-verified artifacts to ensure reproducibility and auditability. Overall, the results demonstrate that LLM-generated macro scenarios, when paired with transparent structure and rigorous validation, can provide a scalable and interpretable complement to traditional stress-testing frameworks.
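The Value-at-Risk and Expected Shortfall assessment mentioned above can be illustrated with a minimal empirical estimator. This is a sketch only: the function name `var_es`, the 95% confidence level, and the sign convention (positive numbers are losses) are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def var_es(losses, alpha=0.95):
    """Empirical VaR and ES at level alpha.

    Assumed convention (not from the paper): `losses` holds portfolio
    losses, so larger positive values are worse outcomes.
    """
    losses = np.asarray(losses, dtype=float)
    # VaR: the alpha-quantile of the loss distribution.
    var = float(np.quantile(losses, alpha))
    # ES: the average loss in the tail at or beyond VaR.
    es = float(losses[losses >= var].mean())
    return var, es


# Toy loss sample (hypothetical data): the integers 1..100.
var_95, es_95 = var_es(np.arange(1, 101))
```

A tail-risk multiple of the kind reported in the paper would then be the ratio of the scenario ES to the ES of a baseline loss sample computed the same way.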

Paper Structure

This paper contains 76 sections, 10 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Pipeline for scenario generation and risk translation. IMF fundamentals and optional news are embedded with MiniLM [wang2020minilm] and retrieved via FAISS [douze2025faiss] to condition the prompt. LLMs output structured JSON scenarios (GDP, inflation, interest rates, rationale, and sector-level exposures), which are screened by hard and soft plausibility gates and tagged with a regime label. Scenario shocks are mapped to asset returns through three channels: (i) a pure volatility channel that scales the covariance matrix, (ii) a linear PCA factor channel, and (iii) a nonlinear polynomial factor channel that contains all text/RAG/news amplification. Regime severity $\lambda$ mixes calm and crisis covariance matrices, and all channels are benchmarked against deterministic, LLM-free baselines. Ablations toggle model type (GPT-5-mini vs. Llama-3.1-8B-Instruct), retrieval (RAG on/off), and news augmentation (on/off).
  • Figure 2: Violin plots of GDP, inflation, and interest rate shocks (percentage points) for all accepted G7 scenarios in the deterministic GPT-5-mini run ($N=627$). Each panel shows the full distribution by country; medians and interquartile ranges correspond to the summary statistics in Table \ref{tab:macro-summary}.
  • Figure 3: Comparison of average macro shock severity (left; mean absolute GDP, inflation, and interest rate shocks) and linear CVaR multiples (right) by model (GPT-5-mini vs. Llama-3.1-8B-Instruct), pooling over all countries and configurations (see Tables \ref{tab:severity-model} and \ref{tab:risk-crossrun}).
  • Figure 4: Boxplots of linear CVaR multiples for Portfolio A by country, pooling over all model/RAG/news configurations in the deterministic run. Values are expressed as multiples relative to the historical-bootstrap baseline; see Table \ref{tab:risk-crossrun}.
  • Figure 5: Scenario-induced CVaR multiples for Portfolio A plotted against inflation shocks (percentage points) across the three translation channels: volatility (left), linear (centre), and nonlinear (right). Each point is a scenario; colours (in the online version) indicate country. The weak relationship across all three panels highlights that tail risk emerges from the joint macro shock vector, regime mixing, and portfolio composition rather than inflation in isolation.
  • ...and 3 more figures
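
The regime-mixing step described in the Figure 1 caption can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes the mix is a convex combination $(1-\lambda)\Sigma_{\text{calm}} + \lambda\Sigma_{\text{crisis}}$ with $\lambda \in [0, 1]$, and it assumes zero-mean Gaussian returns so that CVaR is proportional to portfolio volatility, whereas the paper benchmarks against a historical-bootstrap baseline. All function names and the toy matrices are hypothetical.

```python
import numpy as np


def mix_covariance(sigma_calm, sigma_crisis, lam):
    """Assumed convex mix of calm and crisis covariance matrices."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * sigma_calm + lam * sigma_crisis


def portfolio_vol(weights, sigma):
    """Portfolio volatility sqrt(w' Sigma w)."""
    return float(np.sqrt(weights @ sigma @ weights))


def gaussian_cvar_multiple(weights, sigma_calm, sigma_crisis, lam):
    """CVaR multiple relative to the calm regime.

    Under the zero-mean Gaussian assumption, CVaR scales linearly with
    volatility, so the multiple reduces to a ratio of volatilities.
    """
    sigma_mix = mix_covariance(sigma_calm, sigma_crisis, lam)
    return portfolio_vol(weights, sigma_mix) / portfolio_vol(weights, sigma_calm)


# Toy two-asset example (hypothetical numbers).
w = np.array([0.5, 0.5])
calm = np.diag([0.01, 0.01])
crisis = 4.0 * calm  # crisis variances four times the calm ones
multiple = gaussian_cvar_multiple(w, calm, crisis, lam=0.5)
```

In this toy setup the multiple equals 1 at $\lambda = 0$ and 2 at $\lambda = 1$, which shows how regime severity alone can amplify tail risk before any factor-channel shocks are applied.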