Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization

Shiyan Liu; Qifeng Xia; Qiyun Xia; Yisheng Liu; Xinyu Yu; Rui Qu

Abstract

Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failures. We identify and empirically demonstrate four limitations of this approach; for example, on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and an interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further helps the search escape local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
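To make the explore-exploit idea concrete, the sketch below shows generic epsilon-greedy selection paired with a stagnation-based restart trigger. It is only an illustration of the two named techniques and does not reproduce VISTA's procedure: the function names `select_candidate` and `should_restart`, the `patience` parameter, and the restart criterion are all assumptions, since the abstract does not specify them.

```python
import random


def select_candidate(scores, epsilon, rng):
    """Epsilon-greedy choice over candidate scores (hypothetical helper).

    With probability `epsilon` return a uniformly random index (explore);
    otherwise return the index of the highest score (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=scores.__getitem__)


def should_restart(history, patience):
    """Assumed restart rule: fire when the best score has not improved
    over the most recent `patience` rounds."""
    if len(history) <= patience:
        return False
    return max(history[-patience:]) <= max(history[:-patience])


rng = random.Random(0)
greedy_pick = select_candidate([0.1, 0.9, 0.5], epsilon=0.0, rng=rng)
explore_pick = select_candidate([0.1, 0.9, 0.5], epsilon=1.0, rng=rng)
stalled = should_restart([0.2, 0.5, 0.5, 0.5], patience=2)
improving = should_restart([0.2, 0.3, 0.6], patience=2)
```

With `epsilon=0.0` the selection is purely greedy and with `epsilon=1.0` it is purely random; intermediate values trade off the two. In this sketch a restart would discard the current candidates and resample a fresh starting prompt, which is one common reading of "random restart".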

Paper Structure (57 sections, 8 equations, 15 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Automatic Prompt Optimization
LLM Self-Correction and Its Limits
Diagnosing the Black Box: Four Limitations
L1: Seed Trap
L2: Attribution Blindspot
L3: Trajectory Opacity
L4: Transfer Fragility
Proposed VISTA Framework
Overview
Hypothesis Generation
Semantic Trace
Two-Layer Explore-Exploit
Layer 1: Random Restart
...and 42 more sections

Figures (15)

Figure 1: A representative failure of GEPA under the defective seed.
Figure 2: Conceptual illustration of optimization trajectories under the defective seed.
Figure 3: Four systematic limitations of reflective APO. L1--L3 form a causal chain, and L4 can apply even when optimization succeeds.
Figure 4: A case of attribution blindspot.
Figure 5: GEPA attribution distribution on GSM8K (Qwen3-4B base, Qwen3-8B/GPT-4o-mini reflectors).
...and 10 more figures
