
Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization

Shiyan Liu, Qifeng Xia, Qiyun Xia, Yisheng Liu, Xinyu Yu, Rui Qu

Abstract

Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations; for example, on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and an interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further helps escape local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.

Paper Structure (57 sections, 8 equations, 15 figures, 3 tables, 1 algorithm)


Figures (15)

  • Figure 1: A representative failure of GEPA under the defective seed.
  • Figure 2: Conceptual illustration of optimization trajectories under the defective seed.
  • Figure 3: Four systematic limitations of reflective APO. L1–L3 form a causal chain, and L4 can apply even when optimization succeeds.
  • Figure 4: A case of attribution blindspot.
  • Figure 5: GEPA attribution distribution on GSM8K (Qwen3-4B base, Qwen3-8B/GPT-4o-mini reflectors).
  • ...and 10 more figures