Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison

Yoonho Lee, Joseph Boen, Chelsea Finn

TL;DR

Feedback Descent tackles open-ended text optimization by replacing scalar rewards with rich, structured textual feedback, enabling gradient-like updates in semantic space without weight changes. At each iteration, a language model proposes a refined artifact, and a separate evaluator returns a binary preference plus an explanatory rationale; the rationales accumulate into a history of directional cues that guide subsequent edits. The approach is demonstrated across three domains (SVG design, prompt optimization, and DOCKSTRING molecule discovery), where it achieves competitive or superior results relative to domain-specific baselines such as GEPA and specialized molecular optimizers. It also formalizes the notion that textual rationales can enable dimension-free progress under favorable conditions, and shows that inference-time, text-based optimization can yield novel, high-quality solutions with practical impact in design, language tasks, and drug discovery; molecules, for example, are ranked by the score $s = -\text{Vina} - 10 \times (1 - \text{QED})$. Overall, Feedback Descent presents a versatile, domain-agnostic framework for continual improvement of text-representable artifacts through high-bandwidth, rationale-guided feedback.
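As a worked illustration of the ranking score $s = -\text{Vina} - 10 \times (1 - \text{QED})$ quoted above, the sketch below evaluates it for two made-up molecules. The function name `combined_score`, the molecule labels, and the numeric inputs are hypothetical and are not taken from the paper; the sketch only assumes that a more negative Vina docking score and a QED closer to 1 are both desirable, so a larger $s$ is better.

```python
def combined_score(vina: float, qed: float) -> float:
    """Return s = -Vina - 10 * (1 - QED); larger values are better."""
    if not 0.0 <= qed <= 1.0:
        raise ValueError("QED is expected to lie in [0, 1]")
    return -vina - 10.0 * (1.0 - qed)


# Hypothetical (Vina, QED) pairs, chosen only to show the trade-off.
candidates = {
    "mol_a": (-9.0, 0.5),  # stronger docking score, less drug-like
    "mol_b": (-8.0, 0.9),  # weaker docking score, more drug-like
}

# Rank candidates from best to worst under the combined score.
ranked = sorted(
    candidates,
    key=lambda name: combined_score(*candidates[name]),
    reverse=True,
)
```

In this toy case `mol_b` ranks first: the QED penalty on `mol_a` (10 × 0.5 = 5) outweighs its one-unit advantage in docking score.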

Abstract

We introduce Feedback Descent, a framework that optimizes text artifacts (prompts, code, and molecules) through structured textual feedback, rather than relying solely on scalar rewards. By preserving detailed critiques instead of compressing them to binary preferences, Feedback Descent widens the information bottleneck in preference learning, enabling directed optimization in text space rather than weight space. We show that in-context learning can transform structured feedback into gradient-like directional information, enabling targeted edits. Unlike prior approaches that collapse judgments into single bits, our evaluators pair each comparison with textual feedback, which functions as high-bandwidth supervision. The iteration loop is done purely at inference time, without modifying any model weights, and is task-agnostic. We evaluate Feedback Descent on three diverse domains and find that it outperforms state-of-the-art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph-based molecular optimizers. In the DOCKSTRING molecule discovery benchmark, Feedback Descent identifies novel drug-like molecules surpassing the $99.9$th percentile of a database with more than $260{,}000$ compounds across six protein targets.
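The loop described in the abstract (propose a candidate, compare it against the current best, keep the winner, and accumulate the textual feedback) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `feedback_descent`, `toy_propose`, and `toy_compare` are hypothetical names, and the keyword-matching stand-ins replace what would be language-model calls in practice.

```python
from typing import Callable, List, Tuple

Feedback = Tuple[bool, str]  # (candidate_preferred, rationale)


def feedback_descent(
    initial: str,
    propose: Callable[[str, List[str]], str],
    compare: Callable[[str, str], Feedback],
    num_iters: int,
) -> Tuple[str, List[str]]:
    """Keep the preferred artifact and accumulate every rationale."""
    best = initial
    history: List[str] = []
    for _ in range(num_iters):
        candidate = propose(best, history)          # edit guided by past feedback
        preferred, rationale = compare(best, candidate)
        history.append(rationale)                   # feedback is kept either way
        if preferred:
            best = candidate                        # the preference selects the winner
    return best, history


# Toy stand-ins (assumptions): the "judge" wants three features mentioned.
REQUIRED = ["horn", "mane", "tail"]


def toy_compare(best: str, candidate: str) -> Feedback:
    missing = [w for w in REQUIRED if w not in candidate]
    preferred = sum(w in candidate for w in REQUIRED) > sum(w in best for w in REQUIRED)
    rationale = f"missing: {missing[0]}" if missing else "nothing missing"
    return preferred, rationale


def toy_propose(best: str, history: List[str]) -> str:
    # Use the most recent rationale as a directional cue for the next edit.
    if history and history[-1].startswith("missing: "):
        return best + " " + history[-1][len("missing: "):]
    return best + " horn"


best, history = feedback_descent("unicorn", toy_propose, toy_compare, num_iters=4)
```

Here the rationale tells the proposer which feature to add next, so each accepted edit is targeted rather than random; a binary preference alone would only say whether a change helped.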

Paper Structure

This paper contains 22 sections, 3 theorems, 19 equations, 9 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

Let $r: Z \to \mathbb{R}$ be $L$-smooth and satisfy the $\mu$-PL condition (for maximization). At iteration $t$, suppose a direction $v_t$ satisfies with constants $\alpha>0$ and $\sigma \ge 0$, and define $\kappa_1 \triangleq \alpha^2+\sigma^2$. Consider the update $z_{t+1}=z_t+\eta v_t$. If a constraint set $Z$ is present, assume $z_t+\eta v_t\in Z$ (i.e., the projection is inactive). With step
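To give a feel for the update $z_{t+1}=z_t+\eta v_t$, the sketch below runs it on a toy objective. Everything specific here is an assumption made for illustration and is not the proposition's own setup: the objective $r(z) = -\tfrac{1}{2}\|z\|^2$ (which is 1-smooth and satisfies the PL condition with $\mu = 1$), the direction model (gradient scaled by `alpha` plus Gaussian noise whose size scales with `sigma` and the gradient norm), and the fixed step `eta = 0.5`. The conditions and step size elided from the statement above are not reconstructed.

```python
import math
import random


def reward(z):
    """Toy objective r(z) = -0.5 * ||z||^2, maximized at z = 0 with r* = 0."""
    return -0.5 * sum(x * x for x in z)


def gradient(z):
    return [-x for x in z]


def noisy_direction(z, alpha, sigma, rng):
    """Assumed direction model: alpha * gradient plus gradient-scaled noise."""
    g = gradient(z)
    g_norm = math.sqrt(sum(x * x for x in g))
    d = len(z)
    noise = [rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)]
    return [alpha * gi + sigma * g_norm * ni for gi, ni in zip(g, noise)]


def run(z0, eta, alpha, sigma, steps, seed=0):
    """Apply z <- z + eta * v and record the optimality gap r* - r(z)."""
    rng = random.Random(seed)
    z = list(z0)
    gaps = [-reward(z)]
    for _ in range(steps):
        v = noisy_direction(z, alpha, sigma, rng)
        z = [zi + eta * vi for zi, vi in zip(z, v)]
        gaps.append(-reward(z))
    return gaps


gaps = run(z0=[1.0] * 10, eta=0.5, alpha=1.0, sigma=0.5, steps=50)
```

Under these assumptions the expected gap shrinks by a constant factor per step, which is the geometric (linear) decay pattern the proposition's title refers to; plotting `gaps` on a log scale makes this visible.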

Figures (9)

  • Figure 1: A conceptual illustration of feedback descent. At each iteration, we compare the previous best artifact with a new candidate. The evaluator provides both a pairwise preference and textual feedback. Preferences ensure the selection of better candidates, while feedback accumulates directional information that guides semantically meaningful edits.
  • Figure 2: Iterative progression of SVG unicorn optimization under the realism judge. Feedback Descent produces gradual, semantically meaningful improvements through accumulating directional cues.
  • Figure 3: Example images generated by Feedback Descent under six different judge criteria. Feedback Descent yields visually distinct objects aligned with the aesthetic criteria preferred by each judge.
  • Figure 4: DOCKSTRING scores for ADRB1, PGR, and PPARG with Feedback Descent at varying feedback noise levels. Performance degrades gracefully with increasing noise.
  • Figure 5: Optimization trajectories on PPARG showing docking scores over oracle calls for Feedback Descent and specialized baselines. Feedback Descent quickly improves molecular docking scores within the first few hundred oracle calls.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Proposition 1: Linear convergence under PL with rationale-guided directions
  • proof
  • Proposition 2: Grid-search lower bound
  • proof
  • Proposition 3: Best-of-$N$ random sampling lower bound
  • proof