Scaling Textual Gradients via Sampling-Based Momentum

Zixin Ding, Junyuan Hong, Zhan Shi, Jiachen T. Wang, Zinan Lin, Li Yin, Meng Liu, Zhangyang Wang, Yuxin Chen

TL;DR

The paper tackles the scalability challenges of prompt optimization via textual gradients by identifying two obstacles: explicit context-length limits and implicit long-context degradation. It introduces Textual Stochastic Gradient Descent with Momentum (TSGD-M), which uses Gumbel-Top-$k$ sampling and momentum over past prompts to stabilize updates while controlling context growth. Efficient inference and validation are achieved through blockwise generation and minibatch-based running-mean estimators, enabling reliable scaling across multiple benchmarks. Empirical results demonstrate consistent gains over baseline methods (e.g., TextGrad, COPRO, AdalFlow), and ablations isolate the contributions of sampling, momentum, and validation strategies. The approach is framework-agnostic and offers a practical path toward robust, scalable automatic prompt engineering in real-world systems.

Abstract

LLM-based prompt optimization, which uses LLM-provided "textual gradients" (feedback) to refine prompts, has emerged as an effective method for automatic prompt engineering. However, its scalability and stability are unclear when more training data is used. We systematically investigate the potential and challenges of scaling training data in textual gradient descent. We show that naively scaling training examples is infeasible due to both explicit context-length limits and an implicit context wall, where long-context degradation yields diminishing returns. Inspired by prior wisdom in stochastic gradient descent, we propose Textual Stochastic Gradient Descent with Momentum (TSGD-M), which reweights updates through momentum sampling, using bootstrapped minibatch validation accuracy as importance weights over historical prompts. We introduce Gumbel-Top-$k$ sampling for prompt generation, balancing exploration and exploitation and improving sampling efficiency while maintaining a low-variance running mean estimator. TSGD-M integrates seamlessly into existing prompt optimization frameworks, including TextGrad, DSPy-COPRO, and AdalFlow, and achieves consistent gains across 5 benchmarks.
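The running-mean estimator mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `RunningMeanScores`, its methods, and the assumption that every minibatch has equal size (so each batch accuracy gets equal weight) are hypothetical choices made for illustration only.

```python
from collections import defaultdict


class RunningMeanScores:
    """Hypothetical helper: running mean of minibatch validation accuracy per prompt.

    Assumes equally sized minibatches, so every batch accuracy is weighted equally.
    """

    def __init__(self):
        self._mean = defaultdict(float)  # current running mean per prompt
        self._count = defaultdict(int)   # number of minibatches seen per prompt

    def update(self, prompt, batch_accuracy):
        """Fold one minibatch accuracy into the prompt's running mean."""
        self._count[prompt] += 1
        n = self._count[prompt]
        # Incremental mean: m_n = m_{n-1} + (x_n - m_{n-1}) / n
        self._mean[prompt] += (batch_accuracy - self._mean[prompt]) / n
        return self._mean[prompt]

    def mean(self, prompt, default=0.0):
        """Return the current estimate without creating a new entry."""
        return self._mean.get(prompt, default)


scores = RunningMeanScores()
scores.update("prompt A", 1.0)
scores.update("prompt A", 0.5)
print(scores.mean("prompt A"))
```

Averaging over several small validation minibatches reduces the variance of each prompt's score relative to a single-minibatch estimate, which is the property the abstract attributes to the running mean. These per-prompt means could then act as the importance weights over historical prompts.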

Paper Structure

This paper contains 34 sections, 6 theorems, 28 equations, 11 figures, 6 tables, 4 algorithms.

Key Result

Proposition 1

Given arbitrary real-valued scores $s_i$ ($i\in\{1,\dots,n\}$), $k\le n$, and $\beta>0$, if $\epsilon_i \sim \mathrm{Gumbel}(0;\beta^{-1})$ i.i.d., then $\operatorname{arg\,top}_k \{s_i + \epsilon_i\}_{i}$ is an ordered sample without replacement from $\mathrm{Categorical}\!\left(\frac{e^{\beta s_i}}{\sum_j e^{\beta s_j}}\right)$.
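Proposition 1 translates directly into a few lines of code: perturb each score $s_i$ with independent Gumbel noise of scale $\beta^{-1}$ and keep the indices of the $k$ largest perturbed values. The sketch below uses NumPy; the function name `gumbel_top_k` and its signature are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def gumbel_top_k(scores, k, beta=1.0, rng=None):
    """Sample k distinct indices, in order, without replacement from
    Categorical(softmax(beta * scores)) using the Gumbel-Top-k trick."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    if not 1 <= k <= scores.size:
        raise ValueError("k must satisfy 1 <= k <= n")
    if beta <= 0:
        raise ValueError("beta must be positive")
    # epsilon_i ~ Gumbel(location 0, scale 1/beta), drawn i.i.d.
    noise = rng.gumbel(loc=0.0, scale=1.0 / beta, size=scores.shape)
    perturbed = scores + noise
    # Indices of the k largest perturbed scores, largest first.
    return np.argsort(-perturbed)[:k]


# Example: treat hypothetical validation accuracies as scores.
accuracies = [0.62, 0.71, 0.55, 0.68]
picked = gumbel_top_k(accuracies, k=2, beta=10.0, rng=np.random.default_rng(0))
print(picked)
```

A larger $\beta$ shrinks the noise and concentrates the sample on the highest-scoring items (exploitation), while a smaller $\beta$ spreads probability more evenly across items (exploration).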

Figures (11)

  • Figure 1: Comparison of different variants of TSGD. The standard TSGD update generates a new prompt from the last prompt and gradient. The momentum in TextGrad concatenates the past prompts in context to infer the next prompt. Our method upweights historical prompts with higher validation accuracy during sampling and uses only a single pair of a past prompt and a past gradient to infer the next block of tokens in the next prompt.
  • Figure 2: Scaling of TGD/TSGD on MATH. The gray line shows the average initial test accuracy across all dataset–batch-size combinations. Left: Comparing test accuracy under different data and batch sizes, TGD (full-batch TSGD) cannot scale to larger data sizes, while minibatch TSGD enables scaling. The dashed horizontal line represents the initial accuracy. Right: With a fixed data size of 200 and a seed of 1, we vary only the batch size. Small batch sizes show larger oscillations, whereas larger batch sizes iterate more smoothly. An overly large batch size cannot improve the prompts.
  • Figure 3: Upon scaling training data, TextGrad-M outperforms TextGrad on the MATH task with a batch size of 5.
  • Figure 4: Average test accuracy over iterations on MATH. TextGrad presents a declining trend without validation revert and shows an unchanging pattern with validation revert due to a lack of exploration of distinct prompts.
  • Figure 5: Test performance of vanilla TextGrad-Momentum and TextGrad-M on MATH with the same window size. Error bars show the standard error.
  • ...and 6 more figures

Theorems & Definitions (10)

  • Proposition 1: Gumbel-Top-$k$ [kirschstochastic]
  • Theorem 3.1: Bayes optimality of posterior-mean argmax under equal precision
  • proof
  • Theorem 3.2: Information dominance of decisions using the fresh batch
  • proof
  • Corollary 3.1: Deterministic exploitation dominates given the fresh batch
  • proof
  • Corollary 3.2: Design implication for our loop
  • Theorem 3.3: TSGD-M (Blockwise) has no larger variance compared to TSGD
  • proof