Scaling Textual Gradients via Sampling-Based Momentum
Zixin Ding, Junyuan Hong, Zhan Shi, Jiachen T. Wang, Zinan Lin, Li Yin, Meng Liu, Zhangyang Wang, Yuxin Chen
TL;DR
The paper tackles the scalability challenges of prompt optimization via textual gradients by identifying explicit context-length limits and implicit long-context degradation. It introduces Textual Stochastic Gradient Descent with Momentum (TSGD-M), which uses Gumbel-Top-$k$ sampling and momentum over past prompts to stabilize updates while controlling context growth. Blockwise generation and minibatch-based running-mean estimators keep inference and validation efficient, enabling reliable scaling across multiple benchmarks. Empirical results demonstrate consistent gains over baseline methods (e.g., TextGrad, COPRO, AdalFlow), and ablations isolate the contributions of sampling, momentum, and validation strategies. The approach is framework-agnostic and offers a practical path toward robust, scalable automatic prompt engineering in real-world systems.
Abstract
LLM-based prompt optimization, which uses LLM-provided "textual gradients" (feedback) to refine prompts, has emerged as an effective method for automatic prompt engineering. However, its scalability and stability are unclear when more training data is used. We systematically investigate the potential and challenges of scaling training data in textual gradient descent. We show that naively scaling training examples is infeasible due to both explicit context-length limits and an implicit context wall, where long-context degradation yields diminishing returns. Inspired by prior wisdom in stochastic gradient descent, we propose Textual Stochastic Gradient Descent with Momentum (TSGD-M), which reweights updates through momentum sampling, using bootstrapped minibatch validation accuracy as importance weights over historical prompts. We introduce Gumbel-Top-$k$ sampling for prompt generation, balancing exploration and exploitation and improving sampling efficiency while maintaining a low-variance running-mean estimator. TSGD-M integrates seamlessly into existing prompt optimization frameworks, including TextGrad, DSPy-COPRO, and AdalFlow, and achieves consistent gains across 5 benchmarks.
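To make the sampling step concrete, the following is a minimal sketch of Gumbel-Top-$k$ selection over historical prompts, not the authors' implementation. It relies on the standard Gumbel-Top-$k$ trick: adding independent Gumbel(0, 1) noise to each score and keeping the $k$ largest draws $k$ items without replacement according to the softmax of the scores. The `PromptRecord` class, the `temperature` parameter, and the choice to use the running-mean accuracy divided by the temperature as the score are illustrative assumptions; the abstract does not specify these details.

```python
import math
import random


class PromptRecord:
    """Hypothetical container: a past prompt plus a running mean of its
    minibatch validation accuracy."""

    def __init__(self, text):
        self.text = text
        self.mean_acc = 0.0
        self.n_evals = 0

    def update(self, minibatch_acc):
        # Incremental running mean: m_n = m_{n-1} + (x_n - m_{n-1}) / n
        self.n_evals += 1
        self.mean_acc += (minibatch_acc - self.mean_acc) / self.n_evals


def gumbel_top_k(scores, k, temperature=1.0, rng=random):
    """Sample k distinct indices, favoring high scores.

    Assumption: scores / temperature act as unnormalized log-weights.
    """
    keys = []
    for i, s in enumerate(scores):
        u = max(rng.random(), 1e-12)      # guard against log(0)
        g = -math.log(-math.log(u))       # Gumbel(0, 1) noise
        keys.append((s / temperature + g, i))
    keys.sort(reverse=True)               # largest perturbed score first
    return [i for _, i in keys[:k]]


# Usage: keep a history of prompts, update each with minibatch accuracies,
# then sample which past prompts to condition the next generation on.
history = [PromptRecord("prompt A"), PromptRecord("prompt B"), PromptRecord("prompt C")]
for record, accs in zip(history, [[0.6, 0.7], [0.8, 0.9], [0.5, 0.4]]):
    for acc in accs:
        record.update(acc)

chosen = gumbel_top_k([r.mean_acc for r in history], k=2, temperature=0.1)
selected_prompts = [history[i].text for i in chosen]
```

In this sketch a lower `temperature` concentrates selection on the best-scoring prompts (exploitation), while a higher value makes the draw closer to uniform (exploration).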
