GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering
Derek Austin, Elliott Chartock
TL;DR
GRAD-SUM addresses the problem of manually crafting prompts for LLMs by introducing a gradient-based automatic prompt optimization loop that incorporates task descriptions and user-defined evaluation criteria. The framework comprises five stages: generation, evaluation, gradient generation, gradient summarization, and prompt editing, all guided by beam search and LLM-as-a-judge feedback. Empirical results across multiple benchmarks, including GSM8K, Orca Math, Neural Bridge RAG, HellaSwag, HotPot QA, MMLU, and MT/Vicuna Bench, show consistent improvements over initial prompts and a strong advantage over DSPy, with average gains around 14%. The work demonstrates a scalable, flexible approach to automatic prompt tuning that can operate without an explicit ground-truth answer, and it highlights gradient summarization as key to generalization and cost control.
Abstract
Prompt engineering for large language models (LLMs) is often a manual, time-intensive process that involves generating, evaluating, and refining prompts iteratively to ensure high-quality outputs. While there has been work on automating prompt engineering, existing solutions are generally either tuned to specific tasks with given answers or quite costly. We introduce GRAD-SUM, a scalable and flexible method for automatic prompt engineering that builds on gradient-based optimization techniques. Our approach incorporates user-defined task descriptions and evaluation criteria, and features a novel gradient summarization module to generalize feedback effectively. Our results demonstrate that GRAD-SUM consistently outperforms existing methods across various benchmarks, highlighting its versatility and effectiveness in automatic prompt optimization.
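
To make the loop named in the TL;DR concrete (generation, evaluation, gradients, gradient summarization, prompt editing, with beam search over candidates), the following is a minimal Python sketch and not the authors' implementation. Every name in it (`optimize_prompt`, `generate`, `judge`, `critique`, `summarize`, `edit`) is hypothetical, the LLM calls are replaced by toy stand-in functions, and two details are assumptions of this sketch: judge scores are taken to lie in [0, 1], and only outputs scoring below 1.0 receive a critique.

```python
from collections import Counter


def optimize_prompt(initial_prompt, inputs, generate, judge, critique,
                    summarize, edit, beam_width=2, iterations=3):
    """Sketch of a generate/evaluate/critique/summarize/edit loop with a beam.

    All callables are hypothetical stand-ins for LLM calls:
      generate(prompt, x) -> output text
      judge(x, output) -> score, assumed here to lie in [0, 1]
      critique(prompt, x, output) -> textual feedback (a "gradient")
      summarize(gradients) -> one generalized piece of feedback
      edit(prompt, summary) -> list of candidate prompts
    """
    def evaluate(prompt):
        outputs = [generate(prompt, x) for x in inputs]
        scores = [judge(x, y) for x, y in zip(inputs, outputs)]
        return sum(scores) / len(scores), outputs, scores

    # The beam maps each surviving prompt to its cached evaluation.
    beam = {initial_prompt: evaluate(initial_prompt)}
    for _ in range(iterations):
        candidates = dict(beam)
        for prompt, (_, outputs, scores) in beam.items():
            # Assumption of this sketch: critique only imperfect outputs.
            gradients = [critique(prompt, x, y)
                         for x, y, s in zip(inputs, outputs, scores) if s < 1.0]
            if not gradients:
                continue
            # One summary per prompt, so the editor sees generalized feedback
            # rather than many example-specific critiques.
            summary = summarize(gradients)
            for new_prompt in edit(prompt, summary):
                if new_prompt not in candidates:
                    candidates[new_prompt] = evaluate(new_prompt)
        # Beam search step: keep the top-scoring prompts by mean judge score.
        ranked = sorted(candidates.items(), key=lambda kv: kv[1][0], reverse=True)
        beam = dict(ranked[:beam_width])
    best_prompt, (best_score, _, _) = max(beam.items(), key=lambda kv: kv[1][0])
    return best_prompt, best_score


# Toy stand-ins so the sketch runs without any model access.
def toy_generate(prompt, x):
    return f"{prompt} | {x}"

def toy_judge(x, output):
    return 1.0 if "step by step" in output else 0.0

def toy_critique(prompt, x, output):
    return "The answer skips its reasoning."

def toy_summarize(gradients):
    return Counter(gradients).most_common(1)[0][0]

def toy_edit(prompt, summary):
    return [prompt + " Think step by step."]


best_prompt, best_score = optimize_prompt(
    "Solve the problem.", ["2 + 2", "3 * 5"],
    toy_generate, toy_judge, toy_critique, toy_summarize, toy_edit,
)
```

Note that the judge here scores an output from the input alone, which mirrors the claim that the method can operate without an explicit ground-truth answer. In a real setting each toy function would be replaced by a model call, and the summarization step is where many per-example critiques collapse into a single edit request, which is the plausible source of the cost savings the TL;DR mentions.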
