Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together
Dilara Soylu, Christopher Potts, Omar Khattab
TL;DR
This work tackles end-to-end optimization of modular LM pipelines by proposing BetterTogether, a framework that alternates between prompt-template optimization and LM weight fine-tuning for LM programs written in DSPy. Both steps are guided by bootstrapped traces. The approach is evaluated on HotPotQA, GSM8K, and Iris across three LMs, and combining the two steps typically outperforms prompts-only or weights-only baselines, with substantial gains (up to 78% on HotPotQA and 88% on Iris). The key contribution is showing that the same LM can be taught to improve both its prompts and its weights within a pipeline, using BootstrapFewshotRS for prompt optimization and BootstrapFinetune (LoRA) for weight fine-tuning. This has practical implications for building more reliable multi-stage NLP systems, and the optimizer is publicly available in DSPy for researchers and developers to adopt and extend.
Abstract
Natural Language Processing (NLP) systems are increasingly taking the form of sophisticated modular pipelines, e.g., Retrieval Augmented Generation (RAG), where each module may involve a distinct Language Model (LM) and an associated prompt template. These compound systems often lack intermediate labels or gradient flow to optimize each module, making their end-to-end optimization challenging. Here we seek strategies to optimize both the module-level LM weights and the associated prompt templates of such systems to maximize a downstream task metric. We propose, for the first time, combining weight and prompt optimization strategies to optimize a modular LM pipeline, alternating between the two so that the same LM teaches itself. In experiments with multi-hop QA, mathematical reasoning, and feature-based classification using mistral-7b, llama-2-7b, and llama-3-8b, these BetterTogether strategies, which optimize the weights and prompts of a pipeline together, outperform directly optimizing weights alone and prompts alone by up to 60% and 6%, respectively, on average across LMs and tasks. The BetterTogether optimizer is released in DSPy at http://dspy.ai.
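The alternating scheme described above can be illustrated with a minimal, library-free sketch. It is not the DSPy implementation: the names `bootstrap_traces`, `alternate`, and the `steps` mapping are hypothetical, the prompt and weight steps are passed in as plain callables, and the schedule shown (prompt, then weights, then prompt) is only one illustrative ordering. The only idea taken from the text is that each step is guided by traces the current program produced and that the downstream metric accepted.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Example = Tuple[Any, Any]          # (input, gold label)
Trace = Tuple[Any, Any]            # (input, accepted prediction)
Program = Callable[[Any], Any]     # stand-in for a modular LM pipeline
Step = Callable[[Program, List[Trace]], Program]


def bootstrap_traces(
    program: Program,
    trainset: Sequence[Example],
    metric: Callable[[Any, Any], bool],
) -> List[Trace]:
    """Run the program on each training input; keep runs the metric accepts."""
    traces: List[Trace] = []
    for x, gold in trainset:
        pred = program(x)
        if metric(pred, gold):
            traces.append((x, pred))
    return traces


def alternate(
    program: Program,
    trainset: Sequence[Example],
    metric: Callable[[Any, Any], bool],
    steps: Dict[str, Step],
    schedule: Sequence[str],
) -> Program:
    """Apply optimization steps in the given order (hypothetical helper).

    Before every step, traces are re-bootstrapped from the *current*
    program, so later steps learn from the output of earlier ones.
    """
    for name in schedule:
        traces = bootstrap_traces(program, trainset, metric)
        program = steps[name](program, traces)
    return program
```

In this sketch, a prompt step would turn accepted traces into few-shot demonstrations and a weight step would use them as fine-tuning data; both are left as caller-supplied functions, for example `alternate(program, trainset, metric, {"prompt": p_step, "weights": w_step}, ["prompt", "weights", "prompt"])`.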
