LLM-AutoDiff: Auto-Differentiate Any LLM Workflow

Li Yin, Zhangyang Wang

TL;DR

This paper addresses the bottleneck of manual prompt engineering in complex LLM pipelines by introducing LLM-AutoDiff, a graph-based framework that generalizes textual gradient methods to multi-node, potentially cyclic workflows. It formalizes Automatic LLM Application Optimization (ALAO) and Automatic Prompt Engineering (APE), and implements them in AdalFlow, where a frozen backward LLM produces textual gradients that propagate through both LLM and functional nodes via a dynamic parameter graph $\mathcal{G}_p$. Key innovations include pass-through and time-sequential gradients, peer-aware sub-prompts, selective gradient computation, and a Gradient-Driven Prompt Optimizer (GDPO) that leverages historical feedback for coherent updates. Experiments on single-node and multi-node pipelines (including multi-hop RAG and ReAct-style agents) show improved final accuracy and reduced training cost relative to Text-Grad and DSPy baselines, demonstrating the practical potential of end-to-end automatic optimization of LLM applications. The framework paves the way for scalable, automated refinement of large LLM ecosystems, with future directions toward hyperparameter co-optimization, dynamic graph reconfiguration, and multimodal integration.

Abstract

Large Language Models (LLMs) have reshaped natural language processing, powering applications from multi-hop retrieval and question answering to autonomous agent workflows. Yet prompt engineering -- the task of crafting textual inputs to effectively direct LLMs -- remains difficult and labor-intensive, particularly for complex pipelines that combine multiple LLM calls with functional operations like retrieval and data formatting. We introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering (APE) that extends textual gradient-based methods (such as Text-Grad) to multi-component, potentially cyclic LLM architectures. Implemented within the AdalFlow library, LLM-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine LLM to generate feedback -- akin to textual gradients -- that guides iterative prompt updates. Unlike prior single-node approaches, LLM-AutoDiff inherently accommodates functional nodes, preserves time-sequential behavior in repeated calls (e.g., multi-hop loops), and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts (instructions, formats, or few-shot examples). It further boosts training efficiency by focusing on error-prone samples through selective gradient computation. Across diverse tasks, including single-step classification, multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff consistently outperforms existing textual gradient baselines in both accuracy and training cost. By unifying prompt optimization through a graph-centric lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating LLM workflows -- mirroring the transformative role that automatic differentiation libraries have long played in neural network research.
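The selective gradient computation mentioned in the abstract can be illustrated with a small sketch. This is not the AdalFlow API: `PromptParameter`, `selective_backward`, and the three stub functions are hypothetical names introduced only for illustration, and the "LLM" and "backward engine" are deterministic stand-ins. The idea shown is simply that the backward engine is queried only for samples whose score falls below a threshold, so correct samples cost no backward calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class PromptParameter:
    """A trainable piece of prompt text (hypothetical stand-in, not an AdalFlow class)."""
    name: str
    text: str
    gradients: List[str] = field(default_factory=list)


def selective_backward(
    param: PromptParameter,
    samples: List[Tuple[str, str]],
    forward: Callable[[str, str], str],
    score: Callable[[str, str], float],
    backward_engine: Callable[[str, str, str, str], str],
    threshold: float = 1.0,
) -> int:
    """Collect textual feedback only for samples scoring below `threshold`.

    Returns the number of backward-engine calls that were made.
    """
    n_calls = 0
    for question, target in samples:
        prediction = forward(param.text, question)
        if score(prediction, target) >= threshold:
            continue  # sample is already correct: skip the backward call
        feedback = backward_engine(param.text, question, prediction, target)
        param.gradients.append(feedback)
        n_calls += 1
    return n_calls


# --- Deterministic stubs standing in for real LLM calls (assumptions) ---
def toy_forward(prompt: str, question: str) -> str:
    # Stub "LLM": counts words correctly only when there are at most three.
    n_words = len(question.split())
    return str(n_words) if n_words <= 3 else "unknown"


def exact_match(prediction: str, target: str) -> float:
    return 1.0 if prediction == target else 0.0


def toy_backward_engine(prompt: str, question: str, prediction: str, target: str) -> str:
    # Stub "backward engine": returns a fixed-format textual critique.
    return f"For '{question}', the prompt led to '{prediction}' but '{target}' was expected."


param = PromptParameter(name="task_instruction", text="Count the words in the input.")
samples = [("a b", "2"), ("a b c", "3"), ("a b c d e", "5")]
n_calls = selective_backward(param, samples, toy_forward, exact_match, toy_backward_engine)
print(n_calls, param.gradients)
```

Only the third sample fails in this toy setup, so a single piece of feedback is collected; in a real pipeline the saved calls would be backward-engine LLM requests.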

Paper Structure

This paper contains 55 sections, 14 equations, 8 figures, 4 tables, 2 algorithms.

Figures (8)

  • Figure 2: Representing an LLM application as an auto-differentiable computation graph. We illustrate how each node in the graph can be one of three types: an LLM component, a functional component, or a loss component. By coupling this graph with a gradient-driven LLM prompt optimizer, the traditionally labor-intensive task of manually crafting prompts is automated. During training, a forward pass traces every intermediate input and output, while the backward pass backpropagates "evaluation" signals (e.g., losses or textual gradients) through the network to locate and correct errors. Crucially, multiple invocations of the same node (such as when LLM_query_0 and LLM_query_1 are the same component called twice, along with the retriever being called twice) can accumulate multiple gradients, making it essential for the auto-differentiation engine to preserve the correct sequence of updates. This unified approach not only highlights where mistakes arise in multi-component pipelines (including loops), but also provides a systematic means to refine all relevant prompt parameters in tandem.
  • Figure 3: Example of a ReAct Auto-Differentiable Graph. This configuration extends our multi-hop RAG paradigm by incorporating a ReAct planner and functional nodes for tool usage and final output assembly. Here, ExecuteAction calls the retriever module when needed, while Finish is a simple function, governed by its own function docstring, that merges the accumulated step history into a coherent answer. The Combine node is a functional component that aggregates outputs from both ExecuteAction and Finish before producing the final output. Notably, a skip connection from the Task Description provides direct feedback to the ReAct planner, enabling more targeted textual gradients across the entire system. This design highlights how agent-style loops, functional transformations, and skip connections can all be captured in a unified auto-differentiable framework for LLM-driven workflows.
  • Figure 4: Gradient-Driven Prompt Optimizer. The optimizer LLM is prompted with the textual gradients and the previous prompt, along with a multi-node “system view.” It proposes new sub-prompt texts aligned with the identified errors.
  • Figure 5: Object count computation graph
  • Figure 6: Trec-6 computation graph
  • ...and 3 more figures
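
The Figure 2 caption describes a forward pass that traces every call and a backward pass in which a component invoked more than once accumulates several gradients in the correct order. A minimal sketch of that bookkeeping follows, under stated assumptions: `Node`, `Call`, and `backward` are hypothetical names rather than AdalFlow classes, and feedback is forwarded unchanged at every node, whereas the framework described above would have a backward engine LLM rewrite it at LLM nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A component in the graph (hypothetical stand-in)."""
    name: str
    kind: str  # "llm", "functional", or "loss"
    gradients: List[Tuple[int, str]] = field(default_factory=list)


@dataclass
class Call:
    """One traced invocation of a node during the forward pass."""
    step: int
    node: Node
    inputs: List["Call"] = field(default_factory=list)


def backward(trace: List[Call], loss_feedback: str) -> None:
    """Propagate feedback through the recorded calls in reverse order.

    Assumes `trace` is in forward execution order and ends with the loss call.
    LLM nodes record (step, feedback); functional and loss nodes only pass
    feedback through to their inputs.
    """
    incoming: Dict[int, List[str]] = {trace[-1].step: [loss_feedback]}
    for call in reversed(trace):
        for message in incoming.pop(call.step, []):
            if call.node.kind == "llm":
                call.node.gradients.append((call.step, message))
            for parent in call.inputs:
                incoming.setdefault(parent.step, []).append(message)
    # Restore forward (time-sequential) order for nodes that were called repeatedly.
    for node in {id(c.node): c.node for c in trace}.values():
        node.gradients.sort(key=lambda g: g[0])


# Two-hop pipeline: the query generator and the retriever are each called twice.
query_gen = Node("query_generator", "llm")
retriever = Node("retriever", "functional")
answer = Node("answer_generator", "llm")
loss = Node("loss", "loss")

c0 = Call(0, query_gen)
c1 = Call(1, retriever, [c0])
c2 = Call(2, query_gen, [c1])
c3 = Call(3, retriever, [c2])
c4 = Call(4, answer, [c3])
c5 = Call(5, loss, [c4])

backward([c0, c1, c2, c3, c4, c5], "The answer names the wrong city.")
print(query_gen.gradients)
```

In this toy trace the query generator ends up with one gradient per invocation, tagged with steps 0 and 2, while the retriever stores nothing and acts purely as a pass-through.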