Dynamic Policy Induction for Adaptive Prompt Optimization: Bridging the Efficiency-Accuracy Gap via Lightweight Reinforcement Learning

Jiexi Xu

TL;DR

This work tackles the rigid efficiency-accuracy trade-off in prompting large language models by introducing the Prompt Policy Network (PPN), a lightweight external RL agent that treats the choice of prompting strategy as a single-step MDP. Trained via Proximal Policy Optimization with a resource-aware reward $R = \alpha \cdot \text{Accuracy} - \beta \cdot \text{Computational Cost}$, the PPN reserves costly reasoning for inputs that justify it, reducing token costs while maintaining performance. On reasoning benchmarks the approach improves the efficiency-accuracy Pareto front, with token-cost reductions of up to 61.5% versus Self-Consistency at competitive accuracy, and the learned strategy distributions shift toward cheaper strategies on easy inputs. The work provides a practical framework for cost-efficient, scalable LLM deployment, enabling finer-grained control over resource usage without sacrificing essential reasoning capabilities.

Abstract

The performance of Large Language Models (LLMs) depends heavily on the chosen prompting strategy, yet static approaches such as Zero-Shot, Few-Shot, or Chain-of-Thought (CoT) impose a rigid efficiency-accuracy trade-off. Highly accurate strategies like Self-Consistency (SC) incur substantial computational waste on simple tasks, while lightweight methods often fail on complex inputs. This paper introduces the Prompt Policy Network (PPN), a lightweight reinforcement learning framework that formalizes adaptive strategy selection as a single-step Markov Decision Process (MDP). The PPN, trained with Proximal Policy Optimization (PPO) and guided by a resource-explicit reward function, learns to allocate costly reasoning strategies only when necessary. Experiments on arithmetic reasoning benchmarks demonstrate that PPN achieves superior performance on the efficiency-accuracy Pareto front, delivering up to 61.5% token cost reduction compared to Self-Consistency while maintaining competitive accuracy. This work contributes a systematic, adaptive framework for cost-efficient LLM deployment, advancing the design of lightweight optimization techniques for scalable and sustainable language model applications.
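To make the resource-aware reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes that accuracy is a 0/1 correctness signal and that computational cost is the token count divided by a hypothetical budget `max_tokens`; the values of `alpha`, `beta`, and the per-strategy outcomes are illustrative placeholders rather than figures from the paper. Because the MDP is single-step, the return of an episode is simply this immediate reward.

```python
# Sketch of R = alpha * Accuracy - beta * Cost for a single-step episode.
# Assumptions (not from the paper): accuracy is 0/1, cost is tokens / max_tokens.

def reward(correct: bool, tokens: int, alpha: float = 1.0,
           beta: float = 0.5, max_tokens: int = 1000) -> float:
    """Return the resource-aware reward for one query-strategy pair."""
    accuracy = 1.0 if correct else 0.0
    cost = tokens / max_tokens  # normalized cost, hypothetical choice
    return alpha * accuracy - beta * cost

# Hypothetical outcomes for one query: strategy -> (correct, tokens used).
outcomes = {
    "ZS": (False, 50),
    "CoT": (True, 200),
    "SC": (True, 1000),
}

# Score every strategy and pick the one with the highest reward.
rewards = {name: reward(ok, toks) for name, (ok, toks) in outcomes.items()}
best = max(rewards, key=rewards.get)
print(rewards, best)
```

In this made-up case the mid-cost strategy wins: it answers correctly like the most expensive one but pays a smaller cost penalty. Raising `beta` pushes the preference further toward cheap strategies, and lowering it does the opposite.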

Paper Structure

This paper contains 24 sections, 2 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: PPN Architecture and Lightweight RL Optimization Loop. The PPN takes the encoded query state $F_Q$, selects a strategy $P_i$, and receives feedback ($R$) from the LLM execution environment.
  • Figure 2: Efficiency-Accuracy Pareto Front (Conceptual Visualization). Fixed strategies (ZS, CoT, SC) occupy discrete, sub-optimal points. PPN policies (obtained by varying $\alpha/\beta$) lie on a superior Pareto front, illustrating utility maximization.
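
The Pareto-front notion used in Figure 2 can be sketched as follows. This is an illustrative example only: the labels and the (cost, accuracy) pairs are arbitrary placeholders, not results from the paper. A point is on the front if no other point is at least as cheap and at least as accurate while being strictly better on one of the two.

```python
# Sketch: non-dominated points for (cost, accuracy), lower cost and higher
# accuracy being better. All numbers are arbitrary placeholders.

def pareto_front(points):
    """Return labels of points not dominated by any other point."""
    front = []
    for name, cost, acc in points:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            front.append(name)
    return front

# (label, token cost, accuracy) for five hypothetical operating points.
points = [
    ("A", 50, 0.60),
    ("B", 200, 0.80),
    ("C", 1000, 0.85),
    ("D", 150, 0.82),
    ("E", 400, 0.86),
]

front = pareto_front(points)
print(front)  # ['A', 'D', 'E']
```

Here B is dominated by D and C by E, so only A, D, and E remain on the front.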