Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models

Saaduddin Mahmud, Mason Nakamura, Kyle Hollins Wray, Shlomo Zilberstein

TL;DR

This work addresses the misalignment that arises when prompt optimization for black-box LLMs ignores inference-time strategies. It introduces IAPO, a framework that jointly optimizes prompts and inference scaling under user budgets, cast as a contextual best-arm identification problem. A fixed-budget training algorithm, PSST, together with a warm-up heuristic, provides finite-budget guarantees. The approach is extended with Top-$K$ screening to improve efficiency in low-budget regimes. Across six diverse tasks, including multi-objective reasoning and summarization, inference-aware optimization consistently improves cost-adjusted performance over inference-agnostic baselines, showing that prompt quality and inference strategy are intrinsically linked. The results have practical implications for reliable, budget-conscious alignment of black-box LLMs and point to future work on richer inference policies and latency-constrained multi-objective deployment.

Abstract

Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have likewise been shown to improve alignment and performance by trading additional computation for better output. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without accounting for the inference strategy. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a novel unified framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, called PSST (Prompt Scaling via Sequential Trimming), and establish finite-budget guarantees on the error probability. Finally, we evaluate the effectiveness of PSST on six tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness in aligning black-box LLMs using prompt optimization.

Paper Structure

This paper contains 33 sections, 7 theorems, 32 equations, 11 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Inference-agnostic prompt-optimization methods optimize cost-unaware arithmetic mean utility.
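
To make the consequence of Proposition 1 concrete, here is a minimal sketch under stated assumptions; it is not the paper's utility definition or the PSST algorithm. It assumes binary answers, i.i.d. samples, Majority Voting with an odd number of samples, and a linear per-sample cost. The per-query correctness probabilities in `PROMPTS`, the `cost_weight` parameter, and all function names are hypothetical. The inference-agnostic rule ranks prompts by their arithmetic mean single-sample correctness, while the inference-aware rule searches jointly over (prompt, number of samples).

```python
from math import comb


def mv_accuracy(theta: float, n: int) -> float:
    """P(majority of n i.i.d. binary votes is correct); n must be odd."""
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    return sum(
        comb(n, k) * theta**k * (1 - theta) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Hypothetical per-query correctness probabilities for two prompts.
PROMPTS = {
    "A": [0.6] * 10,             # moderately reliable on every query
    "B": [0.9] * 7 + [0.1] * 3,  # strong on most queries, weak on a few
}


def mean_single_sample(thetas):
    """Cost-unaware arithmetic mean utility of one sample per query."""
    return sum(thetas) / len(thetas)


def mean_mv(thetas, n):
    """Mean Majority Voting accuracy over queries with n samples each."""
    return sum(mv_accuracy(t, n) for t in thetas) / len(thetas)


def agnostic_choice():
    """Pick the prompt by mean single-sample correctness only."""
    return max(PROMPTS, key=lambda p: mean_single_sample(PROMPTS[p]))


def aware_choice(scales=(1, 3, 5, 9, 15), cost_weight=0.005):
    """Pick (prompt, n) by MV accuracy minus an assumed linear sample cost."""
    candidates = [(p, n) for p in PROMPTS for n in scales]
    return max(
        candidates,
        key=lambda c: mean_mv(PROMPTS[c[0]], c[1]) - cost_weight * c[1],
    )


if __name__ == "__main__":
    print("agnostic:", agnostic_choice())
    print("aware (cheap samples):", aware_choice(cost_weight=0.005))
    print("aware (costly samples):", aware_choice(cost_weight=0.02))
```

In this toy setting, prompt B has the higher mean single-sample correctness, yet prompt A overtakes it under Majority Voting once samples are cheap enough, because voting amplifies A's uniform 0.6 but cannot rescue the queries where B is usually wrong. Raising the assumed cost weight flips the joint choice back to B with a single sample.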

Figures (11)

  • Figure 1: Inference-agnostic vs. inference-aware prompt optimization. The left side illustrates standard prompt optimization, which treats the inference strategy as fixed: a best prompt is selected during training and then used at inference with a predetermined number of samples, which can lead to misaligned outputs and high inference cost for some queries. The right side shows our inference-aware framework IAPO with the PSST algorithm, which conditions on user context such as budget and preferences, jointly selects the prompt and inference scale, and produces responses that better satisfy objectives and budget. Project page, code, and appendix are available online (https://iapo-aaai25.github.io/).
  • Figure 2: Prompt–Inference Interdependence. (a) Accuracy under MV with Llama-3.3-70B-Instruct, showing prompt dominance shifts with budget (shaded). (b, c) Cost-adjusted reward under BoN decoding. Prompt and inference scales vary with user-specified trade-offs.
  • Figure 3: Expected utility ($w_{k+1} = 0$) for MV (left) and BoN (right). MV shows a sharp performance drop when the correctness probability $\theta$ drops below 0.5, whereas BoN is strictly concave.
  • Figure 4: Comparison between exploration strategies across six datasets.
  • Figure 5: Effectiveness of inference-aware optimization across six datasets.
  • ...and 6 more figures
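
The qualitative shapes described for Figure 3 can be reproduced with textbook closed forms. The sketch below is an illustration under simplifying assumptions, not the paper's exact utility: answers are binary, samples are i.i.d. with per-sample correctness probability $\theta$, Majority Voting uses an odd number of samples, and Best-of-N is assumed to have a perfect verifier, so it succeeds whenever at least one sample is correct. The function names are hypothetical.

```python
from math import comb


def mv_success(theta: float, n: int) -> float:
    """Majority Voting: P(more than half of n binary votes are correct)."""
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    return sum(
        comb(n, k) * theta**k * (1 - theta) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def bon_success(theta: float, n: int) -> float:
    """Best-of-N with an assumed perfect verifier: P(at least one correct)."""
    return 1 - (1 - theta) ** n


def is_concave_on_grid(f, n: int, steps: int = 10) -> bool:
    """Check that second differences of f(., n) are negative on a theta grid."""
    grid = [i / steps for i in range(steps + 1)]
    return all(
        f(grid[i - 1], n) + f(grid[i + 1], n) - 2 * f(grid[i], n) < 0
        for i in range(1, steps)
    )


if __name__ == "__main__":
    n = 15
    for theta in (0.3, 0.4, 0.5, 0.6, 0.7):
        print(f"theta={theta:.1f}  MV={mv_success(theta, n):.4f}  "
              f"BoN={bon_success(theta, n):.4f}")
    print("BoN concave in theta (n=5):", is_concave_on_grid(bon_success, 5))
```

Under these assumptions, voting amplifies whichever answer is more likely: for $\theta > 0.5$ additional samples help, while for $\theta < 0.5$ they push accuracy toward zero, which gives the sharp transition around 0.5. The Best-of-N curve $1 - (1 - \theta)^N$ has second derivative $-N(N-1)(1-\theta)^{N-2}$, which is negative for $N \ge 2$ and $\theta < 1$, so it is concave in $\theta$.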

Theorems & Definitions (11)

  • Proposition 1: Inference-Agnostic Utility
  • Proposition 2: Inference-Agnostic Optimality
  • Theorem 1: Error of PSST
  • Theorem 2: Error of PSST
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • proof : Proof of Theorem 1
  • Proposition 2: Inference-Agnostic Optimality
  • ...and 1 more