POLCA: Stochastic Generative Optimization with LLM

Xuanfei Ren, Allen Nie, Tengyang Xie, Ching-An Cheng

Abstract

Optimizing complex systems, ranging from LLM prompts to multi-turn agents, traditionally requires labor-intensive manual iteration. We formalize this challenge as a stochastic generative optimization problem where a generative language model acts as the optimizer, guided by numerical rewards and text feedback to discover the best system. We introduce Prioritized Optimization with Local Contextual Aggregation (POLCA), a scalable framework designed to handle stochasticity in optimization -- such as noisy feedback, sampling minibatches, and stochastic system behaviors -- while effectively managing the unconstrained expansion of the solution space. POLCA maintains a priority queue to manage the exploration-exploitation tradeoff, systematically tracking candidate solutions and their evaluation histories. To enhance efficiency, we integrate an $\varepsilon$-Net mechanism to maintain parameter diversity and an LLM Summarizer to perform meta-learning across historical trials. We theoretically prove that POLCA converges to near-optimal candidate solutions under stochasticity. We evaluate our framework on diverse benchmarks, including $\tau$-bench, HotpotQA (agent optimization), VeriBench (code translation), and KernelBench (CUDA kernel generation). Experimental results demonstrate that POLCA achieves robust, sample- and time-efficient performance, consistently outperforming state-of-the-art algorithms in both deterministic and stochastic problems. The codebase for this work is publicly available at https://github.com/rlx-lab/POLCA.
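
To make the select, evaluate, propose, and filter cycle from the abstract concrete, here is a minimal toy sketch. It is not the authors' implementation. Everything in it is an assumption for illustration: candidates are plain floats rather than programs or prompts, `noisy_reward` is a hypothetical stochastic objective, `propose` is a random perturbation standing in for the LLM optimizer, `ucb_score` uses a generic UCB1-style bonus rather than the paper's priority, the $\varepsilon$-Net test uses absolute distance rather than a semantic distance, and a linear scan stands in for the priority queue. The LLM Summarizer is omitted.

```python
import math
import random


def ucb_score(mean, count, step, c=0.5):
    # Generic UCB1-style priority (an assumption, not the paper's formula):
    # empirical mean plus a bonus that shrinks as a candidate is re-evaluated.
    return mean + c * math.sqrt(math.log(step + 1) / count)


def is_novel(candidate, kept, eps):
    # Epsilon-net test: accept only if at least eps away from every kept candidate.
    return all(abs(candidate - other) >= eps for other in kept)


def noisy_reward(x, rng):
    # Hypothetical stochastic objective with its peak at x = 0.7.
    return 1.0 - (x - 0.7) ** 2 + rng.gauss(0.0, 0.1)


def propose(x, rng):
    # Stand-in for the LLM optimizer: a random perturbation clipped to [0, 1].
    return min(1.0, max(0.0, x + rng.gauss(0.0, 0.2)))


def run(steps=300, eps=0.05, seed=0):
    rng = random.Random(seed)
    # candidate -> [sum of observed rewards, number of evaluations]
    stats = {0.0: [noisy_reward(0.0, rng), 1]}
    for step in range(1, steps + 1):
        # Select the stored candidate with the highest priority (linear scan).
        chosen = max(
            stats,
            key=lambda k: ucb_score(stats[k][0] / stats[k][1], stats[k][1], step),
        )
        # Re-evaluate it to refine its noisy reward estimate.
        stats[chosen][0] += noisy_reward(chosen, rng)
        stats[chosen][1] += 1
        # Generate a new candidate and keep it only if it passes the filter.
        candidate = propose(chosen, rng)
        if is_novel(candidate, stats, eps):
            stats[candidate] = [noisy_reward(candidate, rng), 1]
    return stats


def best_by_mean(stats):
    # Report the stored candidate with the highest empirical mean reward.
    return max(stats, key=lambda k: stats[k][0] / stats[k][1])
```

Calling `best_by_mean(run())` returns one of the stored candidates. In this toy setting the filter caps how many candidates can be stored, because points in $[0, 1]$ that are pairwise at least `eps` apart are limited in number, while repeated evaluation of high-priority candidates averages out reward noise.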

Paper Structure (76 sections, 3 theorems, 39 equations, 11 figures, 2 tables, 4 algorithms)

Key Result

Theorem 1

Suppose $\mu:\Theta\rightarrow[0,B]$. If we run POLCA with the UCB priority defined in eq:ucb for $n$ iterations, then the expected total number of selections for programs with rewards in $[0,B-\gamma]$ is bounded by When the reward observations are deterministic ($\sigma=0$), the bound becomes independent of the program space and depends only on the reward space and the optimization oracle:

Figures (11)

  • Figure 1: Left: The POLCA framework for generative optimization. POLCA maintains a memory buffer as an $\varepsilon$-Net to ensure diverse program storage. In each iteration, it selects promising parameter candidates from the $\varepsilon$-Net, evaluates them against a sampled minibatch, and generates new candidate parameters based on the feedback. These candidates undergo a semantic Filtering stage; accepted parameters are evaluated on the minibatch and integrated into the $\varepsilon$-Net. Finally, a Summarize step compresses the memory to provide concise global context $C$ for the next optimization cycle. Right: Normalized performance averaged across benchmarks ($\tau$-Bench, HotpotQA, VeriBench, and KernelBench). The solid curve represents the mean, while the shaded region indicates the standard error across all benchmarks. Results are aggregated by standardizing scores and computational budgets to a scale of $[0, 1]$.
  • Figure 2: Search efficiency across four benchmarks. Solid curves represent the average highest score attained at each step, while the shaded regions denote the standard error across multiple independent runs (6 seeds for $\tau$-bench, 3 for HotpotQA and VeriBench (3-step evaluation), and 1 for KernelBench). Higher curves indicate superior efficiency.
  • Figure 3: In (a--c), solid curves show the mean highest score achieved at each step, with shaded areas representing the standard error over independent runs (6 seeds for (a); 3 seeds for (b, c)). In (d), bar heights denote test scores for programs selected via different criteria across varying training data percentages. Results are averaged over 3 runs, with error bars indicating the standard error. See \ref{subsection:no-regressor} for details.
  • Figure 4: $\tau$-bench: performance vs. number of samples (left), proposal steps (middle), and number of proposals (right). Solid curves represent the average highest score attained at each step, while the shaded regions denote the standard error across multiple independent runs (6 seeds).
  • Figure 5: HotpotQA: performance vs. number of samples (left), proposal steps (middle), and number of proposals (right). Solid curves represent the average highest score attained at each step, while the shaded regions denote the standard error across multiple independent runs (3 seeds).
  • ...and 6 more figures
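
The captions above describe curves built from the highest score attained so far at each step, averaged over seeds, with a standard-error band. The sketch below shows one way such curves could be computed. It is an illustration under assumptions, not the authors' plotting code: the function names are hypothetical, the standard error is taken as the sample standard deviation divided by $\sqrt{n}$, and a single run is given a band of width zero by convention.

```python
import math
import statistics


def running_best(scores):
    # Highest score seen up to and including each step.
    best, curve = float("-inf"), []
    for score in scores:
        best = max(best, score)
        curve.append(best)
    return curve


def mean_and_stderr(runs):
    # runs: one list of per-step scores per seed, all of equal length.
    curves = [running_best(run) for run in runs]
    means, errors = [], []
    for step_values in zip(*curves):
        means.append(statistics.fmean(step_values))
        # Sample standard deviation needs at least two runs; use 0.0 otherwise.
        spread = statistics.stdev(step_values) if len(step_values) > 1 else 0.0
        errors.append(spread / math.sqrt(len(step_values)))
    return means, errors
```

The solid curve corresponds to `means` and the shaded region to `means` plus or minus `errors` at each step.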

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Lemma 3: Bounded selection number for each interval