Learning to Configure Agentic AI Systems

Aditya Taparia; Som Sagar; Ransalu Senanayake

TL;DR

This work introduces ARC, a hierarchical reinforcement learning framework that configures LLM-based agent systems on a per-query basis by jointly selecting workflows, tools, token budgets, and prompts. The architecture splits decision-making into a structure policy and a prompt policy, trained with proximal policy optimization (PPO) on shaped rewards. A supervised fine-tuning (SFT) refinement step follows; per the paper's theoretical analysis, it concentrates the policy on elite (high-reward) configurations. Across multiple reasoning and tool-use benchmarks, ARC outperforms static templates, grid and greedy search, and flat RL baselines, reaching up to 25% higher task accuracy while reducing token usage and runtime. The approach offers an adaptable alternative to one-size-fits-all designs: the reported transfer experiments favor cross-task structural generalization, and accuracy improves with model capacity.

Abstract

Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, a task typically handled today with large fixed templates or hand-tuned heuristics. This leads to brittle behavior and unnecessary compute, since the same cumbersome configuration is often applied to both easy and hard input queries. We formulate agent configuration as a query-wise decision problem and introduce ARC (Agentic Resource & Configuration learner), which uses reinforcement learning to learn a lightweight hierarchical policy that dynamically tailors these configurations. Across multiple benchmarks spanning reasoning and tool-augmented question answering, the learned policy consistently outperforms strong hand-designed baselines and other baselines, achieving up to 25% higher task accuracy while also reducing token and runtime costs. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.
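As a rough illustration of the query-wise decision problem described above, the sketch below picks a configuration in two stages: a structure choice (workflow, tool, budget) followed by a prompt choice conditioned on it. It is not the authors' implementation. The workflow, tool, budget, and prompt names, the length-based difficulty heuristic, the hand-set logits, and the masking rule are all hypothetical stand-ins for what a learned policy would provide.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical configuration space; not the one used in the paper.
WORKFLOWS = ["direct", "chain_of_thought", "tool_loop"]
TOOLS = ["none", "calculator", "search"]
BUDGETS = [256, 512, 1024]
PROMPTS = ["be concise", "think step by step", "verify the answer"]


@dataclass(frozen=True)
class AgentConfig:
    workflow: str
    tool: str
    budget: int
    prompt: str


def masked_sample(logits, mask, rng):
    """Sample an index from softmax(logits) restricted to entries where mask is True."""
    allowed = [i for i, ok in enumerate(mask) if ok]
    if not allowed:
        raise ValueError("mask excludes every action")
    peak = max(logits[i] for i in allowed)  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - peak) for i in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]


def tool_mask(workflow):
    """Hypothetical validity rule: only the tool loop may (and must) use a real tool."""
    if workflow == "tool_loop":
        return [tool != "none" for tool in TOOLS]
    return [tool == "none" for tool in TOOLS]


def configure(query, rng):
    """Pick one configuration for a query: structure first, then prompt."""
    # Stand-in for a learned structure policy: toy logits from a crude query feature.
    hard = len(query.split()) > 12
    workflow_logits = [0.0, 1.0, 2.0] if hard else [2.0, 1.0, 0.0]
    budget_logits = [0.0, 0.5, 1.0] if hard else [1.0, 0.5, 0.0]
    w = masked_sample(workflow_logits, [True] * len(WORKFLOWS), rng)
    t = masked_sample([0.0] * len(TOOLS), tool_mask(WORKFLOWS[w]), rng)
    b = masked_sample(budget_logits, [True] * len(BUDGETS), rng)
    # Stand-in for the prompt policy, conditioned on the chosen workflow.
    prompt_logits = [1.0, 0.0, 0.0] if WORKFLOWS[w] == "direct" else [0.0, 1.0, 1.0]
    p = masked_sample(prompt_logits, [True] * len(PROMPTS), rng)
    return AgentConfig(WORKFLOWS[w], TOOLS[t], BUDGETS[b], PROMPTS[p])


if __name__ == "__main__":
    print(configure("What is 2 + 2?", random.Random(0)))
```

Masking invalid combinations before sampling keeps the policy from spending probability mass on configurations that cannot run, which is one plausible reading of why action masking shrinks the space the policy has to explore.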

Paper Structure

This paper contains 36 sections, 4 theorems, 16 equations, 13 figures, 6 tables, 1 algorithm.

Introduction
Related Work
Methodology
Hierarchical Policy Architecture
Structure Policy
Prompt Policy
Training Procedure
Post-Training Refinement
Theoretical Guarantees
Experiments
Experimental Setup
RQ1: Does Learning Configuration Improve Performance?
RQ2: Does Adaptive Allocation Improve Efficiency?
RQ3: Can Policies Transfer Across Tasks and Model Capacity?
Ablation Studies
...and 21 more sections

Key Result

Lemma 3.1

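Let $\hat{p}_{\text{elite}}(a|s) = n(s,a)/n(s)$ denote the empirical distribution over $\mathcal{D}_{\text{elite}}$, where $n(s,a)$ counts the occurrences of the pair $(s,a)$ in $\mathcal{D}_{\text{elite}}$ and $n(s) = \sum_a n(s,a)$. Under sufficient model capacity, the SFT objective (the paper's equation labeled eq:sft_objective) is minimized when $\pi_\theta = \hat{p}_{\text{elite}}$.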
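The reasoning behind this lemma can be sketched with a standard cross-entropy decomposition. The SFT objective itself is not reproduced here, so the derivation below assumes it is the negative log-likelihood of the elite state-action pairs, with $n(s,a)$ the count of $(s,a)$ in $\mathcal{D}_{\text{elite}}$ and $n(s) = \sum_a n(s,a)$; the symbols $\mathcal{L}_{\text{SFT}}$, $H$, and $D_{\mathrm{KL}}$ are introduced only for this sketch. A normalization by $|\mathcal{D}_{\text{elite}}|$ would not change the minimizer.

```latex
\begin{align*}
\mathcal{L}_{\text{SFT}}(\theta)
  &= -\sum_{(s,a) \in \mathcal{D}_{\text{elite}}} \log \pi_\theta(a|s)
   = -\sum_{s} n(s) \sum_{a} \hat{p}_{\text{elite}}(a|s) \, \log \pi_\theta(a|s) \\
  &= \sum_{s} n(s) \Big[ H\big(\hat{p}_{\text{elite}}(\cdot|s)\big)
     + D_{\mathrm{KL}}\big(\hat{p}_{\text{elite}}(\cdot|s) \,\big\|\, \pi_\theta(\cdot|s)\big) \Big]
\end{align*}
```

The entropy term $H$ does not depend on $\theta$, and each KL term is non-negative and vanishes only when the two distributions coincide. Under this assumed objective, the minimum is therefore reached when $\pi_\theta(\cdot|s) = \hat{p}_{\text{elite}}(\cdot|s)$ for every state with $n(s) > 0$, which "sufficient model capacity" makes attainable.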

Figures (13)

Figure 1: (a) How our method learns to select a configuration from thousands of possibilities for a given input. (b) Improvement of our method across multiple datasets. (These results are for the Qwen 2.5 7B Instruct model.)
Figure 2: Training pipeline. The structure policy selects workflows, tools, and budgets while the prompt policy composes instructions. During RL training, episodes are stored in a memory buffer. After RL converges, high-reward episodes are filtered and used for supervised fine-tuning (SFT), which consolidates successful strategies and improves consistency.
Figure 3: Action masking reduces the effective action-sequence space within the RL policy.
Figure 4: Accuracy vs. cost trade-off on GSM8K. Each point shows average test accuracy versus inference cost for a method. The dashed curve denotes the Pareto frontier, representing non-dominated methods that achieve the best possible accuracy for a given cost. Red points correspond to our ARC variants, which lie on or define the Pareto frontier, indicating superior accuracy–cost efficiency compared to existing baselines.
Figure 5: Scaling trends of model accuracy with capacity. Accuracy as a function of model size for the Qwen 2.5 family (7B, 32B, 72B) across four benchmarks. Performance improves consistently with scale, with gains varying by task complexity.
...and 8 more figures
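The Pareto frontier in the Figure 4 caption can be made concrete with a small sketch. The function below keeps only non-dominated methods, meaning those for which no other method is at least as cheap and at least as accurate and strictly better on one of the two. The method names and numbers are made up for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (name, cost, accuracy) triples, sorted by cost.

    Lower cost and higher accuracy are better. Exact duplicates are kept once.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Scan from cheapest to most expensive; on cost ties, see the most accurate first.
    for name, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        # A point survives only if it beats every cheaper (or equally cheap) point.
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical (name, cost, accuracy) values, not taken from the paper.
methods = [
    ("method_a", 100.0, 0.70),
    ("method_b", 150.0, 0.68),  # dominated by method_a: costlier and less accurate
    ("method_c", 200.0, 0.85),
    ("method_d", 250.0, 0.85),  # dominated by method_c: same accuracy, higher cost
    ("method_e", 400.0, 0.90),
]

if __name__ == "__main__":
    for name, cost, accuracy in pareto_frontier(methods):
        print(name, cost, accuracy)
```

Sorting by cost first means a single pass with a running best accuracy is enough, so the frontier is found in O(n log n) time rather than by comparing every pair of methods.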

Theorems & Definitions (8)

Lemma 3.1: MLE Convergence
Proof sketch
Theorem 3.2: Policy Concentration
Proof sketch
Lemma 3.1: MLE Convergence, Restated
Proof
Theorem 3.2: Policy Concentration, Restated
Proof
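The quantity that the MLE convergence lemma refers to, the empirical distribution $n(s,a)/n(s)$ over elite episodes, can be computed directly from a buffer of episodes. The sketch below is illustrative only. The fixed reward threshold used to select elite episodes, the coarse state labels, and the action names are assumptions, since this summary does not specify how high-reward episodes are filtered.

```python
from collections import Counter


def elite_pairs(buffer, reward_threshold):
    """Keep the (state, action) pairs of episodes whose reward meets the threshold."""
    return [(state, action) for state, action, reward in buffer if reward >= reward_threshold]


def empirical_policy(pairs):
    """Return p_hat[state][action] = n(state, action) / n(state)."""
    pair_counts = Counter(pairs)
    state_counts = Counter(state for state, _ in pairs)
    policy = {}
    for (state, action), n_sa in pair_counts.items():
        policy.setdefault(state, {})[action] = n_sa / state_counts[state]
    return policy


# Hypothetical buffer of (state, action, reward) episodes.
buffer = [
    ("easy", "direct", 1.0),
    ("easy", "direct", 0.9),
    ("easy", "tool_loop", 0.2),
    ("hard", "tool_loop", 1.0),
    ("hard", "direct", 0.1),
    ("hard", "tool_loop", 0.8),
    ("hard", "chain_of_thought", 0.9),
]

if __name__ == "__main__":
    p_hat = empirical_policy(elite_pairs(buffer, reward_threshold=0.8))
    for state, dist in p_hat.items():
        print(state, dist)
```

In this toy buffer the low-reward actions drop out entirely, so the resulting distribution places all of its mass on actions that appeared in elite episodes.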
