Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search

Kyle McCleary; James Ghawaly

TL;DR

Accuracy improves with additional searches up to a small cap, hybrid lexical and dense retrieval with lightweight re-ranking produces the largest average gains in the authors' ablation grid, and larger completion budgets are most helpful on HotpotQA-style synthesis.

Abstract

Agentic Retrieval-Augmented Generation (RAG) systems combine iterative search, planning prompts, and retrieval backends, but deployed settings impose explicit budgets on tool calls and completion tokens. We present a controlled measurement study of how search depth, retrieval strategy, and completion budget affect accuracy and cost under fixed constraints. Using Budget-Constrained Agentic Search (BCAS), a model-agnostic evaluation harness that surfaces remaining budget and gates tool use, we run comparisons across six LLMs and three question-answering benchmarks. Across models and datasets, accuracy improves with additional searches up to a small cap, hybrid lexical and dense retrieval with lightweight re-ranking produces the largest average gains in our ablation grid, and larger completion budgets are most helpful on HotpotQA-style synthesis. These results provide practical guidance for configuring budgeted agentic retrieval pipelines and are accompanied by reproducible prompts and evaluation settings.
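The abstract describes a harness that surfaces the remaining budget to the model and gates tool use. A minimal sketch of that idea follows. It is not the authors' BCAS implementation: the names `run_budgeted_loop`, `policy`, and `search` are hypothetical, and the model is stood in for by a plain callable that returns either a search request or a final answer.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   policy(question, evidence, searches_left) -> ("search", query) or ("answer", text)
#   search(query) -> retrieved text
Policy = Callable[[str, List[str], int], Tuple[str, str]]
Search = Callable[[str], str]


def run_budgeted_loop(
    question: str, policy: Policy, search: Search, max_searches: int
) -> Tuple[str, int]:
    """Run the agent until it answers or asks for a search it cannot afford.

    Returns (answer, searches_used). The answer is empty if the policy
    keeps requesting searches after the budget is exhausted.
    """
    evidence: List[str] = []
    used = 0
    while True:
        remaining = max_searches - used
        # Surface the remaining budget to the policy on every step.
        action, payload = policy(question, evidence, remaining)
        if action == "answer":
            return payload, used
        if remaining == 0:
            # Gate: the tool call is refused once the budget is spent.
            return "", used
        evidence.append(search(payload))
        used += 1


def toy_policy(question: str, evidence: List[str], searches_left: int) -> Tuple[str, str]:
    """Toy stand-in for a model: stop after two documents or when out of budget."""
    if len(evidence) >= 2 or searches_left == 0:
        return "answer", " | ".join(evidence)
    return "search", f"{question} (query {len(evidence) + 1})"


def toy_search(query: str) -> str:
    """Toy retriever that echoes the query it received."""
    return f"doc for: {query}"


if __name__ == "__main__":
    answer, used = run_budgeted_loop("Who wrote X?", toy_policy, toy_search, max_searches=4)
    print(used, answer)
```

In this sketch the policy can terminate early, so the search cap is an upper bound on cost rather than a fixed spend.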

Paper Structure

This paper contains 35 sections, 3 figures, and 2 tables.

Introduction
Research Questions
Related Work
Methodology
Architectural Principles
Prompt Policy
Budget Accounting
The BCAS Execution Loop
Optional Pre-planning
Optional Reflection
Information Retrieval Components
Retrieval and re-ranking pipeline.
Vector Embedding
Re-ranking
Evaluation
...and 20 more sections

Figures (3)

Figure 1: An example of DeepSeek V3 (0324) solving a HotpotQA 'hard' problem using BCAS with 4 searches and planning. This demonstrates the core principles of BCAS: the model operates within a strict 4-search budget (RQ1, RQ3) while leveraging system components like planning and iterative search to efficiently locate the answer (RQ2) and terminate early.
Figure 2: Search scaling and context scaling performance across TriviaQA, HotpotQA, and 2WikiMultihopQA. The "Planning" columns show the impact of adding pre-planning and reflection (RF) strategies. This figure provides the primary data for analyzing the accuracy-budget trade-off (RQ3). Confidence intervals are 95% Wilson score intervals (Wilson, 1927).
Figure 3: Ablation study of BCAS features on HotpotQA. We measure the net effect of each feature on retrieval accuracy relative to the BCAS baseline with a BM25 retriever. 95% confidence intervals are computed using Newcombe's Method 9 (Newcombe, 1998).
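
The Figure 2 caption reports 95% Wilson score intervals on accuracy. As an illustration of that interval only, the sketch below computes it from a count of correct answers. The function name `wilson_interval` and the example counts are assumptions made here, not values or code from the paper.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical example: 50 correct answers out of 100 questions.
    low, high = wilson_interval(50, 100)
    print(f"{low:.4f} {high:.4f}")
```

Unlike the simple normal approximation, this interval stays inside [0, 1] and remains usable when accuracy is near 0 or 1, which matters for small evaluation subsets.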
