Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search

Kyle McCleary, James Ghawaly

TL;DR

Accuracy improves with additional searches up to a small cap, hybrid lexical and dense retrieval with lightweight re-ranking produces the largest average gains in the authors' ablation grid, and larger completion budgets are most helpful on HotpotQA-style synthesis.

Abstract

Agentic Retrieval-Augmented Generation (RAG) systems combine iterative search, planning prompts, and retrieval backends, but deployed settings impose explicit budgets on tool calls and completion tokens. We present a controlled measurement study of how search depth, retrieval strategy, and completion budget affect accuracy and cost under fixed constraints. Using Budget-Constrained Agentic Search (BCAS), a model-agnostic evaluation harness that surfaces remaining budget and gates tool use, we run comparisons across six LLMs and three question-answering benchmarks. Across models and datasets, accuracy improves with additional searches up to a small cap, hybrid lexical and dense retrieval with lightweight re-ranking produces the largest average gains in our ablation grid, and larger completion budgets are most helpful on HotpotQA-style synthesis. These results provide practical guidance for configuring budgeted agentic retrieval pipelines and are accompanied by reproducible prompts and evaluation settings.
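
The idea of a harness that "surfaces remaining budget and gates tool use" can be illustrated with a minimal sketch. This is not the BCAS implementation: the names `SearchBudget` and `gated_search`, the message wording, and the retriever interface (a callable from a query string to a list of passages) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchBudget:
    """Hypothetical tracker for a fixed cap on search tool calls."""
    max_searches: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return self.max_searches - self.used

    def status(self) -> str:
        # Text appended to every tool result so the model sees its budget.
        return f"Searches remaining: {self.remaining} of {self.max_searches}"


def gated_search(budget: SearchBudget,
                 retriever: Callable[[str], List[str]],
                 query: str) -> str:
    """Run one search if budget allows; otherwise refuse the tool call."""
    if budget.remaining <= 0:
        # Gate: no retrieval happens once the cap is reached.
        return "Search budget exhausted. Answer with the evidence gathered so far."
    budget.used += 1
    passages = retriever(query)
    return "\n".join(passages) + "\n" + budget.status()
```

In a loop of this shape, the model would receive the returned string as the tool result, so the remaining budget is visible at every step and further calls are refused once the cap is hit.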

Paper Structure (35 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: An example of DeepSeek V3 (0324) solving a HotpotQA 'hard' problem using BCAS with 4 searches and planning. This demonstrates the core principles of BCAS: the model operates within a strict 4-search budget (RQ1, RQ3) while leveraging system components like planning and iterative search to efficiently locate the answer (RQ2) and terminate early.
  • Figure 2: Search scaling and context scaling performance across TriviaQA, HotpotQA, and 2WikiMultihopQA. The "Planning" columns show the impact of adding pre-planning and reflection (RF) strategies. This figure provides the primary data for analyzing the accuracy-budget trade-off (RQ3). Confidence intervals are 95% Wilson score intervals (Wilson, 1927).
  • Figure 3: Ablation study of BCAS features on HotpotQA. We measure the net effect of each feature on retrieval accuracy relative to the BCAS baseline with a BM25 retriever. The 95% confidence intervals are computed using Newcombe's Method 9 (Newcombe, 1998).
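
For readers unfamiliar with the Wilson score interval cited in the Figure 2 caption, the following is a minimal sketch of the standard formula for a binomial proportion. It is not taken from the paper's code; the function name `wilson_interval` and its arguments are illustrative assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int,
                    z: float = 1.959963984540054) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    `z` is the standard normal quantile; the default gives a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n
                               + z * z / (4.0 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Example: 50 correct answers out of 100 questions.
low, high = wilson_interval(50, 100)
print(f"{low:.4f}, {high:.4f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when accuracy is near 0 or 1, which is relevant when per-configuration sample sizes are modest.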