Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

Paulo Akira F. Enabe

Abstract

Large language model agents that use external tools are often implemented through reactive execution, in which reasoning is repeatedly recomputed after each observation, increasing latency and sensitivity to error propagation. This work introduces Profile-Then-Reason (PTR), a bounded execution framework for structured tool-augmented reasoning, in which a language model first synthesizes an explicit workflow, deterministic or guarded operators execute that workflow, a verifier evaluates the resulting trace, and repair is invoked only when the original workflow is no longer reliable. A mathematical formulation is developed in which the full pipeline is expressed as a composition of profile, routing, execution, verification, repair, and reasoning operators; under bounded repair, the number of language-model calls is restricted to two in the nominal case and three in the worst case. Experiments against a ReAct baseline on six benchmarks and four language models show that PTR achieves the pairwise exact-match advantage in 16 of 24 configurations. The results indicate that PTR is particularly effective on retrieval-centered and decomposition-heavy tasks, whereas reactive execution remains preferable when success depends on substantial online adaptation.

Paper Structure

This paper contains 9 sections, 2 theorems, 41 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Proposition 2.1

Under bounded repair, if each workflow step admits at most $N_{\mathrm{rec}}$ deterministic retries, then for each task instance: (i) the total number of language-model invocations belongs to $\{2,3\}$, and (ii) the total number of tool invocations satisfies where $L$ is the length of the original workflow and $L^{\sharp}$ is the length of the repaired workflow, with $L^{\sharp} = 0$ in the absen

Figures (2)

  • Figure 1: Schematic representation of the PTR execution pipeline. Rectangular nodes denote state objects or operator applications, whereas the decision node denotes the deterministic repair trigger induced by the verification indicator $\xi$. The semantic stages PROFILE, REPAIR, and REASON are separated by deterministic routing, execution, and verification stages. The repair branch is activated at most once per task instance.
  • Figure 2: Dataset-level average exact-match difference $\Delta \mathrm{EM} = \mathrm{EM}_{\mathrm{PTR}} - \mathrm{EM}_{\mathrm{ReAct}}$, averaged over the four evaluated language models. Positive values indicate a PTR advantage, whereas negative values indicate a ReAct advantage. PTR is favored on retrieval-centered and decomposition-heavy tasks (NQ-Open, TriviaQA, GSM8K, StrategyQA), while ReAct retains an advantage on AQuA-RAT and HotPotQA, where success depends more strongly on symbolic flexibility or online search refinement.
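The control flow described in Figure 1 and the call counts in Proposition 2.1 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`run_ptr`, `run_steps`), the callable signatures, the convention that a tool returns `None` on failure, and the choice to replace the trace after repair are all assumptions made for illustration. The sketch only shows why a single-shot repair branch keeps the number of language-model calls in $\{2,3\}$, and how a per-step retry cap bounds tool calls. In this sketch the tool count is at most $(1 + N_{\mathrm{rec}})(L + L^{\sharp})$; the exact inequality stated in the paper may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Counters:
    llm_calls: int = 0   # semantic stages: PROFILE, REPAIR, REASON
    tool_calls: int = 0  # deterministic tool invocations, including retries


def run_steps(workflow: List[Any],
              tool: Callable[[Any], Optional[Any]],
              n_rec: int,
              counters: Counters) -> List[Optional[Any]]:
    """Execute each step with one attempt plus at most n_rec retries.

    Assumption of this sketch: a tool signals failure by returning None.
    """
    trace = []
    for step in workflow:
        result = None
        for _ in range(1 + n_rec):
            counters.tool_calls += 1
            result = tool(step)
            if result is not None:
                break
        trace.append(result)
    return trace


def run_ptr(task: Any,
            profile: Callable[[Any], List[Any]],
            repair: Callable[[Any, List[Any], List[Any]], List[Any]],
            reason: Callable[[Any, List[Any]], Any],
            tool: Callable[[Any], Optional[Any]],
            verify: Callable[[List[Any]], bool],
            n_rec: int) -> Tuple[Any, Counters, int, int]:
    counters = Counters()

    # PROFILE: one language-model call synthesizes the workflow.
    workflow = profile(task)
    counters.llm_calls += 1

    # Deterministic execution and verification (no language-model calls).
    trace = run_steps(workflow, tool, n_rec, counters)

    # REPAIR: triggered at most once, only if verification fails.
    repaired: List[Any] = []
    if not verify(trace):
        repaired = repair(task, workflow, trace)
        counters.llm_calls += 1
        # Assumption: the repaired trace replaces the original one.
        trace = run_steps(repaired, tool, n_rec, counters)

    # REASON: one final language-model call over the trace.
    answer = reason(task, trace)
    counters.llm_calls += 1

    return answer, counters, len(workflow), len(repaired)
```

Because the repair branch is an `if` rather than a loop, no input can push `llm_calls` above three, in contrast to a reactive loop in which the number of model calls grows with the number of observations.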

Theorems & Definitions (5)

  • Definition 2.1: Bounded repair
  • Proposition 2.1: Bounded semantic complexity
  • proof
  • Proposition 2.2: Deterministic downstream execution
  • proof