BRAID: Bounded Reasoning for Autonomous Inference and Decisions

Armağan Amcalar, Eyup Cinar

TL;DR

BRAID introduces a bounded, diagrammatic reasoning framework that replaces unbounded natural-language traces with Mermaid diagrams to improve token efficiency and reliability in autonomous inference. Across GSM-Hard, SCALE MultiChallenge, and AdvancedIF benchmarks, BRAID enables smaller models to match or exceed larger-model performance while substantially lowering cost, as quantified by the Performance-per-Dollar (PPD) metric. A two-stage generation/solve pipeline and a caching strategy enable dramatic efficiency gains (up to tens of times the baseline) by decoupling reasoning topology from execution. The work demonstrates both accuracy improvements and economic advantages, highlighting BRAID as a scalable methodology for deploying cost-effective, reasoning-enabled autonomous agents. It also outlines concrete future directions for specialized graph generators, dynamic planning, and multimodal graph ingestion to extend BRAID’s applicability.

Abstract

Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Autonomous Inference and Decisions) across multiple GPT model tiers, evaluated on the AdvancedIF, GSM-Hard, and SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid-based instruction graphs that enable models to reason structurally rather than through unbounded natural-language token expansion. We show that structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at https://benchmark.openserv.ai.
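The two-stage generation/solve pipeline with caching mentioned in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: the class `BraidPipeline`, the callables `generate_graph` and `solve_with_graph`, the cache keyed by task type, and the stub Mermaid graph are all hypothetical names and choices made for illustration. The stubs stand in for real model calls, so the sketch only shows how a cached reasoning graph lets the generation stage run once while the solving stage runs per question.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BraidPipeline:
    """Hypothetical two-stage pipeline: generate a Mermaid graph once, reuse it to solve."""
    generate_graph: Callable[[str], str]         # assumed: task key -> Mermaid diagram text
    solve_with_graph: Callable[[str, str], str]  # assumed: (Mermaid text, question) -> answer
    cache: Dict[str, str] = field(default_factory=dict)
    generator_calls: int = 0

    def answer(self, task_key: str, question: str) -> str:
        graph = self.cache.get(task_key)
        if graph is None:
            # Stage 1 (generation): only runs on a cache miss.
            graph = self.generate_graph(task_key)
            self.generator_calls += 1
            self.cache[task_key] = graph
        # Stage 2 (solving): runs for every question, guided by the cached graph.
        return self.solve_with_graph(graph, question)


def stub_generator(task_key: str) -> str:
    # Stand-in for a generator model; returns a fixed Mermaid flowchart.
    return (
        "flowchart TD\n"
        "  A[Read problem] --> B[Extract quantities]\n"
        "  B --> C[Compute result]\n"
        "  C --> D[Report answer]"
    )


def stub_solver(graph: str, question: str) -> str:
    # Stand-in for a solver model; just reports how many edges it was given.
    edges = graph.count("-->")
    return f"followed {edges} edges for: {question}"


pipeline = BraidPipeline(stub_generator, stub_solver)
first = pipeline.answer("arithmetic-word-problem", "What is 17 * 3?")
second = pipeline.answer("arithmetic-word-problem", "What is 9 + 4?")
print(first)
print(second)
print("generator calls:", pipeline.generator_calls)
```

In this sketch the second question reuses the cached graph, so the generation cost is paid once per task type rather than once per question, which is the decoupling of reasoning topology from execution that the TL;DR describes.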

Paper Structure

This paper contains 20 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: (a) Unstructured prompting encourages models to show intermediate steps in natural language before the answer. (b) Structured and enhanced prompting techniques explicitly decompose the problem into simpler sub-problems and solve them sequentially. (c) BRAID replaces the natural-language prompt trace with structured, symbolic reasoning paths expressed in Mermaid diagrams.
  • Figure 2: Comparative reasoning accuracy of BRAID vs. Classic prompting, showing the best result for each solving model: BRAID (blue) can enable smaller models to match or exceed the performance of larger models using Classic prompting (hatched) across the (a) GSM-Hard, (b) SCALE MultiChallenge, and (c) AdvancedIF instruction-following benchmarks.
  • Figure 3: Average Cost Breakdown per Response (in US cents): Results contrast the inference costs of BRAID Generation and Solving phases against the Classic prompting baseline across various model architectures. Notably, the solving costs (light blue) for smaller models are significantly lower than the baseline, demonstrating a major economic advantage for agentic workflows that leverage cached Mermaid reasoning graphs.
  • Figure 4: Performance per Dollar (PPD) for BRAID Generation $\times$ Solving combinations on the SCALE MultiChallenge dataset, computed from solving-model costs only. Higher values indicate greater cost efficiency relative to the gpt-5-medium Classic baseline.
  • Figure 5: Performance per Dollar (PPD) for BRAID Generation $\times$ Solving combinations on the AdvancedIF dataset, computed from solving-model costs only. The metric highlights the cost efficiency of nano-scale models for this task compared to the gpt-5-medium baseline.
  • ...and 1 more figure
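The captions for Figures 4 and 5 report Performance per Dollar (PPD) relative to a baseline, but this section does not give the formula. The sketch below assumes one plausible reading: accuracy divided by cost, normalized by the same ratio for the baseline, so that the baseline scores 1.0. The function name `performance_per_dollar` and all numbers are hypothetical and are not results from the paper.

```python
import math


def performance_per_dollar(accuracy: float, cost: float,
                           baseline_accuracy: float, baseline_cost: float) -> float:
    """Assumed definition: (accuracy / cost) normalized by the baseline's (accuracy / cost)."""
    if cost <= 0 or baseline_cost <= 0 or baseline_accuracy <= 0:
        raise ValueError("costs and baseline accuracy must be positive")
    return (accuracy / cost) / (baseline_accuracy / baseline_cost)


# Hypothetical inputs: same accuracy as the baseline at one quarter of the cost.
ppd = performance_per_dollar(accuracy=0.5, cost=1.0,
                             baseline_accuracy=0.5, baseline_cost=4.0)
baseline_ppd = performance_per_dollar(0.5, 4.0, 0.5, 4.0)
print(ppd, baseline_ppd)
```

Under this assumed definition, a value above 1.0 means the configuration delivers more accuracy per unit of cost than the baseline, which matches the captions' statement that higher values indicate better cost efficiency.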