Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

Luís Silva, Diogo Gonçalves, Catarina Farinha, Clara Matos, Luís Ungaro

TL;DR

The results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.

Abstract

Large language models struggle to maintain strict adherence to structured workflows in high-stakes domains such as healthcare triage. Monolithic approaches that encode entire decision structures within a single prompt are prone to instruction-following degradation as prompt length increases, including lost-in-the-middle effects and context window overflow. To address this gap, we present Arbor, a framework that decomposes decision tree navigation into specialized, node-level tasks. Decision trees are standardized into an edge-list representation and stored for dynamic retrieval. At runtime, a directed acyclic graph (DAG)-based orchestration mechanism iteratively retrieves only the outgoing edges of the current node, evaluates valid transitions via a dedicated LLM call, and delegates response generation to a separate inference step. The framework is agnostic to the underlying decision logic and model provider. Evaluated against single-prompt baselines across 10 foundation models using annotated turns from real clinical triage conversations, Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost. These results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.
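The navigation loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Edge` layout, the function names `build_edge_index` and `navigate`, the example tree, and the keyword-matching `stub_evaluator` (which stands in for the dedicated LLM call) are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Optional

# Hypothetical edge layout: (source node, transition condition, target node).
Edge = tuple[str, str, str]

def build_edge_index(edges: list[Edge]) -> dict[str, list[Edge]]:
    """Group an edge list by source node so outgoing edges can be fetched per node."""
    index: dict[str, list[Edge]] = defaultdict(list)
    for edge in edges:
        index[edge[0]].append(edge)
    return index

def navigate(
    current: str,
    index: dict[str, list[Edge]],
    choose_transition: Callable[[str, list[Edge]], Optional[str]],
    max_steps: int = 50,
) -> str:
    """Follow transitions from `current` until the evaluator takes none.

    `choose_transition` sees only the outgoing edges of the current node
    (never the whole tree) and returns a target node, or None to stay put.
    """
    for _ in range(max_steps):  # guard against a misbehaving evaluator
        outgoing = index.get(current, [])
        if not outgoing:
            break  # leaf node: nothing to evaluate
        target = choose_transition(current, outgoing)
        if target is None:
            break  # no condition satisfied: stop at this node
        if target not in {edge[2] for edge in outgoing}:
            raise ValueError(f"invalid transition {current!r} -> {target!r}")
        current = target
    return current

# Illustrative tree (invented for this sketch, not taken from the paper).
edges: list[Edge] = [
    ("start", "patient reports pain", "pain"),
    ("start", "patient reports no pain", "no_pain"),
    ("pain", "pain is severe", "urgent"),
    ("pain", "pain is mild", "self_care"),
]
index = build_edge_index(edges)

# Stand-in for the LLM call: checks conditions against known facts.
facts = {"patient reports pain", "pain is mild"}

def stub_evaluator(node: str, outgoing: list[Edge]) -> Optional[str]:
    for _, condition, target in outgoing:
        if condition in facts:
            return target
    return None

result = navigate("start", index, stub_evaluator)
print(result)
```

In a real deployment the evaluator would be a model call whose prompt contains only the current node's outgoing edges and the conversation so far, and a separate call would then generate the user-facing message for the node where navigation stops.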

Paper Structure (23 sections, 1 equation, 10 figures, 6 tables)

This paper contains 23 sections, 1 equation, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Overview of the proposed Arbor architecture, consisting of decision tree processing (shown at the bottom), an evaluation phase (top left), and a message generation phase (top right). The processing phase normalizes raw sources into an edge-list database, which enables the evaluation phase to dynamically retrieve outgoing edges. The evaluation phase then evaluates transitions via iterative LLM calls and updates the current node until no further transitions are taken. The message generation phase selects the appropriate prompt and produces the user-facing response.
  • Figure 2: Turn Accuracy. Bars show mean turn accuracy over five runs; error bars indicate the standard deviation.
  • Figure 3: Latency per Turn. Bars show mean response latency (in seconds) over five runs. The y-axis is in log scale.
  • Figure 4: Cost per Turn. Bars show the average cost per conversational turn in US Dollars ($). The y-axis is in log scale.
  • Figure 5: Message Approval Rate. Bars show the percentage of messages rated as clinically acceptable by physical therapists for each strategy.
  • ...and 5 more figures