CureAgent: A Training-Free Executor-Analyst Framework for Clinical Reasoning

Ting-Ting Xie, Yixin Zhang

TL;DR

This work tackles context-utilization failures in clinical reasoning LLMs by introducing a training-free Executor-Analyst framework that decouples precise tool retrieval (Executor) from high-level reasoning (Analyst). A Stratified Ensemble topology preserves evidentiary diversity and reduces information bottlenecks, achieving state-of-the-art results on CURE-Bench without end-to-end fine-tuning. The study also reveals a Context-Performance Paradox (reasoning contexts longer than 12k tokens introduce noise that degrades accuracy) and a Curse of Dimensionality in expanding tool spaces, and it proposes hierarchical indexing and training-free adaptation as remedies. The framework shows scalable potential for trustworthy AI-driven clinical decision support, and the code is released for reproducibility.

Abstract

Current clinical agents built on small LLMs, such as TxAgent, suffer from a Context Utilization Failure: thanks to supervised finetuning, the models successfully retrieve biomedical evidence, but they fail to ground their diagnoses in that evidence. In this work, we propose the Executor-Analyst Framework, a modular architecture that decouples the syntactic precision of tool execution from the semantic robustness of clinical reasoning. By orchestrating specialized TxAgents (Executors) with long-context foundation models (Analysts), we mitigate the reasoning deficits observed in monolithic models. Beyond simple modularity, we demonstrate that a Stratified Ensemble strategy significantly outperforms global pooling by preserving evidentiary diversity, effectively addressing the information bottleneck. Furthermore, our stress tests reveal two critical scaling insights: (1) a Context-Performance Paradox, where extending reasoning contexts beyond 12k tokens introduces noise that degrades accuracy; and (2) the Curse of Dimensionality in action spaces, where expanding toolsets necessitates hierarchical retrieval strategies. Crucially, our approach underscores the potential of training-free architectural engineering, achieving state-of-the-art performance on CURE-Bench without the need for expensive end-to-end finetuning. This provides a scalable, agile foundation for the next generation of trustworthy AI-driven therapeutics. Code has been released at https://github.com/June01/CureAgent.

Paper Structure

This paper contains 15 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Analysis of TxAgent failed cases on the validation set. (a)(b)(c) Examples of failed cases highlighting different error modes in TxAgent. (d) Distribution and types of failure modes ($n=73$ out of 413 multi-choice questions) observed in TxAgent, providing insights into common pitfalls and areas for improvement.
  • Figure 2: Overview of the Executor-Analyst Collaborative Framework. We address the context utilization failure by decoupling clinical reasoning into three specialized phases: (1) The Executor: This agent focuses solely on precise information retrieval, utilizing a self-consistency mechanism to aggregate top-$k$ evidence and reasoning traces from the ToolUniverse. (2) The Analyst: Freed from tool-use syntax, this long-context foundation model acts as a reasoner, performing fact verification and supplementing information missing from the tools via search and synthesis on the noisy evidence stream. Note that while this figure depicts a linear flow, we further enhance performance via a Stratified Ensemble topology (detailed in the section on topology). (3) Post-processing Module: A deterministic layer ensuring format compliance and deduplication.
  • Figure 3: Comparison of Topological Configurations. (a) Global Pooling aggregates all evidence into a single context. While this reduces noise, it risks filtering out rare but valid cues. (b) Stratified Ensemble (Ours) partitions agents into parallel subgroups. This Late Fusion approach preserves diverse reasoning paths until the final stage, mitigating the information bottleneck.
  • Figure 4: Performance of TxAgent with Self-consistency Mechanisms on phase2.
  • Figure 5: Analysis of performance vs context length.
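
The two topologies compared in Figure 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `group_size` parameter, the use of majority voting as the late-fusion rule, and the `toy_analyst` stand-in (which simply returns the most frequent evidence label instead of calling an LLM) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical types: each Executor returns a list of evidence strings,
# and an Analyst maps one pooled evidence context to a single answer.
Evidence = List[str]
Analyst = Callable[[List[str]], str]


def flatten(outputs: Sequence[Evidence]) -> List[str]:
    """Concatenate several Executor outputs into one context."""
    return [item for output in outputs for item in output]


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (ties go to the first one seen)."""
    if not answers:
        raise ValueError("cannot vote on an empty list of answers")
    return Counter(answers).most_common(1)[0][0]


def global_pooling(outputs: Sequence[Evidence], analyst: Analyst) -> str:
    """Topology (a): one Analyst call over all evidence at once."""
    return analyst(flatten(outputs))


def stratified_ensemble(
    outputs: Sequence[Evidence], analyst: Analyst, group_size: int
) -> str:
    """Topology (b): one Analyst call per subgroup, fused at the end.

    Majority voting is an assumed fusion rule, chosen for simplicity.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups = [
        outputs[i : i + group_size] for i in range(0, len(outputs), group_size)
    ]
    answers = [analyst(flatten(group)) for group in groups]
    return majority_vote(answers)


def toy_analyst(context: List[str]) -> str:
    """Stand-in for an LLM Analyst: picks the most frequent evidence label."""
    return majority_vote(context)
```

In this toy setting, a single verbose Executor can dominate the pooled context, whereas under the stratified topology it influences only its own subgroup's vote. This shows the mechanics only and makes no claim about the accuracy figures reported in the paper.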