Empowering Locally Deployable Medical Agent via State Enhanced Logical Skills for FHIR-based Clinical Tasks

Wanrong Yang, Zhengliang Liu, Yuan Li, Bingjie Yan, Lingfang Li, Mingguang He, Dominik Wojtczak, Yalin Zheng, Danli Shi

TL;DR

This study demonstrates that equipping models with a dynamically updatable, state-enhanced cognitive scaffold is a privacy-preserving and computationally efficient pathway for local adaptation of AI agents to clinical information systems.

Abstract

While Large Language Models demonstrate immense potential as proactive Medical Agents, their real-world deployment is severely bottlenecked by data scarcity under privacy constraints. To overcome this, we propose State-Enhanced Logical-Skill Memory (SELSM), a training-free framework that distills simulated clinical trajectories into entity-agnostic operational rules within an abstract skill space. During inference, a Query-Anchored Two-Stage Retrieval mechanism dynamically fetches these logical priors to guide the agent's step-by-step reasoning, effectively resolving the state polysemy problem. Evaluated on MedAgentBench, the only authoritative high-fidelity virtual EHR sandbox benchmarked with real clinical data, SELSM substantially elevates the zero-shot capabilities of locally deployable foundation models (30B-32B parameters). Notably, on the Qwen3-30B-A3B backbone, our framework completely eliminates task chain breakdowns to achieve a 100% completion rate, raising the overall success rate by 22.67 percentage points and significantly outperforming existing memory-augmented baselines. This study demonstrates that equipping models with a dynamically updatable, state-enhanced cognitive scaffold is a privacy-preserving and computationally efficient pathway for local adaptation of AI agents to clinical information systems. While currently validated on FHIR-based EHR interactions as an initial step, the entity-agnostic design of SELSM provides a principled foundation toward broader clinical deployment.

Paper Structure (26 sections, 17 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of the SELSM framework. (Top-left) Cross-institutional deployment context: heterogeneous hospital systems (EHR, LIS, PACS, etc.) operate under institution-specific protocols, while a locally deployed LLM, guided by medical professionals, issues privacy-preserving operational queries. (Top-right) Logical Skill Distillation (Phase 1): the agent interacts with a system simulator in a closed loop, where it observes the current state $s$, executes an action $a$ via the policy $p = \pi(a \mid s)$, and receives the system response $o$. The Logical Skill Generator $\mathcal{G}$ parses each trajectory $\tau = (s, a, o)$ through a context encoder and experience decoder to produce entity-agnostic logical skills $e = \mathcal{G}(\tau \mid \theta)$, comprising Operational Logic, canonical examples, and reasoning traces. (Bottom) Query-Anchored Two-Stage Retrieval (Phase 3): upon receiving a new task, the system first performs Task-Level Query Filtering by scoring stored records $\mathcal{R} = (q, \mathcal{T})$ via query similarity $p_q$, then executes Transition-Level State Ranking over candidate internal states via state similarity $p_s$, and finally integrates the retrieved skill with the current state to guide the agent toward a better operational action.
  • Figure 2: Failure mode distribution and performance improvement analysis. (a)-(c) Each stacked bar decomposes all 300 tasks into four mutually exclusive outcomes: Correct (task completed with the correct answer), Incorrect (task completed but with a wrong answer), Invalid Action (terminated due to an invalid API call), and Task Limit (terminated after exceeding the maximum number of interaction turns) for GLM4-32B, Qwen3-30B-A3B, and Qwen3-32B, respectively. (d) Absolute improvement of our method over the Baseline in percentage points (pp) across three metrics: Overall Success Rate, Query Success Rate, and Action Success Rate.
  • Figure 3: Conversation efficiency analysis. (a)-(c) One-Shot Correct Rate (OSR; Eq. \ref{eq:osr}) for each method on GLM4-32B, Qwen3-30B-A3B, and Qwen3-32B, respectively. (d) Average number of conversation turns per task across all three backbone models.
  • Figure 4: Multi-dimensional comparison of four methods on the Qwen3-30B-A3B backbone. Five normalized dimensions are shown: Success Rate ($\text{SR}/100$; Eq. \ref{eq:sr}), Completion Rate ($\text{TC}/100$; Eq. \ref{eq:tc}), Error Robustness (ER; Eq. \ref{eq:er}), Efficiency ($\text{OSR}/100$; Eq. \ref{eq:osr}), and Query-Action Balance (QAB; Eq. \ref{eq:qab}). A larger enclosed area indicates better overall performance.
  • Figure 5: Token efficiency analysis. Each point represents one method-model configuration, with the $x$-axis showing $\bar{C}_{\mathrm{tok}}$ (Eq. \ref{eq:token_cost}) and the $y$-axis showing SR (Eq. \ref{eq:sr}). Colors denote methods (Baseline, A-Mem, ExpeL, Ours) and marker shapes denote backbone models (square: GLM4-32B, circle: Qwen3-30B-A3B, triangle: Qwen3-32B). Points closer to the upper-left corner indicate higher accuracy at lower token cost.
  • ...and 1 more figure
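
The Query-Anchored Two-Stage Retrieval described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Record`, `Transition`, `cosine`, and `retrieve_skill`, the `top_k` parameter, the bag-of-words similarity (a stand-in for whatever similarity model realizes $p_q$ and $p_s$), and the example skill strings are all assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transition:
    state: str  # textual description of an internal state s (hypothetical format)
    skill: str  # entity-agnostic logical skill e stored for that state


@dataclass
class Record:
    query: str                     # task query q
    transitions: List[Transition]  # stored trajectory T


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; an assumed stand-in for p_q and p_s."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def retrieve_skill(
    memory: List[Record], query: str, state: str, top_k: int = 2
) -> Optional[str]:
    # Stage 1 (task-level query filtering): keep the top_k records whose
    # stored query is most similar to the new task query.
    candidates = sorted(memory, key=lambda r: cosine(query, r.query), reverse=True)
    candidates = candidates[:top_k]
    # Stage 2 (transition-level state ranking): among the surviving records,
    # pick the transition whose state best matches the current state.
    transitions = [t for r in candidates for t in r.transitions]
    if not transitions:
        return None
    return max(transitions, key=lambda t: cosine(state, t.state)).skill


# Hypothetical memory: the same state text appears under different tasks,
# so state similarity alone cannot pick the right skill.
memory = [
    Record("record a blood pressure measurement for the patient", [
        Transition("task received no request sent yet",
                   "POST an Observation with the vital sign code"),
    ]),
    Record("find the latest lab value for the patient", [
        Transition("task received no request sent yet",
                   "GET Observation filtered by code and sorted by date"),
        Transition("server returned empty result for get request",
                   "report that no value is available"),
    ]),
]

print(retrieve_skill(memory, "find the most recent lab value for this patient",
                     "task received no request sent yet", top_k=1))
```

In this toy memory both records contain an identical state string, which is one way to picture the state polysemy problem named in the abstract: filtering by the task query first restricts the state ranking to transitions from a relevant task.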