BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

Yubin Kim, Zhiyuan Hu, Hyewon Jeong, Eugene Park, Shuyue Stella Li, Chanwoo Park, Shiyun Xiong, MingYu Lu, Hyeonhoon Lee, Xin Liu, Daniel McDuff, Cynthia Breazeal, Samir Tulebaev, Hae Won Park

TL;DR

This work tackles the challenge of enabling clinical LLMs to operate with adaptive proactivity rather than purely reactive behavior. It introduces BehaviorBench, a drama-anchored, multi-turn benchmark derived from NEJM cases to evaluate reactive and proactive capabilities, and BehaviorSFT, a behavior-conditioned fine-tuning method that uses explicit behavior tokens to steer model responses along a reactive-proactive spectrum. The approach yields up to 97.3% macro F1 on BehaviorBench, with pronounced gains on proactive tasks, and clinician evaluations indicate more realistic, safer, and appropriately proactive behavior compared to standard fine-tuning or explicit instruction baselines. These findings demonstrate that explicit behavioral conditioning can substantially improve the reliability and clinical usefulness of AI assistants in high-stakes healthcare settings, with implications for safer deployment and future expansion of proactive AI in medicine.
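One way to picture behavior-token conditioning is as a data-formatting step before supervised fine-tuning. The sketch below is illustrative only: the token strings, the `format_example` helper, and the choice to put the token at the start of the target are all assumptions, not details confirmed by the paper.

```python
# Hypothetical behavior tokens; the paper's actual token vocabulary is not given here.
BEHAVIOR_TOKENS = {
    "reactive": "<|reactive|>",
    "proactive": "<|proactive|>",
}


def format_example(context: str, response: str, behavior: str) -> dict:
    """Build one SFT record whose target begins with a behavior token.

    Assumption: placing the token at the start of the completion lets the
    model learn to emit (i.e. select) a behavior before writing its reply.
    """
    token = BEHAVIOR_TOKENS[behavior]
    return {"prompt": context, "completion": f"{token} {response}"}


record = format_example(
    context="Clinician: Please summarize today's labs.",
    response="Potassium is 6.1 mmol/L, which was not mentioned; consider repeating it.",
    behavior="proactive",
)
print(record["completion"])
```

At inference time, under the same assumption, the first generated token would indicate which behavior the model chose for the turn.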

Abstract

Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnostic reasoning), LLMs often struggle with proactive engagement, such as unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal that LLMs' proactivity is inconsistent. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blind clinician evaluations confirmed that BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a better balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) than standard fine-tuned or explicitly instructed agents.
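The headline metric is Macro F1, the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The following is a minimal sketch of that standard definition; the function name and the toy labels are illustrative and are not taken from BehaviorBench.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): one reactive case is misclassified as proactive.
truth = ["reactive", "reactive", "proactive", "proactive"]
preds = ["reactive", "proactive", "proactive", "proactive"]
print(round(macro_f1(truth, preds), 4))  # 0.7333
```

Here the per-class F1 scores are 0.8 (proactive) and about 0.667 (reactive), and Macro F1 averages them without regard to class frequency.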

Paper Structure

This paper contains 40 sections, 1 equation, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Six representative tasks from BehaviorBench, showcasing the spectrum of agent behaviors in clinical settings. The figure illustrates (a-c, f) proactive tasks where the LLM agent identifies issues or offers insights without direct prompting, and (b, d, e) reactive tasks responding to explicit clinician queries.
  • Figure 2: Density distributions of (I) Specificity and (II) Implicitness scores for Baseline, BehaviorSFT, and GeneralSFT agent outputs. (I) Specificity: Both fine-tuned models (BehaviorSFT and GeneralSFT) markedly improve output specificity over the Baseline, with distributions concentrated at high scores ($\sim$0.9). (II) Implicitness: Distinct implicitness profiles emerge: GeneralSFT is the most explicit (lowest scores, $\sim$0.6-0.7), the Baseline is the most implicit (highest scores, $\sim$0.7-0.9), while BehaviorSFT exhibits a moderate, intermediate level of implicitness ($\sim$0.7-0.8).
  • Figure 3: G-Eval with gpt-4o-mini as evaluator of Qwen-2.5-7B-Ins responses across four key metrics. We compare the average scores for the Baseline model, our proposed BehaviorSFT, and GeneralSFT. BehaviorSFT consistently outperforms the Baseline across all metrics and demonstrates competitive or superior performance compared to GeneralSFT.
  • Figure 4: The Landscape of Healthcare AI Systems and Enabling Frameworks. Systems are positioned based on their primary Task Scope (Narrow, Medium, or Broad) and their demonstrated level of System Autonomy. The autonomy levels are derived from the Six-Level Taxonomy for Healthcare AI Agent Autonomy (detailed in Table \ref{tab:autonomy_taxonomy_detailed}), ranging from L0-L1 (Assistance & Reactive Info) through L3 (Conditional Automation/Contextual Proactivity) to L4-L5 (High/Full Automation). Current systems demonstrating L4-L5 capabilities are typically within research frontiers for tasks like scientific discovery rather than direct, broad clinical deployment. Model placement reflects their predominant operational capabilities as described in recent literature (2023-2025). The progression towards higher autonomy, particularly the transition from L2 (Reactive Support) to L3 (Contextual Proactivity), necessitates significant advancements in behavioral adaptation to ensure safe and effective operation in nuanced healthcare contexts. Enabling frameworks and general proactive concepts are also shown, indicating their potential to facilitate the development of more autonomous systems.
  • Figure 5: Performance comparison on BehaviorBench for Few-Shot (k=3), Gen. SFT, and our proposed BehaviorSFT. Tasks are colored by task category: Reactive, Balanced, and Proactive. The radar plot illustrates that BehaviorSFT achieves the best or second-best performance across all task categories. While all methods perform strongly on Reactive and Balanced tasks, the gains from BehaviorSFT are most pronounced in complex Proactive scenarios, highlighting its effectiveness in enhancing agents' nuanced behavioral capabilities beyond standard fine-tuning approaches.
  • ...and 14 more figures