Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning
Yuxuan Zhou, Yubin Wang, Bin Wang, Chen Ning, Xien Liu, Ji Wu, Jianye Hao
TL;DR
This work targets the gap between LLMs' benchmark performance and real-world medical use by introducing MuSeR, a data-driven framework that strengthens context-awareness through attribute-conditioned query synthesis, multifaceted self-refinement across decision-making, communication, and safety, and knowledge-distillation-enhanced fine-tuning. By generating 100k synthetic queries and employing a two-stage training process (knowledge distillation from a strong teacher, followed by supervised fine-tuning on refined responses), MuSeR yields significant gains on HealthBench across multiple backbone LLMs, achieving state-of-the-art results among open-source models. The approach demonstrates that context-aware generation and refinement can meaningfully improve safe, helpful, and patient-tailored medical advice, with practical implications for deploying LLMs in real-world healthcare settings and potential extension to other domains. The combination of data synthesis, self-evaluation, and distillation offers a scalable, cost-effective path to aligning LLM behavior with human needs in complex, safety-critical contexts.
Abstract
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along the three facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model's context-awareness ability. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains on the context-awareness axis. Furthermore, by incorporating knowledge distillation with the proposed method, a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA among open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.
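
The pipeline described in the abstract (attribute-conditioned query generation, self-evaluation and refinement along three facets, and collection of query/response pairs for supervised fine-tuning) can be illustrated with a minimal sketch. This is not the authors' released code. The attribute values, prompt wording, function names, and the choice to refine one facet at a time are all assumptions made for illustration, and `llm` stands in for any text-in, text-out model call.

```python
import itertools
import random
from typing import Callable, Dict, List

# Attribute types come from the abstract; the concrete values are hypothetical.
ATTRIBUTES: Dict[str, List[str]] = {
    "role": ["patient", "caregiver", "clinician"],
    "region": ["East Asia", "Western Europe", "Sub-Saharan Africa"],
    "intent": ["symptom triage", "medication question"],
    "ambiguity": ["low", "high"],
}

# The three facets named in the abstract.
FACETS = ("decision-making", "communication", "safety")

# Any callable mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def sample_contexts(n: int, seed: int = 0) -> List[Dict[str, str]]:
    """Sample up to n distinct attribute combinations (user contexts)."""
    rng = random.Random(seed)
    keys = list(ATTRIBUTES)
    combos = list(itertools.product(*(ATTRIBUTES[k] for k in keys)))
    picked = rng.sample(combos, k=min(n, len(combos)))
    return [dict(zip(keys, combo)) for combo in picked]


def generate_query(llm: LLM, ctx: Dict[str, str]) -> str:
    """Ask the model to write a user query conditioned on the attributes."""
    described = ", ".join(f"{k}={v}" for k, v in ctx.items())
    return llm(f"Write a realistic medical query from a user with: {described}")


def self_refine(llm: LLM, query: str) -> str:
    """Answer, then critique and revise once per facet (assumed ordering)."""
    answer = llm(f"Answer this medical query:\n{query}")
    for facet in FACETS:
        critique = llm(
            f"Evaluate the answer on {facet}.\nQuery: {query}\nAnswer: {answer}"
        )
        answer = llm(
            f"Revise the answer to address this {facet} critique.\n"
            f"Query: {query}\nAnswer: {answer}\nCritique: {critique}"
        )
    return answer


def build_sft_dataset(llm: LLM, n: int) -> List[Dict[str, object]]:
    """Produce (query, refined response) records for supervised fine-tuning."""
    records = []
    for ctx in sample_contexts(n):
        query = generate_query(llm, ctx)
        records.append(
            {"context": ctx, "query": query, "response": self_refine(llm, query)}
        )
    return records


def echo_llm(prompt: str) -> str:
    """Stand-in model used only to make the sketch runnable offline."""
    return f"[model output for: {prompt[:40]}...]"
```

Calling `build_sft_dataset(echo_llm, 4)` returns four records, each holding the sampled context, the generated query, and the refined response. In practice `echo_llm` would be replaced by a real model call, and the resulting records would be passed to a fine-tuning routine of the reader's choice.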
