ClinAlign: Scaling Healthcare Alignment from Clinician Preference
Shiwei Lyu, Xidong Wang, Lei Liu, Hao Zhu, Chaohe Zhang, Jian Wang, Jinjie Gu, Benyou Wang, Yue Shen
TL;DR
ClinAlign presents HealthRubrics, a physician-verified dataset of 7,034 preference examples, and HealthPrinciples, a taxonomy of 119 reusable principles, to enable scalable, rubric-grounded supervision for medical LLM alignment. The approach combines offline supervision with a principle-based data expansion and an inference-time tool that guides self-revision, achieving strong performance on HealthBench-Hard and Arena-Hard-v2 without increasing model size. Key findings show that expert-validated rubrics yield the largest gains, while principle rubrics provide competitive results through broader coverage, and that inference-time rubric guidance yields consistent improvements with diminishing returns. The work provides a practical release of data, principles, and tooling to accelerate safe and reliable clinical AI development, while acknowledging limits in intrinsic reasoning and saturation of inference-time benefits.
Abstract
Although large language models (LLMs) demonstrate expert-level medical knowledge, aligning their open-ended outputs with fine-grained clinician preferences remains challenging. Existing methods often rely on coarse objectives or unreliable automated judges that are weakly grounded in professional guidelines. We propose a two-stage framework to address this gap. First, we introduce HealthRubrics, a dataset of 7,034 physician-verified preference examples in which clinicians refine LLM-drafted rubrics to meet rigorous medical standards. Second, we distill these rubrics into HealthPrinciples: 119 broadly reusable, clinically grounded principles organized by clinical dimensions, enabling scalable supervision beyond manual annotation. We use HealthPrinciples for (1) offline alignment by synthesizing rubrics for unlabeled queries and (2) an inference-time tool for guided self-revision. Trained with our framework, a 30B-parameter model that activates only 3B parameters at inference achieves 33.4% on HealthBench-Hard, outperforming much larger models including DeepSeek-R1 and o3 and establishing a resource-efficient baseline for clinical alignment.
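To make the idea of rubric-grounded supervision concrete, the following is a minimal sketch of how a response might be scored against a weighted rubric. It is an illustration under stated assumptions, not the scoring procedure used in this work: the `Criterion` type, the `rubric_score` function, the point weights, and the clipped points-earned-over-maximum formula are all hypothetical, and `keyword_judge` is a toy stand-in for a physician or LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not the paper's schema)."""
    description: str
    keyword: str   # used only by the toy judge below
    points: int    # positive = desired behavior, negative = penalized behavior


def rubric_score(
    response: str,
    rubric: List[Criterion],
    is_met: Callable[[str, Criterion], bool],
) -> float:
    """Return points earned divided by the maximum positive points, clipped to [0, 1].

    This normalization is an assumed scheme chosen for illustration.
    """
    max_points = sum(c.points for c in rubric if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c in rubric if is_met(response, c))
    return min(1.0, max(0.0, earned / max_points))


def keyword_judge(response: str, criterion: Criterion) -> bool:
    """Toy judge: a criterion counts as met if its keyword appears in the response."""
    return criterion.keyword.lower() in response.lower()


# Hypothetical rubric for a chest-pain query.
example_rubric = [
    Criterion("Advises urgent evaluation", "seek emergency care", 5),
    Criterion("Mentions dosing guidance", "dosage", 3),
    Criterion("Makes an unsupported promise", "guaranteed cure", -4),
]

good_score = rubric_score(
    "Please seek emergency care if the chest pain persists.",
    example_rubric,
    keyword_judge,
)
bad_score = rubric_score(
    "This remedy is a guaranteed cure.",
    example_rubric,
    keyword_judge,
)
```

In this sketch the first response earns 5 of 8 possible positive points, while the second triggers only the penalized criterion and is clipped to zero. A score of this kind could serve either as an offline training signal or as feedback for a self-revision step, though how the framework actually uses rubrics is specified in the paper body rather than here.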
