ClinAlign: Scaling Healthcare Alignment from Clinician Preference

Shiwei Lyu, Xidong Wang, Lei Liu, Hao Zhu, Chaohe Zhang, Jian Wang, Jinjie Gu, Benyou Wang, Yue Shen

TL;DR

ClinAlign presents HealthRubrics, a physician-verified dataset of 7,034 preference examples, and HealthPrinciples, a taxonomy of 119 reusable principles, to enable scalable, rubric-grounded supervision for medical LLM alignment. The approach combines offline supervision with a principle-based data expansion and an inference-time tool that guides self-revision, achieving strong performance on HealthBench-Hard and Arena-Hard-v2 without increasing model size. Key findings show that expert-validated rubrics yield the largest gains, while principle rubrics provide competitive results through broader coverage, and that inference-time rubric guidance yields consistent improvements with diminishing returns. The work provides a practical release of data, principles, and tooling to accelerate safe and reliable clinical AI development, while acknowledging limits in intrinsic reasoning and saturation of inference-time benefits.

Abstract

Although large language models (LLMs) demonstrate expert-level medical knowledge, aligning their open-ended outputs with fine-grained clinician preferences remains challenging. Existing methods often rely on coarse objectives or unreliable automated judges that are weakly grounded in professional guidelines. We propose a two-stage framework to address this gap. First, we introduce HealthRubrics, a dataset of 7,034 physician-verified preference examples in which clinicians refine LLM-drafted rubrics to meet rigorous medical standards. Second, we distill these rubrics into HealthPrinciples: 119 broadly reusable, clinically grounded principles organized by clinical dimensions, enabling scalable supervision beyond manual annotation. We use HealthPrinciples for (1) offline alignment by synthesizing rubrics for unlabeled queries and (2) an inference-time tool for guided self-revision. Trained with our framework, a 30B-parameter model that activates only 3B parameters at inference achieves 33.4% on HealthBench-Hard, outperforming much larger models including DeepSeek-R1 and o3 and establishing a resource-efficient baseline for clinical alignment.
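The abstract's second use of HealthPrinciples, rubric-guided self-revision at inference time, can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `RubricItem`, `rubric_score`, `guided_self_revision`, and the toy judge and reviser are illustrative stand-ins, and the scoring rule (points earned divided by the maximum positive points, clipped to [0, 1]) is one common rubric-scoring convention assumed for the example. In a real system the judge and reviser would be model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RubricItem:
    """One rubric criterion (hypothetical schema). Negative points mark undesirable behavior."""
    criterion: str
    points: int


def rubric_score(met: List[bool], rubric: List[RubricItem]) -> float:
    """Points earned on met criteria over the maximum positive points, clipped to [0, 1]."""
    if len(met) != len(rubric):
        raise ValueError("met and rubric must have the same length")
    max_points = sum(item.points for item in rubric if item.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(item.points for item, ok in zip(rubric, met) if ok)
    return min(1.0, max(0.0, earned / max_points))


Judge = Callable[[str, str, List[RubricItem]], List[bool]]
Reviser = Callable[[str, str, List[RubricItem]], str]


def guided_self_revision(query: str, draft: str, rubric: List[RubricItem],
                         judge: Judge, revise: Reviser, max_rounds: int = 2) -> str:
    """Revise a draft until no rubric issue remains or max_rounds is reached."""
    answer = draft
    for _ in range(max_rounds):
        met = judge(query, answer, rubric)
        # An issue is a positive criterion that is missing,
        # or a non-positive criterion that was triggered.
        issues = [item for item, ok in zip(rubric, met) if (item.points > 0) != ok]
        if not issues:
            break
        answer = revise(query, answer, issues)
    return answer


# Toy stand-ins for the model calls (assumptions for illustration only).
rubric = [
    RubricItem("mentions when to seek urgent care", 5),
    RubricItem("states a specific dose without context", -3),
]


def toy_judge(query: str, answer: str, rubric: List[RubricItem]) -> List[bool]:
    text = answer.lower()
    return ["urgent care" in text, "take 500 mg" in text]


def toy_revise(query: str, answer: str, issues: List[RubricItem]) -> str:
    return answer + " If symptoms worsen, seek urgent care."


draft = "Rest and drink fluids."
final = guided_self_revision("example query", draft, rubric, toy_judge, toy_revise)
draft_score = rubric_score(toy_judge("example query", draft, rubric), rubric)
final_score = rubric_score(toy_judge("example query", final, rubric), rubric)
```

In this toy run the draft misses the positive criterion, one revision round adds it, and the loop stops on the next check. Capping `max_rounds` is a design choice in the sketch that fits the TL;DR's observation of diminishing returns from inference-time guidance.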

Paper Structure (53 sections, 17 figures, 3 tables)

This paper contains 53 sections, 17 figures, and 3 tables.

Figures (17)

  • Figure 1: Scatter plot of performance where the x-axis shows the HealthBench-Hard score and the y-axis shows the Arena-Hard-v2 Creative Writing score. Marker size is proportional to the model parameter count.
  • Figure 2: Method overview. (Top) HealthRubrics: we draft rubrics with GPT-5.1 for real-world medical queries and multi-model responses, then have physicians refine them into validated preference supervision. (Bottom) HealthPrinciples: we distill recurring rubric patterns into reusable, scenario-specific principles, used to (i) scale rubric-grounded supervision to new questions and (ii) provide rubric references for inference-time self-revision.
  • Figure 3: HealthBench scores vs. training epoch on a random 70/30 split with 3K training questions and 2K held-out questions, evaluated using the official HealthBench script.
  • Figure 4: Prompt used to draft per-instance rubrics from a clinician-labeled pairwise preference.
  • Figure 5: Prompt used to rewrite rubrics according to physician revision and review feedback.
  • ...and 12 more figures