The Alignment Paradox of Medical Large Language Models in Infertility Care: Decoupling Algorithmic Improvement from Clinical Decision-making Quality

Dou Liu, Ying Long, Sophia Zuoqiu, Kaipeng Xie, Runze Yang, Di Liu, Kang Li, Yiting Lin, Hanyi Liu, Rong Yin, Tian Tang

TL;DR

This study examines how post-training alignment of medical LLMs interacts with real-world infertility decision-making and reveals an alignment paradox: improvements on algorithmic metrics do not translate into higher clinical trust. It systematically compares SFT, DPO, GRPO, and ICL on a real-world infertility dataset, using a dual evaluation framework that combines automatic benchmarks with blinded doctor-in-the-loop assessments and a pyramid-shaped dataset that stresses long-tail cases. GRPO achieves the best automated performance, yet clinicians prefer SFT for its reasoning clarity and feasibility, and blinded evaluations likewise favor SFT over GRPO, exposing a gap between benchmark accuracy and clinical trust. The work argues for alignment strategies that prioritize clinically interpretable reasoning and actionable feasibility, and it suggests building evaluation and reward schemes that reflect real-world clinical priorities, including multi-center data and scalable reasoning supervision.

Abstract

Large language models (LLMs) are increasingly adopted in clinical decision support, yet aligning them with the multifaceted reasoning pathways of real-world medicine remains a major challenge. Using more than 8,000 infertility treatment records, we systematically evaluate four alignment strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and In-Context Learning (ICL), through a dual-layer framework combining automatic benchmarks with blinded doctor-in-the-loop assessments. GRPO achieves the highest algorithmic accuracy across multiple decision layers, confirming the value of reinforcement-based optimization for structured prediction tasks. However, clinicians consistently prefer the SFT model, citing clearer reasoning processes (p = 0.035) and higher therapeutic feasibility (p = 0.019). In blinded pairwise comparisons, SFT attains the highest winning rate (51.2%), outperforming both GRPO (26.2%) and even physicians' original decisions (22.7%). These results reveal an alignment paradox: algorithmic improvements do not necessarily translate into higher clinical trust and may diverge from human-centered preferences. Our findings highlight the need for alignment strategies that prioritize clinically interpretable and practically feasible reasoning, rather than solely optimizing decision-level accuracy.
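
As an illustration of how winning rates like those above could be tallied, the following is a minimal sketch that assumes each blinded comparison yields a single preferred option among the candidates. The reported rates sum to roughly 100%, which is consistent with that reading, but the exact voting protocol is not described in this summary. The function name `win_rates` and the toy votes are hypothetical and are not the paper's data.

```python
from collections import Counter


def win_rates(votes):
    """Return each candidate's share of preferred votes, as a percentage.

    `votes` is a list of labels, one per blinded comparison, naming the
    option the rater preferred (an assumed protocol, see lead-in).
    """
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes supplied")
    return {name: 100.0 * n / total for name, n in counts.items()}


# Toy, made-up votes for illustration only; NOT the study's data.
toy_votes = ["SFT", "GRPO", "SFT", "Physician", "SFT", "GRPO", "SFT", "Physician"]
rates = win_rates(toy_votes)

for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rate:.1f}%")
```

Because every vote goes to exactly one option, the rates always sum to 100% before rounding, so a small deviation in published figures would reflect rounding rather than missing votes.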

Paper Structure

This paper contains 22 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the Alignment and Evaluation Framework.
  • Figure 2: Doctor-in-the-loop evaluation.
  • Figure 3: Reinforcement learning–based alignment improves F1 scores in IVF and PGT, but consistently reduces performance in ICSI. All values are reported as percentages.
  • Figure 4: Differences between GRPO and SFT in Subtype Analysis.
  • Figure 5: Cross-metric confusion analysis for COS regimen and ART prediction.