Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding

Hanyin Wang, Zhenbang Wu, Gururaj Kolar, Hariprasad Korsapati, Brian Bartlett, Bryan Hull, Jimeng Sun

TL;DR

This work tackles automated Diagnosis-Related Group (DRG) coding from clinical notes, an out-of-distribution, knowledge-intensive task for large language models. It introduces DRG-Sapphire, a model built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards to produce both DRG codes and physician-validated reasoning traces. A key finding is that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, implying that investing in SFT can be more effective and computationally efficient than scaling RL alone for such tasks. The study also reveals that RL gains are bounded by base-model knowledge prior to RL, highlights the importance of knowledge infusion through SFT, and discusses broader challenges of applying RL to knowledge-intensive, out-of-distribution tasks in healthcare.
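The log-linear scaling claim above can be made concrete with a small fitting sketch. It assumes the relationship has the form accuracy ≈ a + b · ln(N_SFT), where N_SFT is the number of SFT examples. The function name `fit_log_linear` and the numbers below are illustrative placeholders and are not taken from the paper.

```python
import math


def fit_log_linear(sft_sizes, accuracies):
    """Ordinary least-squares fit of accuracy ~ a + b * ln(n_sft).

    Returns (a, b). Hypothetical helper for illustration only.
    """
    xs = [math.log(n) for n in sft_sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic placeholder data generated from a known line (NOT results from the paper).
sizes = [1_000, 10_000, 100_000]
accs = [0.10 + 0.05 * math.log(n) for n in sizes]
a, b = fit_log_linear(sizes, accs)

# Under such a fit, each tenfold increase in SFT data adds b * ln(10) to accuracy.
gain_per_decade = b * math.log(10)
```

If a relationship of this form holds, equal accuracy gains require multiplying the SFT set size by a constant factor, not adding a constant number of examples.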

Abstract

Diagnosis-Related Group (DRG) codes are essential for hospital reimbursement and operations but require labor-intensive assignment. Large Language Models (LLMs) struggle with DRG coding due to the out-of-distribution (OOD) nature of the task: pretraining corpora rarely contain private clinical or billing data. We introduce DRG-Sapphire, which uses large-scale reinforcement learning (RL) for automated DRG coding from clinical notes. Built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards, DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not seen in previous mathematical tasks. Our model achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, significantly enhancing explainability. Our study further sheds light on broader challenges of applying RL to knowledge-intensive, OOD tasks. We observe that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that RL effectiveness is fundamentally constrained by the domain knowledge encoded in the base model. For OOD tasks like DRG coding, strong RL performance requires sufficient knowledge infusion prior to RL. Consequently, scaling SFT may be more effective and computationally efficient than scaling RL alone for such tasks.
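As a rough illustration of GRPO with a rule-based reward, the sketch below scores a group of sampled completions for one clinical note and normalizes each reward against its group's mean and standard deviation. The `<answer>` tag format, the exact-match reward, and the names `rule_based_reward` and `group_relative_advantages` are assumptions made for this example; the reward design and RL enhancements used for DRG-Sapphire are not described in this section.

```python
import re
import statistics


def rule_based_reward(completion: str, gold_drg: str) -> float:
    """Hypothetical reward: 1.0 if the tagged answer equals the gold DRG code, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0  # missing or malformed answer tag earns no reward
    return 1.0 if match.group(1).strip() == gold_drg else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Uses the population standard deviation; implementations differ on this
    detail, so treat it as an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one note; "CODE_A" is a placeholder label, not a real DRG.
completions = [
    "reasoning ... <answer>CODE_A</answer>",
    "reasoning ... <answer>CODE_B</answer>",
    "reasoning with no answer tag",
    "reasoning ... <answer> CODE_A </answer>",
]
rewards = [rule_based_reward(c, "CODE_A") for c in completions]
advantages = group_relative_advantages(rewards)
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal. This is one way to see why a base model that rarely produces a correct code gives RL little to work with.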

Paper Structure

This paper contains 75 sections, 11 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Main Results. (A) Accuracy of DRG coding on the MIMIC-IV test set (N=26,244). DRG-Sapphire outperforms proprietary reasoning models and the previous SOTA model, DRG-LLaMA. Notably, classification models cannot generate reasoning for DRG code assignments. (B) Best RL performance increases linearly with the logarithm of the SFT sample size. The dashed line marks the point where 50% of the training data was used for SFT. Best results from vanilla GRPO runs are shown.
  • Figure 2: Examples of Cognitive Behaviors.
  • Figure 3: Overview of Pipeline. We construct a CoT cold-start dataset using Qwen2.5-7B-Instruct, followed by SFT with this dataset and large-scale GRPO.
  • Figure 4: Expert Reader Study.
  • Figure 5: Impact of SFT-GRPO Data Ratios on DRG-Small Subset. (A–E) GRPO consistently improves Pass@1 and Maj@8 across all SFT ratios but reduces Pass@8. (F) Total training time decreases with higher SFT ratios, as GRPO is more time-consuming.
  • ...and 7 more figures
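
The Pass@1, Maj@8, and Pass@8 metrics named in the Figure 5 caption can be sketched as follows, using their common definitions: Pass@k checks whether any of k sampled answers is correct, and Maj@k checks whether the most frequent of k sampled answers is correct. The paper's exact evaluation code is not shown in this section, so the function names and the sample data below are illustrative assumptions.

```python
from collections import Counter


def pass_at_k(samples, gold, k):
    """1.0 if any of the first k sampled answers equals the gold label, else 0.0."""
    return float(gold in samples[:k])


def maj_at_k(samples, gold, k):
    """1.0 if the most frequent answer among the first k samples equals the gold label."""
    top_answer, _ = Counter(samples[:k]).most_common(1)[0]
    return float(top_answer == gold)


# Eight sampled answers for one note; the letters are placeholder labels, not DRG codes.
samples = ["A", "A", "B", "A", "C", "A", "A", "B"]
gold = "B"

p1 = pass_at_k(samples, gold, k=1)   # first sample only
p8 = pass_at_k(samples, gold, k=8)   # any of the eight samples
m8 = maj_at_k(samples, gold, k=8)    # majority vote over the eight samples
```

In this toy case the correct label appears among the samples, so Pass@8 is satisfied, but the majority vote and the first sample both miss it. Pass@8 rewards diversity across samples, while Pass@1 and Maj@8 reward concentrating probability on one answer, so the metrics can move in opposite directions.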