RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment

Xiaohan Wang, Xiaoyan Yang, Yuqi Zhu, Yue Shen, Jian Wang, Peng Wei, Lei Liang, Jinjie Gu, Huajun Chen, Ningyu Zhang

TL;DR

RuleAlign addresses the gap between LLM-based medical dialogue and physician-level diagnostic reasoning by aligning model responses with explicit diagnostic rules. The authors build UrologyRD, a rule-driven dialogue dataset, and train models in two phases: supervised fine-tuning followed by offline preference optimization that favors rule-compliant responses. They report that RuleAlign improves perplexity, ROUGE, and BLEU across multiple base models and raises scores on multidimensional standardized patient (SP) testing, indicating better information gathering, guidance, and logical deduction. The work provides a scalable method for encoding professional diagnostic rules into LLM behavior, along with a dataset and evaluation framework that can be extended in future work.

Abstract

Large Language Models (LLMs) like GPT-4, MedPaLM-2, and Med-Gemini achieve performance competitive with human experts across various medical benchmarks. However, they still face challenges in making professional diagnoses as physicians do, particularly in efficiently gathering patient information and reasoning toward the final diagnosis. To this end, we introduce the RuleAlign framework, designed to align LLMs with specific diagnostic rules. We develop a medical dialogue dataset comprising rule-based communications between patients and physicians and design an alignment learning approach through preference learning. Experimental results demonstrate the effectiveness of the proposed approach. We hope that our work can serve as an inspiration for exploring the potential of LLMs as AI physicians.
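
The preference-learning step described above can be illustrated with a small sketch. The text here does not state the exact objective, so the code assumes a DPO-style (Direct Preference Optimization) loss as one common form of offline preference optimization. The function name, argument names, and the default beta are hypothetical and chosen only for illustration. "Chosen" stands for a rule-compliant physician response and "rejected" for a non-compliant one.

```python
import math


def dpo_style_loss(policy_chosen_logp, policy_rejected_logp,
                   ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style loss for a single preference pair (an assumed objective).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or a frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as softplus(-x) in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# The loss is lower when the policy prefers the rule-compliant response.
good = dpo_style_loss(-5.0, -15.0, -10.0, -10.0)
bad = dpo_style_loss(-15.0, -5.0, -10.0, -10.0)
print(good < bad)
```

In practice such a loss would be averaged over a batch of preference pairs and minimized with a gradient-based optimizer after the supervised fine-tuning phase.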

Paper Structure

This paper contains 29 sections, 4 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: In medical practice, physicians need to gather sufficient patient information through inquiry to form a final diagnostic opinion. Because they are trained professionals, their questions are usually rule-aligned, which makes the entire process efficient and logical.
  • Figure 2: In this study, physicians' diagnostic rules are defined for specific diseases and their closely associated evidence. The rules follow both general diagnostic principles and disease-specific requirements.
  • Figure 3: The pipeline of RuleAlign. The optimization uses distinct strategies to build the preference pairs without additional human annotation.
  • Figure 4: Results for different numbers of preference pairs with Qwen1.5-7B-chat.
  • Figure 5: Radar plot of the SP testing ranks of different methods. Points closer to the outer edge indicate a higher rank and better performance.
  • ...and 5 more figures