
Hengqin-RA-v1: Advanced Large Language Model for Diagnosis and Treatment of Rheumatoid Arthritis with Dataset based Traditional Chinese Medicine

Yishen Liu, Shengda Luo, Zishao Zhong, Tongtong Wu, Jianguo Zhang, Peiyao Ou, Yong Liang, Liang Liu, Hudan Pan

TL;DR

This work tackles the biases and data gaps of English-centric LLMs in Chinese medical and TCM contexts by introducing Hengqin-RA-v1, the first RA-focused LLM tailored to Traditional Chinese Medicine, and HQ-GCM-RA-C1, a comprehensive RA-centric dataset derived from ancient texts, modern literature, and clinical materials. The authors propose a progressive, data-centric training pipeline that combines structured medical-record reasoning with retrieval-enhanced generation and instance-oriented context, anchored by the CMeKG knowledge graph and domain-specific instruction data. Experimental results show Hengqin-RA-v1 achieving superior performance on TCM-RA tasks, including a 54% passing rate on TCM exams and qualitative improvements in diagnostic and treatment guidance, though some clinical-detail gaps remain. The dataset and model collectively aim to reduce bias, improve cultural and clinical fidelity in Chinese RA care, and pave the way for subsequent generations (v2, v3) and broader arthritis-focused TCMed AI systems.

Abstract

Large language models (LLMs), primarily trained on English texts, often exhibit biases and inaccuracies in Chinese contexts. Their limitations are pronounced in fields like Traditional Chinese Medicine (TCM), where cultural and clinical subtleties are vital, and are compounded by a lack of domain-specific data for conditions such as rheumatoid arthritis (RA). To address these issues, this paper introduces Hengqin-RA-v1, the first large language model specifically tailored for TCM with a focus on diagnosing and treating RA. We also present HQ-GCM-RA-C1, a comprehensive RA-specific dataset curated from ancient Chinese medical literature, classical texts, and modern clinical studies. This dataset enables Hengqin-RA-v1 to deliver accurate and culturally informed responses, bridging the gaps left by general-purpose models. Extensive experiments demonstrate that Hengqin-RA-v1 outperforms state-of-the-art models, even surpassing the diagnostic accuracy of TCM practitioners in certain cases.

Paper Structure (9 sections, 5 figures, 2 tables)

This paper contains 9 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The progressive training workflow of Hengqin-RA-v1 starts with HQ-GCM-RA-C1, followed by Data Segmentation Conversation Set Generation, Full Fine-Tuning, and LoRA. Incremental Fine-Tuning then adjusts select parameters while preserving knowledge, branching into Instance-Oriented and Entity-Relationship-Oriented Retrieval Enhancements. In parallel, TCM diagnostic logic is improved using structured medical records and refined with a sliding window for corpus context.
  • Figure 2: The data processing pipeline for enhancing TCM diagnostic and treatment logic takes Raw Data as its initial input. The data is segmented (Raw Data-$1$ to Raw Data-$N$) and combined with a System Prompt at the Input stage. The Task phase then applies task-specific processing that transforms this input into the Target Data Format, giving a structured progression from raw data to an organized output.
  • Figure 3: The sliding window approach enhances corpus context logic by extracting BibTeX entries via pdf2bib, aligning journal names using fuzzy matching, and organizing paper IDs, journal names, and categories into a structured table for improved data context.
  • Figure 4: The composition structure of HQ-GCM-RA-C1. It outlines a structured workflow for processing and translating a Chinese corpus related to TCM and medical concepts. The Q/A Pairs section contains questions and answers derived from the Chinese corpus. The Description section provides detailed explanations of medical objects, like "rheumatoid nodules in the lungs". Additionally, references to relevant books with their ISBNs are listed, integrating literature to enhance accuracy and context. Blue arrows connect these components, illustrating the transformation from corpus processing to enriched contextual data.
  • Figure 5: Experimental results showcase the medical diagnosis recommendations provided by Hengqin-RA-v1 for a specific patient case. The expert's assessment of the generated visit is distinctly highlighted in red, emphasizing critical evaluations and insights into the system's performance. This setup facilitates a thorough analysis of the model's capabilities in generating accurate and contextually relevant diagnostic recommendations while ensuring expert validation.
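
The journal-name alignment step mentioned in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the BibTeX entries have already been extracted (the pdf2bib step is not shown), it substitutes Python's standard-library `difflib` for whichever fuzzy matcher the paper used, and the journal list, the function names `align_journal` and `build_table`, and the 0.8 cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical canonical journal list; the paper does not publish the one it used.
CANONICAL_JOURNALS = [
    "Journal of Traditional Chinese Medicine",
    "Annals of the Rheumatic Diseases",
    "Arthritis Research & Therapy",
]


def align_journal(raw_name, canonical=CANONICAL_JOURNALS, cutoff=0.8):
    """Return the closest canonical journal name, or None if none is similar enough."""
    # Compare case-insensitively, then map back to the canonical spelling.
    lowered = {name.lower(): name for name in canonical}
    matches = get_close_matches(
        raw_name.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


def build_table(entries):
    """Turn (paper_id, raw_journal, category) tuples into structured rows."""
    return [
        {"paper_id": pid, "journal": align_journal(journal), "category": category}
        for pid, journal, category in entries
    ]


if __name__ == "__main__":
    # Illustrative input only; these are not entries from HQ-GCM-RA-C1.
    for row in build_table([
        ("P001", "Annals of Rheumatic Diseases", "clinical"),
        ("P002", "Nature", "other"),
    ]):
        print(row)
```

A similarity cutoff trades recall for precision: a lower value aligns more misspelled or abbreviated names but risks merging distinct journals, so unmatched names are returned as `None` here rather than forced onto the nearest candidate.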