Traditional Chinese Medicine Case Analysis System for High-Level Semantic Abstraction: Optimized with Prompt and RAG

Peng Xu, Hongjin Wu, Jinle Wang, Rongjia Lin, Liwei Tan

TL;DR

A technical plan for building a Traditional Chinese Medicine (TCM) clinical case database through web scraping. Combining a two-stage retrieval method with Jieba-based keyword matching significantly enhanced the accuracy of model outputs.

Abstract

This paper details a technical plan for building a clinical case database for Traditional Chinese Medicine (TCM) using web scraping. Drawing on multiple platforms, including 360doc, we gathered over 5,000 TCM clinical cases, cleaned the data, and structured the dataset with key fields such as patient details, pathogenesis, syndromes, and annotations. We used the Baidu_ERNIE_Speed_128K API to remove redundant information and the DeepSeekv2 API to generate the final answers, which are output in standard JSON format. During retrieval, we optimized data recall with RAG and rerank techniques and developed a hybrid matching scheme. By combining a two-stage retrieval method with keyword matching via Jieba, we significantly enhanced the accuracy of model outputs.

Paper Structure

This paper contains 20 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An example of constructing a chain-of-thought prompt for the TCM syndrome differentiation task. Compared with the original prompt, the chain-of-thought prompt adds a reasoning path (extracting clinical information, inferring pathomechanisms, and determining syndromes) before the final diagnosis is output, so that the diagnosis follows a systematic, logical sequence.
  • Figure 2: Illustration of a two-stage retrieval process in a Retrieval-Augmented Generation (RAG) framework for TCM diagnosis, integrating user queries with external knowledge to enhance reasoning and output accuracy.
  • Figure 3: Illustration of a preprocessing pipeline for structured and unstructured data, integrating OCR, text cleaning, jieba chunking, text segmentation, and vectorization for creating a searchable hybrid database.
  • Figure 4: Comparison of Diagnostic Reasoning with and without Retrieval-Augmented Generation (RAG) in Traditional Chinese Medicine Syndrome Differentiation.
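
The hybrid matching scheme named in the abstract and in Figures 2 and 3 (a recall stage, a rerank stage, and Jieba keyword matching) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `recall_k`, `final_k`, and `alpha` parameters, and the scoring formula are all hypothetical. A bag-of-words cosine similarity stands in for the real embedding model, and a simple weighted sum stands in for the reranker. Jieba is used for tokenization when it is installed; otherwise the sketch falls back to whitespace splitting.

```python
import math
from collections import Counter

try:
    import jieba  # optional; the paper names Jieba for keyword matching

    def tokenize(text):
        # jieba.lcut returns a list of tokens; drop whitespace-only tokens.
        return [t for t in jieba.lcut(text.lower()) if t.strip()]
except ImportError:
    def tokenize(text):
        # Fallback when jieba is not installed: plain whitespace split.
        return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def keyword_overlap(query_tokens, doc_tokens):
    """Fraction of distinct query tokens that also occur in the document."""
    q = set(query_tokens)
    return len(q & set(doc_tokens)) / len(q) if q else 0.0


def hybrid_retrieve(query, cases, recall_k=3, final_k=1, alpha=0.5):
    """Two-stage retrieval sketch (hypothetical parameters).

    Stage 1 recalls the recall_k most similar cases by cosine similarity.
    Stage 2 reranks them by a weighted sum of similarity and keyword overlap.
    """
    q_tokens = tokenize(query)
    q_vec = Counter(q_tokens)
    doc_tokens = [tokenize(c) for c in cases]
    sims = [cosine(q_vec, Counter(toks)) for toks in doc_tokens]

    # Stage 1: coarse recall by vector similarity.
    recalled = sorted(range(len(cases)), key=lambda i: sims[i], reverse=True)
    recalled = recalled[:recall_k]

    # Stage 2: rerank the recalled cases with the keyword-matching signal.
    def score(i):
        return alpha * sims[i] + (1 - alpha) * keyword_overlap(q_tokens, doc_tokens[i])

    reranked = sorted(recalled, key=score, reverse=True)[:final_k]
    return [cases[i] for i in reranked]
```

Calling `hybrid_retrieve(query, cases)` with a symptom description and a list of case texts returns the top-ranked cases, which would then be passed to the generation model as context. In a real system the cosine step would presumably use dense embeddings from a vector store and the rerank step a dedicated reranking model; the weighted sum here only shows how a keyword signal can be blended into the second stage.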