RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs in Medicine

Jiatan Huang, Mingchen Li, Zonghai Yao, Dawei Li, Yuxin Zhang, Zhichao Yang, Yongkang Xiao, Feiyun Ouyang, Xiaohan Li, Shuo Han, Hong Yu

TL;DR

RiTeK introduces a large-scale, expert-validated benchmark for complex reasoning over medical Textual Knowledge Graphs, integrating rich textual descriptions with diverse topologies to challenge LLM-driven retrieval. The dataset builds two medical TKGs (PharmKG and ADInt) and corresponding QA benchmarks (RiTeK-PharmKG and RiTeK-ADint) across six reasoning topologies, using a five-step construction pipeline that fuses relational templates with entity texts via GPT-4 and validates ground-truth answers with multiple LLMs and medical experts. A comprehensive evaluation of 11 retrieval models across zero-shot and few-shot settings reveals significant gaps in current methods, with retrieval-augmented approaches (notably KAR, GCR, and TOG variants) providing the strongest performance and highlighting the critical role of combining textual evidence with structured relations in medical domains. The work establishes RiTeK as a standard for assessing semi-structured medical knowledge retrieval and motivates future advances in robust, topology-aware retrieval systems capable of leveraging rich textual properties and complex ontologies.

Abstract

Answering complex real-world questions in the medical domain often requires accurate retrieval from medical Textual Knowledge Graphs (medical TKGs), as the relational path information from TKGs could enhance the inference ability of Large Language Models (LLMs). However, the main bottlenecks lie in the scarcity of existing medical TKGs, the limited expressiveness of their topological structures, and the lack of comprehensive evaluations of current retrievers for medical TKGs. To address these challenges, we first develop a Dataset for LLMs Complex Reasoning over medical Textual Knowledge Graphs (RiTeK), covering a broad range of topological structures. Specifically, we synthesize realistic user queries integrating diverse topological structures, relational information, and complex textual descriptions. We conduct a rigorous medical expert evaluation process to assess and validate the quality of our synthesized queries. RiTeK also serves as a comprehensive benchmark dataset for evaluating the capabilities of retrieval systems built upon LLMs. By assessing 11 representative retrievers on this benchmark, we observe that existing methods struggle to perform well, revealing notable limitations in current LLM-driven retrieval approaches. These findings highlight the pressing need for more effective retrieval systems tailored for semi-structured data in the medical domain.

Paper Structure

This paper contains 35 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The process of constructing textual structured retrieval datasets involves five main steps: 1) Relational template construction: create the relation template for the TKG using the expert-designed topological structure. 2) Textual property extraction: choose one node that meets the relational requirement as the answer node, and extract its relevant textual properties. 3) Information combination: merge the relational information and textual properties to form a natural-sounding query. 4) Additional answer filtering: check whether the remaining nodes satisfy the textual properties to establish other ground-truth nodes. 5) Expert evaluation: medical experts evaluate the naturalness, diversity, and practicality of the dataset.
  • Figure 2: A case study on RiTeK
  • Figure 3: Distribution of query lengths and answer lengths on RiTeK-ADint and RiTeK-PharmKG datasets
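
The first four steps in the Figure 1 caption can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not the authors' pipeline: the node names, the `treats` relation, the textual properties, and every function name below are hypothetical, and a plain string template stands in for the GPT-4 step that fuses relational templates with entity texts. Step 5 (expert evaluation) is a manual review and is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str
    properties: frozenset = frozenset()  # textual properties attached to the node


# Toy textual knowledge graph; every entity, relation and property is invented.
NODES = {
    "drug_a": Node("drug_a", "drug", frozenset({"taken orally", "sedating"})),
    "drug_b": Node("drug_b", "drug", frozenset({"taken orally"})),
    "drug_c": Node("drug_c", "drug", frozenset({"given by injection"})),
    "disease_x": Node("disease_x", "disease"),
}
EDGES = {
    ("drug_a", "treats", "disease_x"),
    ("drug_b", "treats", "disease_x"),
    ("drug_c", "treats", "disease_x"),
}


def candidates(relation, tail):
    """Step 1: nodes matching a one-hop relational template (?, relation, tail)."""
    return sorted(h for (h, r, t) in EDGES if r == relation and t == tail)


def build_query(relation, tail, answer, prop):
    """Step 3: merge relational information and a textual property.

    A string template is used here in place of an LLM rewrite.
    """
    return f"Which {NODES[answer].kind} {relation} {tail} and is {prop}?"


def filter_answers(cands, prop):
    """Step 4: keep every candidate whose text also satisfies the property."""
    return [c for c in cands if prop in NODES[c].properties]


def make_example(relation, tail, answer, prop):
    cands = candidates(relation, tail)
    # Step 2: the chosen answer node must satisfy the relation and own the property.
    if answer not in cands or prop not in NODES[answer].properties:
        raise ValueError("answer node does not satisfy the template or property")
    return {
        "query": build_query(relation, tail, answer, prop),
        "answers": filter_answers(cands, prop),
    }


example = make_example("treats", "disease_x", "drug_a", "taken orally")
print(example["query"])
print(example["answers"])
```

In this toy run, `drug_b` joins the ground truth alongside the originally chosen `drug_a` because it satisfies both the relation and the textual property, while `drug_c` is excluded. That is the purpose of the filtering step: a query built around one answer node may have other valid answers.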