LitFM: A Retrieval Augmented Structure-aware Foundation Model For Citation Graphs

Jiasheng Zhang, Jialin Chen, Ali Maatouk, Ngoc Bui, Qianqian Xie, Leandros Tassiulas, Jie Shao, Hua Xu, Rex Ying

TL;DR

LitFM addresses hallucinations and task fragmentation in literature tasks by grounding a knowledge-infused LLM in a domain-specific citation graph through a novel neighbor-aware graph retriever and multi-task instruction tuning. It unifies node- and edge-level tasks, introduces pseudo-query embeddings and diversity-aware retrieval to combat the Matthew effect, and uses chain-of-thought strategies for complex tasks such as related-work generation. Across three domains, LitFM improves retrieval precision by 28.1% and outperforms prior methods on six literature-related tasks, including citation link prediction, paper recommendation, and related-work generation. The approach is validated on large, richly annotated datasets and is open-sourced with a user interface to facilitate adoption in research workflows.

Abstract

With the advent of large language models (LLMs), managing scientific literature via LLMs has become a promising direction of research. However, existing approaches often overlook the rich structural and semantic relevance among scientific literature, which limits their ability to discern the relationships between pieces of scientific knowledge, and they suffer from various types of hallucinations. These methods also focus narrowly on individual downstream tasks, limiting their applicability across use cases. Here we propose LitFM, the first literature foundation model designed for a wide variety of practical downstream tasks on domain-specific literature, with a focus on citation information. At its core, LitFM contains a novel graph retriever that integrates graph structure by navigating citation graphs and extracting relevant literature, thereby enhancing model reliability. LitFM also leverages a knowledge-infused LLM, fine-tuned through a well-developed instruction paradigm. This paradigm enables LitFM to extract domain-specific knowledge from literature and reason about the relationships among papers. By integrating citation graphs during both training and inference, LitFM can generalize to unseen papers and accurately assess their relevance within existing literature. Additionally, we introduce new large-scale literature citation benchmark datasets in three academic fields, featuring sentence-level citation information and local context. Extensive experiments validate the superiority of LitFM, which achieves a 28.1% improvement in precision on the retrieval task and an average improvement of 7.52% over the state of the art across six downstream literature-related tasks.
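
The summary above mentions diversity-aware retrieval as a counter to the Matthew effect (highly cited papers crowding out other relevant work) but does not describe the mechanism. As a rough illustration of the general idea only, the sketch below uses a greedy maximal-marginal-relevance (MMR) style re-ranker, which is a standard technique swapped in here and not necessarily what LitFM implements. The function names, the `trade_off` parameter, and the toy embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diverse_rerank(query, candidates, k, trade_off=0.7):
    """Greedy MMR-style re-ranking (hypothetical helper, not from the paper).

    `candidates` maps a paper id to its embedding. Each step selects the
    paper maximising
        trade_off * relevance - (1 - trade_off) * redundancy,
    where redundancy is the highest similarity to any paper already chosen.
    """
    relevance = {pid: cosine(query, emb) for pid, emb in candidates.items()}
    selected = []
    remaining = set(candidates)
    while remaining and len(selected) < k:

        def score(pid):
            redundancy = max(
                (cosine(candidates[pid], candidates[s]) for s in selected),
                default=0.0,
            )
            return trade_off * relevance[pid] - (1 - trade_off) * redundancy

        # Iterate in sorted order so that ties are broken deterministically.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed): "a" and "b" are near-duplicates, "c" covers another topic.
toy_query = [1.0, 1.0]
toy_papers = {"a": [1.0, 0.9], "b": [1.0, 0.8], "c": [0.0, 1.0]}
relevance_only = diverse_rerank(toy_query, toy_papers, k=2, trade_off=1.0)
balanced = diverse_rerank(toy_query, toy_papers, k=2, trade_off=0.5)
```

With `trade_off=1.0` the ranking reduces to plain similarity search and returns the two near-duplicate papers. Lowering it to 0.5 swaps the second near-duplicate for the paper on a different topic, which is the kind of behavior a diversity-aware retriever aims for.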

Paper Structure (20 sections, 6 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: (a) Hallucinations faced by existing approaches. (b) Performance of LitFM and existing approaches on six benchmark tasks, where LitFM shows consistent superiority.
  • Figure 2: Main components of LitFM. (A) We curate citation graph benchmark datasets with enriched citation context, from which the domain-specific sets for citation instruction tuning are constructed. (B) Graph retriever with query reconstruction. Self-supervised pre-training is employed to adapt to domain properties. A topic-control strategy and re-ranking are proposed to enhance retrieval diversity. (C) Instruction paradigm for citation graph understanding, which infuses domain-specific knowledge into LLMs. (D) The retriever-augmented pipeline manages various literature-related tasks in a uniform way.
  • Figure 3: (a) Link prediction performance of existing LLMs with and without knowledge-infused fine-tuning. (b) Performance of LitFM with and without graph retriever.
  • Figure 4: (a) Performance of LitFM with different retrieval approaches. (b) Performance of LitFM with different numbers of augmentation papers.
  • Figure 5: Performance of existing LLMs with G-Retriever on citation graph tasks.
  • ...and 3 more figures