Inferring Scientific Cross-Document Coreference and Hierarchy with Definition-Augmented Relational Reasoning

Lior Forer, Tom Hope

TL;DR

This work tackles cross-document coreference and hierarchy in scientific texts by introducing SciCo-Radar, a retrieval-driven framework that generates context-dependent singleton definitions and explicit relational definitions to augment concept mentions. A two-stage re-ranking mechanism limits the combinatorial explosion inherent in pairwise relational reasoning, enabling scalable use of large language models in both fine-tuning and in-context learning settings on SciCo. Empirical results show substantial improvements over baselines, especially on hard subsets with high surface-form variation and ambiguity, with relational definitions delivering the strongest gains. The paper also provides qualitative analyses of when and why these dynamic definitions help, discusses limitations, and outlines directions for increasing efficiency and extending the approach to broader cross-document reasoning tasks in science.

Abstract

We address the fundamental task of inferring cross-document coreference and hierarchy in scientific texts, which has important applications in knowledge graph construction, search, recommendation and discovery. Large Language Models (LLMs) can struggle when faced with many long-tail technical concepts with nuanced variations. We present a novel method which generates context-dependent definitions of concept mentions by retrieving full-text literature, and uses the definitions to enhance detection of cross-document relations. We further generate relational definitions, which describe how two concept mentions are related or different, and design an efficient re-ranking approach to address the combinatorial explosion involved in inferring links across papers. In both fine-tuning and in-context learning settings, we achieve large performance gains on data subsets with many different surface forms and high ambiguity, which are challenging for models. We provide an analysis of the generated definitions, shedding light on the relational reasoning ability of LLMs over fine-grained scientific concepts.

Paper Structure

This paper contains 45 sections, 1 equation, 9 figures, and 19 tables.

Figures (9)

  • Figure 1: We detect cross-document coreference and hierarchy by augmenting original inputs from papers with context-sensitive definitions and relational reasoning.
  • Figure 2: Overview of SciCo-Radar. We are given as input papers with concept mentions (e.g., methods and tasks). (1) For each mention, we first create singleton definitions by retrieving relevant literature and using an LLM to generate context-dependent concept definitions. These definitions are used to augment the original inputs. We use the augmented input to train an LLM to detect cross-document coreference and hierarchy. (2) Using the model trained with singleton definitions, we rank promising top-K candidates, and create for them relational definitions that explicitly reason about pairwise concept relationships and further augment the input to enhance detection.
  • Figure 3: Logic and structure of the prompt used to generate singleton and relational definitions.
  • Figure 4: Singleton definition generation prompt.
  • Figure 5: Relational definition generation prompt.
  • ...and 4 more figures
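
The two-stage flow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the per-mention top-K selection rule, and the `score_pair` and `define_relation` callables (stand-ins for the model trained with singleton definitions and for the LLM that writes relational definitions) are all assumptions made for the example. The point it shows is that the expensive relational step runs only on the pairs that survive the cheaper first-stage ranking.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Pair = Tuple[int, int]  # indices of two concept mentions, with i < j


def select_top_k_pairs(
    num_mentions: int,
    score_pair: Callable[[int, int], float],
    k: int,
) -> List[Pair]:
    """Stage 1 (assumed rule): keep each mention's k best-scoring partners."""
    # Score every unordered pair once with the cheaper first-stage model.
    scores: Dict[Pair, float] = {
        (i, j): score_pair(i, j)
        for i, j in combinations(range(num_mentions), 2)
    }
    selected = set()
    for mention in range(num_mentions):
        # All pairs that involve this mention, best score first.
        partners = [pair for pair in scores if mention in pair]
        partners.sort(key=lambda pair: scores[pair], reverse=True)
        selected.update(partners[:k])
    return sorted(selected)


def augment_with_relational_definitions(
    pairs: List[Pair],
    define_relation: Callable[[int, int], str],
) -> Dict[Pair, str]:
    """Stage 2: generate a relational definition only for the kept pairs."""
    return {pair: define_relation(i, j) for pair in pairs for i, j in [pair]}


if __name__ == "__main__":
    # Toy first-stage scores for 4 mentions (hypothetical values).
    toy_scores = {
        (0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.2,
        (1, 2): 0.3, (1, 3): 0.8, (2, 3): 0.4,
    }
    kept = select_top_k_pairs(4, lambda i, j: toy_scores[(i, j)], k=1)
    definitions = augment_with_relational_definitions(
        kept, lambda i, j: f"relation between mention {i} and mention {j}"
    )
    print(f"{len(kept)} of {len(toy_scores)} pairs get a relational definition")
    for pair, text in definitions.items():
        print(pair, text)
```

With n mentions there are n(n-1)/2 candidate pairs, while per-mention top-K selection keeps at most n·K of them, so the number of relational-definition calls grows linearly rather than quadratically in n under this assumed rule.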