Inferring Scientific Cross-Document Coreference and Hierarchy with Definition-Augmented Relational Reasoning
Lior Forer, Tom Hope
TL;DR
This work tackles cross-document coreference and hierarchy in scientific texts by introducing SciCo-Radar, a retrieval-driven framework that generates context-dependent singleton definitions and explicit relational definitions to augment concept mentions. A two-stage re-ranking mechanism limits the combinatorial explosion inherent in pairwise relational reasoning, enabling scalable use of large language models in both fine-tuning and in-context learning settings on SciCo. Empirical results show substantial improvements over baselines, especially on hard subsets with high surface-form variation and ambiguity, with relational definitions delivering the strongest gains. The paper also provides qualitative analyses of when and why these dynamic definitions help, discusses limitations, and outlines directions for increasing efficiency and extending the approach to broader cross-document reasoning tasks in science.
Abstract
We address the fundamental task of inferring cross-document coreference and hierarchy in scientific texts, which has important applications in knowledge graph construction, search, recommendation, and discovery. Large Language Models (LLMs) can struggle when faced with many long-tail technical concepts with nuanced variations. We present a novel method that generates context-dependent definitions of concept mentions by retrieving full-text literature and uses these definitions to enhance detection of cross-document relations. We further generate relational definitions, which describe how two concept mentions are related or different, and design an efficient re-ranking approach to address the combinatorial explosion involved in inferring links across papers. In both fine-tuning and in-context learning settings, we achieve large gains in performance on data subsets with a high number of different surface forms and high ambiguity, which are challenging for models. We provide an analysis of the generated definitions, shedding light on the relational reasoning ability of LLMs over fine-grained scientific concepts.
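To illustrate why re-ranking helps with the combinatorial explosion: n concept mentions yield n(n-1)/2 candidate pairs, so generating a relational definition for every pair quickly becomes expensive. The sketch below shows one generic two-stage scheme, in which a cheap scorer shortlists pairs and an expensive scorer is applied only to the shortlist. It is a minimal illustration under stated assumptions, not the procedure used in the paper: the function names (`jaccard`, `shortlist_pairs`, `rerank`), the token-overlap first stage, and the `expensive_scorer` callback (a stand-in for an LLM-based step) are all hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Cheap token-overlap score between two strings (hypothetical first-stage scorer)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def shortlist_pairs(definitions: dict[str, str], top_k: int) -> list[tuple[str, str]]:
    """Stage 1: score all n(n-1)/2 mention pairs cheaply and keep the top_k pairs.

    `definitions` maps a mention id to its (assumed) singleton definition text.
    """
    scored = [
        (jaccard(definitions[m1], definitions[m2]), m1, m2)
        for m1, m2 in combinations(sorted(definitions), 2)
    ]
    # Highest score first; ties broken by mention ids for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(m1, m2) for _, m1, m2 in scored[:top_k]]


def rerank(pairs, expensive_scorer):
    """Stage 2: apply the expensive scorer only to the shortlisted pairs."""
    return sorted(pairs, key=expensive_scorer, reverse=True)


definitions = {
    "a": "neural network model",
    "b": "deep neural network model",
    "c": "protein folding",
}
shortlist = shortlist_pairs(definitions, top_k=2)
# Placeholder for an expensive step such as relational-definition generation.
ranked = rerank(shortlist, lambda pair: jaccard(definitions[pair[0]], definitions[pair[1]]))
```

With this design the expensive step runs `top_k` times rather than once per pair, which is the general reason a shortlist-then-rerank scheme scales better than exhaustive pairwise reasoning.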
