A Content-Based Novelty Measure for Scholarly Publications: A Proof of Concept
Haining Wang
TL;DR
This paper proposes a surprisal-based, information-theoretic framework for quantifying scholarly novelty from content. It approximates the word distribution of scholarly discourse with a proxy distribution $Q$, a GPT-2-based language model trained on Wikipedia, and measures the novelty of a manuscript as the cumulative surprisal of its word sequence, $I(x_1,\ldots,x_n) = -\sum_{i=1}^{n} \log Q(x_i \mid x_{1:i-1})$. The approach is supported by face validity tests (at the token and sentence levels) and by construct validity tests using the Authorship Verification Anthology (AVA) dataset, which show that novel texts yield higher surprisal and that the measure can distinguish known groups. The code and data are openly available, and the paper discusses limitations and directions for broader, multilingual validation and domain-specific calibration.
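To make the cumulative-surprisal formula concrete, here is a minimal sketch in Python. It is not the paper's implementation: the GPT-2-based model is swapped for a toy add-alpha bigram model, so $Q$ conditions only on the previous token rather than on the full prefix $x_{1:i-1}$. The function names `train_bigram` and `cumulative_surprisal`, the `<s>` and `<unk>` symbols, and the choice of base-2 logarithms (bits) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_bigram(corpus, alpha=1.0):
    """Fit an add-alpha smoothed bigram model (a toy stand-in for Q).

    `corpus` is a list of token lists. Returns a function q(tok, prev)
    giving the smoothed probability of `tok` following `prev`.
    """
    # Vocabulary plus one <unk> slot that absorbs unseen tokens.
    vocab = {tok for sent in corpus for tok in sent} | {"<unk>"}
    counts = defaultdict(Counter)
    for sent in corpus:
        prev = "<s>"  # hypothetical start-of-sequence symbol
        for tok in sent:
            counts[prev][tok] += 1
            prev = tok
    vocab_size = len(vocab)

    def q(tok, prev):
        following = counts[prev]
        total = sum(following.values())
        return (following[tok] + alpha) / (total + alpha * vocab_size)

    return q


def cumulative_surprisal(tokens, q):
    """Return I(x_1..x_n) = -sum_i log2 Q(x_i | x_{i-1}), in bits."""
    total = 0.0
    prev = "<s>"
    for tok in tokens:
        total += -math.log2(q(tok, prev))
        prev = tok
    return total


if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    q = train_bigram(corpus)
    # A sequence seen in training versus an atypical reordering.
    print(cumulative_surprisal(["the", "cat", "sat"], q))
    print(cumulative_surprisal(["sat", "the", "the"], q))
```

Under these assumptions, the sequence that recombines familiar tokens in an unfamiliar order receives the higher score, which mirrors the intuition that atypical combinations are more surprising to the model. Replacing `q` with the conditional probabilities of a neural language model would recover the full-prefix form of the formula.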
Abstract
Novelty, akin to gene mutation in evolution, opens possibilities for scholarly advancement. Although peer review remains the gold standard for evaluating novelty in scholarly communication and resource allocation, the sheer volume of submissions calls for an automated measure of scholarly novelty. Viewing novelty as the atypical combination of existing knowledge, we introduce an information-theoretic measure of novelty in scholarly publications. This measure quantifies the degree of 'surprise' perceived by a language model that represents the word distribution of scholarly discourse. The proposed measure is accompanied by face and construct validity evidence: the former demonstrates correspondence to scientific common sense, and the latter is supported by alignment with novelty evaluations from a select panel of domain experts. Because it is interpretable, fine-grained, and accessible, the measure also addresses gaps prevalent in existing methods. We believe it can benefit editors, stakeholders, and policymakers, and that it offers a lens for examining the relationship between novelty and academic dynamics such as creativity, interdisciplinarity, and scientific advances.
