A Content-Based Novelty Measure for Scholarly Publications: A Proof of Concept

Haining Wang

TL;DR

This paper proposes a surprisal-based, information-theoretic framework for quantifying scholarly novelty from content. A GPT-2-based language model trained on English Wikipedia serves as a proxy distribution $Q$ of scholarly discourse, and the novelty of a manuscript is measured as the cumulative surprisal of its word sequence, $I(x_1,\ldots,x_n) = -\sum_{i=1}^{n} \log Q(x_i \mid x_{1:i-1})$. Face validity is tested at the token and sentence levels, and construct validity is tested on the Authorship Verification Anthology (AVA) dataset; the results indicate that novel texts yield higher surprisal and that the measure can distinguish known groups. The code and data are openly available, and the author discusses limitations and directions for broader, multilingual validation and domain-specific calibration.
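The cumulative surprisal formula can be illustrated with a minimal sketch. It assumes the conditional probabilities $Q(x_i \mid x_{1:i-1})$ have already been obtained from some language model; the function names, the base-2 logarithm (bits), and the example probabilities are illustrative assumptions, not taken from the paper's released code.

```python
import math
from typing import Sequence


def cumulative_surprisal(cond_probs: Sequence[float], base: float = 2.0) -> float:
    """Return -sum_i log Q(x_i | x_{1:i-1}) for a sequence of conditional probabilities.

    The log base is an assumption here: base 2 gives bits, base e gives nats.
    """
    if any(p <= 0.0 or p > 1.0 for p in cond_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log(p, base) for p in cond_probs)


def mean_surprisal(cond_probs: Sequence[float], base: float = 2.0) -> float:
    """Average surprisal per token, which makes texts of different lengths comparable."""
    if not cond_probs:
        raise ValueError("need at least one token probability")
    return cumulative_surprisal(cond_probs, base) / len(cond_probs)


# Hypothetical per-token probabilities for a three-token sequence.
example_probs = [0.5, 0.25, 0.125]
total_bits = cumulative_surprisal(example_probs)
average_bits = mean_surprisal(example_probs)
print(f"cumulative: {total_bits:.2f} bits, mean: {average_bits:.2f} bits per token")
```

Because the logarithm turns a product of probabilities into a sum, the three tokens contribute 1, 2, and 3 bits respectively, and less probable tokens add more to the total.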

Abstract

Novelty, akin to gene mutation in evolution, opens possibilities for scholarly advancement. Although peer review remains the gold standard for evaluating novelty in scholarly communication and resource allocation, the vast volume of submissions necessitates an automated measure of scholarly novelty. Adopting a perspective that views novelty as the atypical combination of existing knowledge, we introduce an information-theoretic measure of novelty in scholarly publications. This measure quantifies the degree of 'surprise' perceived by a language model that represents the word distribution of scholarly discourse. The proposed measure is accompanied by face and construct validity evidence; the former demonstrates correspondence to scientific common sense, and the latter is endorsed through alignment with novelty evaluations from a select panel of domain experts. Additionally, characterized by its interpretability, fine granularity, and accessibility, this measure addresses gaps prevalent in existing methods. We believe this measure holds great potential to benefit editors, stakeholders, and policymakers, and it provides a reliable lens for examining the relationship between novelty and academic dynamics such as creativity, interdisciplinarity, and scientific advances.

Paper Structure

This paper contains 17 sections, 3 equations, and 2 figures.

Figures (2)

  • Figure 1: Illustrative examples showing the probability rankings of candidate next tokens given a shared preceding context. Given the same history, a GPT-2 model trained exclusively on English Wikipedia assigns the next token 'room' a probability of 0.05 and 'low' a probability of 0.17. The lower probability corresponds to higher surprisal, indicating that room-temperature observation of quantum states is the more novel continuation.
  • Figure 2: The left panel shows the distribution of text lengths, measured in words, in the AVA dataset. The right panel shows box plots of the average surprisal scores for two groups of papers, one of which domain experts unanimously judged to be more novel.
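
The comparison in Figure 1 can be made concrete by converting the two quoted probabilities into surprisal values. The sketch below uses only the probabilities stated in the caption (0.05 for 'room', 0.17 for 'low'); the choice of bits as the unit and the helper name `surprisal_bits` are assumptions made for illustration.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Surprisal of a single token in bits: -log2 of its conditional probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(probability)


# Next-token probabilities quoted in the Figure 1 caption.
fig1_probs = {"room": 0.05, "low": 0.17}

# Map each candidate token to its surprisal.
fig1_surprisal = {token: surprisal_bits(p) for token, p in fig1_probs.items()}

# The candidate with the highest surprisal is the more novel continuation.
more_novel = max(fig1_surprisal, key=fig1_surprisal.get)

for token, bits in fig1_surprisal.items():
    print(f"{token!r}: {bits:.2f} bits")
print(f"more novel continuation: {more_novel!r}")
```

Under this convention 'room' carries roughly 4.32 bits against roughly 2.56 bits for 'low', which matches the caption's reading that the room-temperature continuation is the more surprising one.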