A Content-Based Novelty Measure for Scholarly Publications: A Proof of Concept

Haining Wang

TL;DR

This paper proposes a surprisal-based, information-theoretic framework to quantify scholarly novelty from content. By modeling a proxy distribution $Q$ of scholarly discourse with a GPT-2-based language model trained on Wikipedia, it measures the novelty of a manuscript via the cumulative surprisal of its word sequence, $I(x_1,\ldots,x_n) = -\sum_{i=1}^{n} \log Q(x_i \mid x_{1:i-1})$. The approach includes face validity tests (token- and sentence-level) and construct validity tests using the Authorship Verification Anthology (AVA) dataset, demonstrating that novel texts yield higher surprisal and that the measure can distinguish known groups. The method is open-source with accessible code and data, and the author discusses limitations and directions for broader, multilingual validation and domain-specific calibration.
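The cumulative surprisal above can be illustrated with a minimal sketch. It assumes the conditional probabilities $Q(x_i \mid x_{1:i-1})$ have already been obtained from some language model; the function names (`cumulative_surprisal`, `average_surprisal`), the `base` parameter, and the probability values are hypothetical and are not taken from the paper's released code.

```python
import math
from typing import Sequence


def cumulative_surprisal(cond_probs: Sequence[float], base: float = math.e) -> float:
    """Return -sum_i log Q(x_i | x_{1:i-1}) for precomputed conditional probabilities.

    `base` selects the unit (e for nats, 2 for bits); the choice of unit is an
    assumption here, since the formula above does not fix a logarithm base.
    """
    if any(p <= 0.0 or p > 1.0 for p in cond_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log(p) for p in cond_probs) / math.log(base)


def average_surprisal(cond_probs: Sequence[float], base: float = math.e) -> float:
    """Per-token surprisal, which makes texts of different lengths comparable."""
    if len(cond_probs) == 0:
        raise ValueError("need at least one token probability")
    return cumulative_surprisal(cond_probs, base) / len(cond_probs)


# Hypothetical per-token probabilities for two three-token sequences.
typical = [0.5, 0.25, 0.5]     # every token is fairly expected
atypical = [0.5, 0.25, 0.125]  # the last token is less expected

typical_bits = cumulative_surprisal(typical, base=2)
atypical_bits = cumulative_surprisal(atypical, base=2)
```

With these placeholder values, the sequence ending in the less expected token receives the larger cumulative surprisal, which is the behavior the measure relies on.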

Abstract

Novelty, akin to gene mutation in evolution, opens possibilities for scholarly advancement. Although peer review remains the gold standard for evaluating novelty in scholarly communication and resource allocation, the vast volume of submissions necessitates an automated measure of scholarly novelty. Adopting a perspective that views novelty as the atypical combination of existing knowledge, we introduce an information-theoretic measure of novelty in scholarly publications. This measure quantifies the degree of 'surprise' perceived by a language model that represents the word distribution of scholarly discourse. The proposed measure is accompanied by face and construct validity evidence; the former demonstrates correspondence to scientific common sense, and the latter is endorsed through alignment with novelty evaluations from a select panel of domain experts. Additionally, characterized by its interpretability, fine granularity, and accessibility, this measure addresses gaps prevalent in existing methods. We believe this measure holds great potential to benefit editors, stakeholders, and policymakers, and it provides a reliable lens for examining the relationship between novelty and academic dynamics such as creativity, interdisciplinarity, and scientific advances.

Paper Structure (17 sections, 3 equations, 2 figures)

Introduction
Literature Review
Novelty Measures Based on Reference, Keyword, Title, and Abstract
Novelty Measures Based on Content
Considerations for Designing a Novelty Measure
An Information-Theoretical Understanding of Novelty in Scholarly Publications
Approximating Scholarly Discourse With a Language Model
Language Modeling
Modeling Scholarly Discourse
Notes on Surprisal Calculation
Assessing Face Validity Using Illustrative Examples
A Token-Level Examination
A Sentence-Level Examination
Assessing Construct Validity Using Known Groups
Authorship Verification Anthology Dataset
...and 2 more sections

Figures (2)

Figure 1: Illustrative examples showing the probability rankings of potential subsequent tokens given a common preceding context. Given the same history, the GPT-2 model, trained exclusively on English Wikipedia, assigns the next token 'room' a probability of 0.05 and 'low' a probability of 0.17. The lower probability of 'room' corresponds to a higher surprisal, indicating that room-temperature observation of quantum states is more novel.
Figure 2: The left panel presents the distribution of text lengths in the AVA dataset, measured in words. The right panel shows box plots of the average surprisal scores for two groups of papers, one of which is unanimously deemed more novel according to assessments by domain experts.
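
As a worked reading of the Figure 1 probabilities, the token-level surprisals can be computed directly. The natural logarithm is an assumption here, since the caption does not state a base; the ranking of the two tokens is the same in any base.

```latex
\begin{align*}
I(\text{room}) &= -\ln 0.05 \approx 3.00 \text{ nats}, \\
I(\text{low})  &= -\ln 0.17 \approx 1.77 \text{ nats}, \\
I(\text{room}) - I(\text{low}) &= \ln\frac{0.17}{0.05} \approx 1.22 \text{ nats}.
\end{align*}
```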
