Detection of metadata manipulations: Finding sneaked references in the scholarly literature

Lonni Besançon, Guillaume Cabanac, Cyril Labbé, Alexander Magazinov, Jules di Scala, Dominika Tkaczyk, Kathryn Weber-Boer

TL;DR

The paper investigates sneaked references, i.e., citations registered in a publication's metadata but absent from its published reference list, and documents a substantial instance in IJISRT. It evaluates three automated detection methods: M1 and M2, which compare Crossref metadata against references derived from the PDF text, and a baseline, M0, used for lower-bound estimation. Using a large-scale dataset of 47,170,721 documents and 2,782 Crossref records, it identifies 80,205 sneaked references, with some papers accumulating thousands of undue citations (e.g., 6,059), all benefiting IJISRT. The work highlights metadata vulnerabilities in scholarly systems and suggests practical strategies for validation and scale-up to curb citation gaming.

Abstract

We report evidence of a new set of sneaked references discovered in the scientific literature. Sneaked references are references registered in the metadata of publications without being listed in the reference section or in the full text of the actual publications where they ought to be found. We document here 80,205 references sneaked into the metadata of the International Journal of Innovative Science and Research Technology (IJISRT). These sneaked references are registered with Crossref, and all of them cite (and thus benefit) this same journal. Using this dataset, we evaluate three different methods to automatically identify sneaked references. These methods compare reference lists registered with Crossref against the full text or the reference lists extracted from PDF files. In addition, we report attempts to scale the search for sneaked references to the scholarly literature.
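
To make the comparison idea concrete, here is a minimal sketch of checking Crossref-registered references against a publication's full text. It is an illustration under stated assumptions, not the authors' M0, M1, or M2: the function names `normalize` and `find_sneaked`, the matching on normalized titles, and all sample data are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_sneaked(crossref_titles: list[str], full_text: str) -> list[str]:
    """Return registered titles that never occur in the full text.

    Assumption: a reference counts as present when its normalized title
    appears as a whole-word phrase in the normalized full text.
    """
    # Pad with spaces so that matches start and end on word boundaries.
    haystack = f" {normalize(full_text)} "
    return [
        title
        for title in crossref_titles
        if f" {normalize(title)} " not in haystack
    ]


# Hypothetical full text extracted from a PDF (not from the paper's dataset).
full_text = """
References
[1] A. Author, "Deep Learning for Crop-Yield Prediction," 2021.
[2] B. Writer, "A Survey of Graph Databases," 2019.
"""

# Hypothetical reference titles registered in the metadata.
crossref_titles = [
    "Deep learning for crop yield prediction",
    "A survey of graph databases",
    "Unrelated Study Registered Only in Metadata",
]

sneaked = find_sneaked(crossref_titles, full_text)
print(sneaked)
```

Exact phrase matching like this is brittle: text extracted from PDFs often contains hyphenation breaks or truncated titles, so a practical detector would likely need fuzzy matching or DOI-based comparison to avoid flagging legitimate references.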

Paper Structure (4 sections, 3 figures)

Figures (3)

  • Figure 1: The citation count of https://doi.org/10.38124/ijisrt/ijisrt24apr651 is 1.7k according to https://app.dimensions.ai/discover/publication?search_mode=content&search_text=10.38124%2Fijisrt%2Fijisrt24apr2251&search_type=kws&search_field=doi. As of early Dec. 2024, it benefits from at least 6,059 sneaked references (see \ref{fig:BenfBarChart}). There is no reason to think that the authors are responsible for this discrepancy.
  • Figure 2: The citation count of https://doi.org/10.38124/ijisrt/ijisrt24apr651 is 1.8k according to https://openalex.org/works/w4396228980. As of early Dec. 2024, it benefits from at least 6,059 sneaked references (see \ref{fig:BenfBarChart}). There is no reason to think that the authors are responsible for this discrepancy.
  • Figure 3: A PDF file with a list of 8 references. The reference list extracted by Grobid ($\mathcal{R}_{G}$) is missing some of the expected references (e.g., references #1, #4, and #5) and features non-existent references (e.g., $\mathcal{R}_{G}$ [6.]). The reference list registered with Crossref ($\mathcal{R}_{C}$) contains 3 sneaked references: [9.], [10.], [11.]. This is a Case 3 situation.