LEAD: LLM-enhanced Engine for Author Disambiguation

Giusy Giulia Tuccari, Lorenzo Giammei, Andrea Giovanni Nuzzolese, Misael Mongiovì, Antonio Zinilli, Francesco Poggi

TL;DR

This work tackles cross-source author name disambiguation by linking Italian career records from CercaUniversità with Scopus profiles using LEAD, a hybrid pipeline that fuses semantic cues from LLMs with structural signals from co-authorship and citation networks. It shows that a selective two-stage approach, which relies primarily on bibliographic coupling and invokes LLM reasoning only on ambiguous cases, achieves the best accuracy among the compared methods (F1 = 96.7%) at lower computational cost than full LLM methods. The study also clarifies the complementary roles of network-driven signals and semantic content: bibliographic coupling provides a strong baseline, while LLMs offer targeted disambiguation in hard cases. The findings suggest that hybrid LLM-based strategies can improve data quality for scientometric analyses, with potential applicability beyond the Italian context to broader cross-database entity resolution tasks.

Abstract

Author Name Disambiguation (AND) is a long-standing challenge in bibliometrics and scientometrics, as name ambiguity undermines the accuracy of bibliographic databases and the reliability of research evaluation. This study addresses the problem of cross-source disambiguation by linking academic career records from CercaUniversità, the official registry of Italian academics, with author profiles in Scopus. We introduce LEAD (LLM-enhanced Engine for Author Disambiguation), a novel hybrid framework that combines semantic features extracted through Large Language Models (LLMs) with structural evidence derived from co-authorship and citation networks. Using a gold standard of 606 ambiguous cases, we compare five methods: (i) Label Spreading on co-authorship networks; (ii) Bibliographic Coupling on citation networks; (iii) a standalone LLM-based approach; (iv) an LLM-enriched configuration; and (v) the proposed hybrid pipeline. LEAD achieves the best performance (F1 = 96.7%, accuracy = 95.7%) with lower computational cost than full LLM models. Bibliographic Coupling emerges as the fastest and strongest single-source method. These findings demonstrate that integrating semantic and structural signals within a selective hybrid strategy offers a robust and scalable solution to cross-database author identification. Beyond the Italian case, this work highlights the potential of hybrid LLM-based methods to improve data quality and reliability in scientometric analyses.
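
The selective strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the score definition (a raw count of shared cited references), the thresholds `min_strength` and `min_margin`, and the `llm_fallback` callable are all assumptions introduced here for illustration. The idea is that candidates are scored by bibliographic coupling, a clear winner is accepted directly, and the more expensive LLM step runs only when the scores are ambiguous.

```python
from typing import Callable, Dict, Optional, Set


def coupling_strength(refs_a: Set[str], refs_b: Set[str]) -> int:
    """Bibliographic coupling strength: number of cited references in common."""
    return len(refs_a & refs_b)


def resolve(
    seed_refs: Set[str],
    candidates: Dict[str, Set[str]],
    llm_fallback: Callable[[Dict[str, int]], Optional[str]],
    min_strength: int = 3,  # hypothetical threshold, not taken from the paper
    min_margin: int = 2,    # hypothetical threshold, not taken from the paper
) -> Optional[str]:
    """Pick a candidate profile id, escalating to the fallback when ambiguous.

    seed_refs:  references cited by publications known to belong to the academic.
    candidates: candidate profile id -> references cited by that profile.
    """
    scores = {
        cid: coupling_strength(seed_refs, refs) for cid, refs in candidates.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None

    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0

    # Stage 1: accept when the structural evidence is strong and well separated.
    if best_score >= min_strength and best_score - runner_up >= min_margin:
        return best_id

    # Stage 2: ambiguous case, defer to the (assumed) LLM-based resolver.
    return llm_fallback(scores)
```

In this sketch `llm_fallback` receives the coupling scores so that a downstream prompt could include them as evidence; any callable with that shape can stand in for it, for example `lambda scores: None` to leave ambiguous cases unresolved.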

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Processing workflow to generate the ground truth.
  • Figure 2: Prompt used for zero-shot author disambiguation.
  • Figure 3: Example of candidate's information.
  • Figure 4: Prompt used in the hybrid approach: the LLM receives both the candidate's textual metadata and additional evidence from Bibliographic Coupling and Label Spreading.