LEAD: LLM-enhanced Engine for Author Disambiguation
Giusy Giulia Tuccari, Lorenzo Giammei, Andrea Giovanni Nuzzolese, Misael Mongiovì, Antonio Zinilli, Francesco Poggi
TL;DR
This work tackles cross-source author name disambiguation by linking Italian academic career records from CercaUniversità with Scopus author profiles using LEAD, a hybrid pipeline that combines semantic cues from LLMs with structural signals from co-authorship and citation networks. It shows that a selective two-stage approach, which relies primarily on bibliographic coupling and invokes LLM reasoning only for ambiguous cases, achieves higher accuracy (F1 ≈ 96.7%) and better efficiency than fully LLM-based methods. The study also clarifies the complementary roles of network-driven signals and semantic content: bibliographic coupling provides a strong baseline, while LLMs offer targeted disambiguation in hard cases. The findings suggest that hybrid LLM-based strategies can substantially improve data quality for scientometric analyses, with potential applicability beyond the Italian context to broader cross-database entity resolution tasks.
Abstract
Author Name Disambiguation (AND) is a long-standing challenge in bibliometrics and scientometrics, as name ambiguity undermines the accuracy of bibliographic databases and the reliability of research evaluation. This study addresses the problem of cross-source disambiguation by linking academic career records from CercaUniversità, the official registry of Italian academics, with author profiles in Scopus. We introduce LEAD (LLM-enhanced Engine for Author Disambiguation), a novel hybrid framework that combines semantic features extracted through Large Language Models (LLMs) with structural evidence derived from co-authorship and citation networks. Using a gold standard of 606 ambiguous cases, we compare five methods: (i) Label Spreading on co-authorship networks; (ii) Bibliographic Coupling on citation networks; (iii) a standalone LLM-based approach; (iv) an LLM-enriched configuration; and (v) the proposed hybrid pipeline. LEAD achieves the best performance (F1 = 96.7%, accuracy = 95.7%) at a lower computational cost than the fully LLM-based configurations. Bibliographic Coupling emerges as the fastest and strongest single-source method. These findings demonstrate that integrating semantic and structural signals within a selective hybrid strategy offers a robust and scalable solution to cross-database author identification. Beyond the Italian case, this work highlights the potential of hybrid LLM-based methods to improve data quality and reliability in scientometric analyses.
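As an illustration of the selective two-stage idea described above, the following is a minimal sketch rather than the authors' implementation. It assumes that a set of cited references is already available for the registry record and for each candidate Scopus profile. The function names, the `min_margin` decision rule, and the `llm_judge` callable are hypothetical and introduced here only for illustration.

```python
from typing import Callable, Dict, Optional, Set


def coupling_strength(refs_a: Set[str], refs_b: Set[str]) -> int:
    """Bibliographic coupling strength: number of cited references shared by two sets."""
    return len(refs_a & refs_b)


def link_author(
    known_refs: Set[str],
    candidates: Dict[str, Set[str]],
    llm_judge: Optional[Callable[[Dict[str, Set[str]]], str]] = None,
    min_margin: int = 2,  # hypothetical threshold, not taken from the paper
) -> Optional[str]:
    """Pick a candidate profile by coupling strength; defer to an LLM only when unclear.

    known_refs : references assumed to be associated with the registry record.
    candidates : mapping from candidate profile id to the references it cites.
    llm_judge  : hypothetical stand-in for the LLM stage; returns a profile id.
    """
    if not candidates:
        return None

    # Stage 1: score every candidate with the cheap structural signal.
    scores = {cid: coupling_strength(known_refs, refs) for cid, refs in candidates.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0

    # Accept the structural answer only when it clearly separates the top candidate.
    if best_score > 0 and best_score - runner_up >= min_margin:
        return best_id

    # Stage 2: the case is ambiguous, so hand it to the (expensive) semantic stage.
    if llm_judge is not None:
        return llm_judge(candidates)
    return None
```

Under these assumptions, most cases are resolved by the set intersection alone, and the LLM stage is called only for the remaining ambiguous cases, which is where the cost saving of a selective strategy would come from.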
