Scholar Name Disambiguation with Search-enhanced LLM Across Language

Renyu Zhao, Yunxin Chen

TL;DR

This work addresses the challenge of scholar name disambiguation under multilingual and heterogeneous data conditions by introducing a search-enhanced LLM pipeline. The approach uses a Retrieval-Augmented Generation style framework with three agents to extract scholarly profiles, retrieve native names, and compare profiles across languages, leveraging cross-language search, query rewriting, and data indexing. Experimental results show substantial gains when incorporating native-language information, with GPT-4o achieving near-human disambiguation accuracy (≈98%) and notable improvements over baselines across Chinese and non-Chinese settings. The method offers a practical, scalable pathway for accurate scholar identity resolution in global academic ecosystems and has potential applications in awards, anti-fraud, and CV/profile extraction tasks.

Abstract

Scholar name disambiguation is crucial in many real-world scenarios, including bibliometric-based candidate evaluation for awards and anti-fraud checks on application materials. Despite significant advances, current methods are limited by the complexity of heterogeneous data and often require extensive human intervention. This paper proposes a novel approach that leverages search-enhanced language models across multiple languages to improve name disambiguation. By drawing on the query rewriting, intent recognition, and data indexing capabilities of search engines, our method gathers richer information for distinguishing between entities and extracting profiles, yielding more comprehensive data. Given the strong cross-language capabilities of large language models (LLMs), combining them with search-enhanced retrieval offers substantial potential for efficient information retrieval and utilization. Our experiments demonstrate that incorporating local languages significantly enhances disambiguation performance, particularly for scholars from diverse geographic regions. This multilingual, search-enhanced methodology offers a promising direction for more efficient and accurate active scholar name disambiguation.
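
The three-agent flow summarized in the TL;DR (extract a profile, retrieve a native-language name, compare profiles) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Profile` fields, the prompt wording, the function names, and the `fake_search` / `fake_llm` stand-ins are all assumptions introduced here so that the example runs without a real search engine or LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical interfaces (not from the paper): a search backend maps a query
# to text snippets, and an LLM maps a prompt to a text answer.
SearchFn = Callable[[str], List[str]]
LLMFn = Callable[[str], str]


@dataclass
class Profile:
    name: str
    affiliation: str = ""
    native_name: Optional[str] = None
    evidence: List[str] = field(default_factory=list)


def describe(p: Profile) -> str:
    """Render the fields that the comparison prompt shows to the LLM."""
    return f"name={p.name}; affiliation={p.affiliation}; native={p.native_name}"


def extract_profile(name: str, hint: str, search: SearchFn, llm: LLMFn) -> Profile:
    """Agent 1 (assumed shape): search for the scholar, let the LLM pull an affiliation."""
    snippets = search(f"{name} {hint}")
    affiliation = llm("Extract the affiliation:\n" + "\n".join(snippets)).strip()
    return Profile(name=name, affiliation=affiliation, evidence=list(snippets))


def retrieve_native_name(profile: Profile, search: SearchFn, llm: LLMFn) -> Profile:
    """Agent 2 (assumed shape): look for the scholar's name in their native language."""
    snippets = search(f"{profile.name} {profile.affiliation} native name")
    prompt = "Extract the native-language name, or NONE:\n" + "\n".join(snippets)
    answer = llm(prompt).strip()
    profile.native_name = None if answer in ("", "NONE") else answer
    profile.evidence.extend(snippets)
    return profile


def compare_profiles(a: Profile, b: Profile, llm: LLMFn) -> bool:
    """Agent 3 (assumed shape): ask the LLM whether two profiles are the same person."""
    prompt = f"Same person? Answer YES or NO.\nA: {describe(a)}\nB: {describe(b)}"
    return llm(prompt).strip().upper().startswith("YES")


# Deterministic stand-ins so the sketch runs offline; a real system would call
# a web search API and an LLM here.
def fake_search(query: str) -> List[str]:
    if "Wei Zhang" in query or "Zhang Wei" in query:
        return ["Wei Zhang (张伟), Example University"]
    return []


def fake_llm(prompt: str) -> str:
    if prompt.startswith("Extract the affiliation"):
        return "Example University" if "Example University" in prompt else ""
    if prompt.startswith("Extract the native-language name"):
        return "张伟" if "张伟" in prompt else "NONE"
    if prompt.startswith("Same person?"):
        return "YES" if prompt.count("native=张伟") == 2 else "NO"
    return ""


def build(name: str, hint: str) -> Profile:
    """Run agents 1 and 2 in sequence for one name variant."""
    profile = extract_profile(name, hint, fake_search, fake_llm)
    return retrieve_native_name(profile, fake_search, fake_llm)
```

In this toy setup, `build("Wei Zhang", "machine learning")` and `build("Zhang Wei", "machine learning")` both resolve to the same native name, so `compare_profiles` treats them as one scholar, which illustrates why the native-language name can serve as a strong disambiguation signal for romanized name variants.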

Paper Structure

This paper contains 14 sections, 4 equations, 1 figure, 10 tables.

Figures (1)

  • Figure 1: The workflow of our disambiguation method.