Entity Alignment with Noisy Annotations from Large Language Models
Shengyuan Chen, Qinggang Zhang, Junnan Dong, Wen Hua, Qing Li, Xiao Huang
TL;DR
The paper tackles cross-KG entity alignment under noisy LLM annotations and a fixed query budget. It introduces LLM4EA, which combines active source-entity selection with a probabilistic label refiner to filter noisy LLM-generated labels and iteratively train an EA model, guided by feedback from the base model. Empirical results on OpenEA show that LLM4EA achieves state-of-the-art performance, that GPT-4 outperforms GPT-3.5, and that pairing cheaper LLMs with budget-aware tuning gives a clear cost-efficiency advantage. The work demonstrates robust label refinement and budget-aware querying as practical approaches for scalable EA in noisy, real-world settings.
Abstract
Entity alignment (EA) aims to merge two knowledge graphs (KGs) by identifying equivalent entity pairs. Existing methods rely heavily on human-generated labels, but recruiting cross-domain experts for annotation is prohibitively expensive in real-world scenarios. The advent of Large Language Models (LLMs) opens new avenues for automating EA with LLM-generated annotations, motivated by their comprehensive capability to process semantic information. However, directly applying LLMs to EA is nontrivial because the annotation space in real-world KGs is large, and LLMs can generate noisy labels that mislead the alignment. To this end, we propose a unified framework, LLM4EA, to effectively leverage LLMs for EA. Specifically, we design a novel active learning policy that significantly reduces the annotation space by prioritizing the most valuable entities based on the entire inter-KG and intra-KG structure. Moreover, we introduce an unsupervised label refiner that continuously enhances label accuracy through in-depth probabilistic reasoning. We iteratively optimize the policy based on feedback from a base EA model. Extensive experiments on four benchmark datasets demonstrate the advantages of LLM4EA in terms of effectiveness, robustness, and efficiency. Code is available at https://github.com/chensyCN/llm4ea_official.
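To make the workflow in the abstract concrete, the following is a minimal Python sketch of a budget-limited annotate-then-refine loop. It is not the LLM4EA implementation. Every name in it (`select_entities`, `refine_labels`, `align_with_budget`) is hypothetical, node degree stands in for the structure-based priority, a simple one-to-one conflict filter stands in for the probabilistic label refiner, and the feedback from the base EA model is omitted entirely.

```python
from typing import Callable, Dict, List, Tuple

# A noisy annotation: (predicted target entity, confidence in [0, 1]).
Annotation = Tuple[str, float]


def select_entities(degree: Dict[str, int], labeled: Dict[str, Annotation], k: int) -> List[str]:
    """Pick up to k unlabeled source entities, highest degree first.

    Degree is an assumed stand-in for a structure-aware priority score.
    Ties are broken alphabetically so the selection is deterministic.
    """
    candidates = [e for e in degree if e not in labeled]
    candidates.sort(key=lambda e: (-degree[e], e))
    return candidates[:k]


def refine_labels(raw: Dict[str, Annotation]) -> Dict[str, str]:
    """Resolve conflicts under an assumed one-to-one alignment constraint.

    When several source entities are mapped to the same target, only the
    pair with the highest confidence is kept. This is a simplified
    placeholder for probabilistic reasoning over noisy labels.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for src, (tgt, conf) in raw.items():
        if tgt not in best or conf > best[tgt][1]:
            best[tgt] = (src, conf)
    return {src: tgt for tgt, (src, _conf) in best.items()}


def align_with_budget(
    degree: Dict[str, int],
    annotate: Callable[[str], Annotation],
    budget: int,
    rounds: int,
) -> Dict[str, str]:
    """Spend at most `budget` annotator queries over `rounds` iterations."""
    per_round = budget // rounds
    raw: Dict[str, Annotation] = {}
    for _ in range(rounds):
        for entity in select_entities(degree, raw, per_round):
            raw[entity] = annotate(entity)  # in practice, an LLM query
    return refine_labels(raw)


# Toy usage: a fixed lookup table simulates a noisy LLM annotator.
toy_degree = {"a": 3, "b": 2, "c": 1, "d": 1}
toy_answers = {"a": ("x", 0.9), "b": ("x", 0.4), "c": ("y", 0.8), "d": ("z", 0.5)}
seed_alignment = align_with_budget(toy_degree, toy_answers.__getitem__, budget=3, rounds=3)
print(seed_alignment)  # {'a': 'x', 'c': 'y'}
```

In the toy run, the budget of three queries covers `a`, `b`, and `c`. Because `a` and `b` both claim target `x`, the lower-confidence label for `b` is discarded, and `d` is never queried. In a fuller pipeline the refined pairs would serve as training labels for a base EA model, whose output would then inform the next round of selection.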
