Entity Alignment with Noisy Annotations from Large Language Models

Shengyuan Chen, Qinggang Zhang, Junnan Dong, Wen Hua, Qing Li, Xiao Huang

TL;DR

The paper tackles cross-KG entity alignment (EA) when labels come from noisy LLM annotations and the number of LLM queries is capped by a fixed budget. It introduces LLM4EA, which combines active selection of source entities with a probabilistic label refiner that filters noisy LLM-generated labels, and it iteratively trains a base EA model whose feedback guides the next round of selection. Empirical results on OpenEA show LLM4EA achieving state-of-the-art performance, with GPT-4 outperforming GPT-3.5 as the annotator, and a cost-efficiency advantage when a cheaper LLM is paired with budget-aware tuning. The work presents label refinement and budget-aware querying as practical approaches for scalable EA in noisy, real-world settings.

Abstract

Entity alignment (EA) aims to merge two knowledge graphs (KGs) by identifying equivalent entity pairs. While existing methods heavily rely on human-generated labels, it is prohibitively expensive to incorporate cross-domain experts for annotation in real-world scenarios. The advent of Large Language Models (LLMs) presents new avenues for automating EA with annotations, inspired by their comprehensive capability to process semantic information. However, it is nontrivial to directly apply LLMs for EA since the annotation space in real-world KGs is large. LLMs could also generate noisy labels that may mislead the alignment. To this end, we propose a unified framework, LLM4EA, to effectively leverage LLMs for EA. Specifically, we design a novel active learning policy to significantly reduce the annotation space by prioritizing the most valuable entities based on the entire inter-KG and intra-KG structure. Moreover, we introduce an unsupervised label refiner to continuously enhance label accuracy through in-depth probabilistic reasoning. We iteratively optimize the policy based on the feedback from a base EA model. Extensive experiments demonstrate the advantages of LLM4EA on four benchmark datasets in terms of effectiveness, robustness, and efficiency. Codes are available via https://github.com/chensyCN/llm4ea_official.
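
The loop described in the abstract (select entities, query the annotator, refine the noisy labels, feed the result back into selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `run_budgeted_alignment`, `drop_conflicting_targets`, and `toy_annotate`, the even split of the budget across iterations, and the one-to-one conflict filter that stands in for the paper's probabilistic label refiner are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Set, Tuple

Pair = Tuple[str, str]  # (source entity, predicted target entity)


def drop_conflicting_targets(labels: Set[Pair]) -> Set[Pair]:
    """Toy refiner (assumption): keep a label only if its target is claimed once.

    The paper's refiner uses probabilistic reasoning; this one-to-one
    consistency check is only a simple stand-in.
    """
    counts = Counter(target for _, target in labels)
    return {(s, t) for s, t in labels if counts[t] == 1}


def run_budgeted_alignment(
    source_entities: List[str],
    annotate: Callable[[str], str],            # hypothetical noisy annotator
    refine: Callable[[Set[Pair]], Set[Pair]],  # hypothetical label refiner
    score: Callable[[str, Set[Pair]], float],  # hypothetical selection priority
    budget: int,
    iterations: int,
) -> Set[Pair]:
    """Spend at most `budget` annotator queries over `iterations` rounds."""
    labels: Set[Pair] = set()
    refined: Set[Pair] = set()
    queried: Set[str] = set()
    per_round = budget // iterations  # assumption: even split; 0 means no queries
    for _ in range(iterations):
        # Rank entities not yet queried, using feedback from the refined labels.
        candidates = [e for e in source_entities if e not in queried]
        candidates.sort(key=lambda e: score(e, refined), reverse=True)
        for entity in candidates[:per_round]:
            labels.add((entity, annotate(entity)))
            queried.add(entity)
        # Re-filter all labels collected so far.
        refined = refine(labels)
    return refined


# Toy run: the annotator mislabels "c" as "A", which collides with "a".
calls: List[str] = []


def toy_annotate(entity: str) -> str:
    calls.append(entity)
    return {"a": "A", "b": "B", "c": "A", "d": "D"}[entity]


result = run_budgeted_alignment(
    source_entities=["a", "b", "c", "d"],
    annotate=toy_annotate,
    refine=drop_conflicting_targets,
    score=lambda entity, refined: 1.0,  # uniform priority in this toy run
    budget=4,
    iterations=2,
)
```

In this toy run the conflict filter discards both labels that point to "A", including the correct one for "a". That loss of correct labels is one reason a simple filter like this is only a placeholder for a refiner that reasons about which of the conflicting labels is more likely to be right.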

Paper Structure

This paper contains 28 sections, 12 equations, 4 figures, 7 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of the LLM4EA framework. LLM4EA utilizes active sampling to select important entities based on feedback from an EA model. It also includes a label refiner to effectively train the base EA model using noisy pseudo-labels. Feedback from the EA model updates the selection policy.
  • Figure 2: Performance-cost comparison between GPT-3.5 and GPT-4 as the annotator, evaluated by MRR. We increase the budget for GPT-3.5 to evaluate its performance. [$n\times$] denotes using $n\times$ of the default query budget. Each experiment is repeated three times to show mean and standard deviation.
  • Figure 3: Analysis of the Label Refinement. We illustrate the evolution of the true positive rate (TPR) (left) and recall (middle) for refined labels across four datasets. Furthermore, we assess the robustness of the label refinement process by examining the TPR of refined labels against varying initial TPRs within the D-W-15K dataset (right), with initial pseudo-labels synthesized at different TPR levels.
  • Figure 4: Performance of entity alignment across four datasets with varying active sampling iterations, under a fixed query budget.