Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution
Jizhao Zhu, Akang Shi, Zixuan Li, Long Bai, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
TL;DR
The paper tackles the robustness of Universal Information Extraction (UIE) by introducing RUIE-Bench, a benchmark that uses Large Language Models (LLMs) to generate diverse, realistic perturbations across NER, RE, and ED. It evaluates a wide range of UIE and traditional IE models, revealing substantial robustness gaps under perturbations and highlighting the generalization challenges of both open- and closed-source LLM-based systems. To address this, the authors propose Loss-guided Data Augmentation (LDA), which iteratively selects hard augmented samples based on inference loss, achieving a 7.5% relative improvement on RUIE-Bench with only 15% of the augmented data and an 8.9% improvement on unseen data. The work provides a cost-efficient framework for evaluating and improving the robustness of UIE systems, with practical implications for deploying more reliable UIE solutions in real-world settings.
Abstract
In this paper, we aim to enhance the robustness of Universal Information Extraction (UIE) by introducing a new benchmark dataset, a comprehensive evaluation, and a feasible solution. Existing robust benchmark datasets have two key limitations: 1) They generate only a limited range of perturbations for a single Information Extraction (IE) task, which fails to evaluate the robustness of UIE models effectively; 2) They rely on small models or handcrafted rules to generate perturbations, often resulting in unnatural adversarial examples. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench, which utilizes LLMs to generate more diverse and realistic perturbations across different IE tasks. Based on this dataset, we comprehensively evaluate existing UIE models and reveal that both LLM-based models and other models suffer from significant performance drops. To improve robustness and reduce training costs, we propose a data-augmentation solution that dynamically selects hard samples for iterative training based on the model's inference loss. Experimental results show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
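
The loss-guided selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_hard_samples` and `loss_guided_rounds`, the per-round `fraction` parameter, the choice to rank by highest loss, and the decision to keep the full pool available in every round are all assumptions made for illustration. The `loss_fn` and `train_fn` callables stand in for a real model's per-sample inference loss and training step.

```python
from typing import Callable, List, Sequence, TypeVar

Sample = TypeVar("Sample")


def select_hard_samples(
    samples: Sequence[Sample],
    loss_fn: Callable[[Sample], float],
    fraction: float = 0.15,
) -> List[Sample]:
    """Return the `fraction` of samples with the highest loss (hypothetical helper)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not samples:
        return []
    # Keep at least one sample so every round has something to train on.
    k = max(1, round(len(samples) * fraction))
    # Sort by loss, largest first; ties keep their original order.
    ranked = sorted(samples, key=loss_fn, reverse=True)
    return ranked[:k]


def loss_guided_rounds(
    samples: Sequence[Sample],
    loss_fn: Callable[[Sample], float],
    train_fn: Callable[[List[Sample]], None],
    fraction: float = 0.15,
    rounds: int = 3,
) -> List[List[Sample]]:
    """Alternate between scoring the augmented pool and training on its hardest part.

    Assumption: the whole pool is re-scored each round, so a sample can be
    selected again if its loss stays high after training.
    """
    history: List[List[Sample]] = []
    for _ in range(rounds):
        hard = select_hard_samples(samples, loss_fn, fraction)
        train_fn(hard)  # expected to update the model that loss_fn queries
        history.append(hard)
    return history
```

In practice `loss_fn` would run the current model on an augmented example and return its loss, so the ranking changes as training proceeds; that feedback is what makes the selection dynamic rather than a one-off filter.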
