Evaluating Generative Language Models in Information Extraction as Subjective Question Correction
Yuchen Fan, Yantao Liu, Zijun Yao, Jifan Yu, Lei Hou, Juanzi Li
TL;DR
This paper tackles the problem that standard evaluation metrics inadequately measure semantic correctness and completeness when large language models (LLMs) are evaluated on information extraction (IE) tasks. It introduces SQC-Score, which combines two components: a matcher, fine-tuned on subjective question correction data, that semantically aligns predictions with gold labels, and an NLI-based complementer that augments the gold labels with entailed but missing answers. On the NYT-11 and ACE-2005 RE/ED/EAE tasks, SQC-Score correlates better with human judgment than traditional metrics, especially on shallow IE tasks, where LLMs show larger gains. The work highlights that LLMs can match or exceed conventional models on simple IE tasks when evaluated with SQC-Score, yet they still struggle with strictly structured event extraction, underscoring the need for improved evaluation and model alignment in IE research. These insights, along with the released datasets and code, offer a practical pathway to more human-aligned evaluation and targeted improvements in IE with LLMs.
Abstract
Modern Large Language Models (LLMs) have shown remarkable ability on various tasks that require sophisticated cognitive behaviors. Nevertheless, a paradoxical performance discrepancy is observed: these models underperform on seemingly elementary tasks such as relation extraction and event extraction because of two issues in conventional evaluation: (1) the imprecision of existing evaluation metrics, which struggle to gauge semantic consistency between model outputs and ground truth, and (2) the inherent incompleteness of evaluation benchmarks, primarily due to restrictive human annotation schemas, which leads to underestimated LLM performance. Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score. This method uses LLMs fine-tuned on subjective question correction data to refine the matching between model outputs and golden labels. Additionally, by incorporating a Natural Language Inference (NLI) model, SQC-Score enriches the golden labels, addressing benchmark incompleteness by acknowledging correct yet previously omitted answers. Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics. Using SQC-Score, we conduct a comprehensive evaluation of state-of-the-art LLMs and provide insights for future research on information extraction. The dataset and associated code can be accessed at https://github.com/THU-KEG/SQC-Score.
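The two-stage idea described above (complement the golden labels, then score predictions by soft matching) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the released implementation: `jaccard` is a toy token-overlap stand-in for the fine-tuned LLM matcher, `is_entailed` is a placeholder callable standing in for the NLI model, `soft_f1` is a hypothetical name, the aggregation into a soft F1 is one plausible choice rather than the formula used by SQC-Score, and the triples are toy data not drawn from any benchmark.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for the matcher: token overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def soft_f1(
    predictions: List[str],
    gold: List[str],
    match_score: Callable[[str, str], float],
    is_entailed: Callable[[str], bool],
) -> float:
    """Soft F1 over labels after an assumed complementing step."""
    # Complementer step (assumed form): a prediction that is missing from
    # the gold labels but judged entailed is added to the label set.
    labels = list(gold)
    for pred in predictions:
        if pred not in labels and is_entailed(pred):
            labels.append(pred)

    # Matcher step: each item receives its best partial-credit score
    # against the other side instead of an exact-match 0/1.
    def best(item: str, pool: List[str]) -> float:
        return max((match_score(item, other) for other in pool), default=0.0)

    precision = (
        sum(best(p, labels) for p in predictions) / len(predictions)
        if predictions else 0.0
    )
    recall = (
        sum(best(g, predictions) for g in labels) / len(labels)
        if labels else 0.0
    )
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy triples (hypothetical): the second prediction is correct but unannotated.
predictions = ["obama born_in hawaii", "obama president_of usa"]
gold = ["obama born_in hawaii"]

strict = soft_f1(predictions, gold, jaccard, lambda p: False)
complemented = soft_f1(
    predictions, gold, jaccard, lambda p: p == "obama president_of usa"
)
```

With these toy inputs, the score rises from about 0.75 to 1.0 once the complementer stand-in accepts the second prediction, which mirrors the claim that incomplete annotation leads to underestimated performance.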
