Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines

Saurabh Srivastava, Sweta Pati, Ziyu Yao

TL;DR

This work investigates how annotation guidelines—textual descriptions of event types and their arguments—affect instruction-tuning of large language models for event extraction (EE). By representing EE outputs in a code-based format and augmenting instructions with both human- and machine-generated guidelines, the authors conduct extensive experiments across ACE05 and RichERE datasets, multiple data regimes, and diverse model architectures, including LLaMA and Qwen. They show that well-constructed guidelines improve event-type discrimination, cross-schema generalization, and data-scarce performance, with machine-generated guidelines often outperforming human-written ones, especially when diversity is ensured through multiple variants. The results demonstrate robust gains across models, domains, and schemas, highlighting the practical value of automated guideline generation for scalable EE systems and pointing to promising directions for future zero-shot and low-resource extraction tasks.

Abstract

In this work, we study the effect of annotation guidelines (textual descriptions of event types and arguments) when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amount of training data and highlight their effectiveness in improving cross-schema generalization and low-frequency event-type performance.
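
To make the idea of a code-format schema augmented with a guideline concrete, here is a minimal sketch. It assumes an event type is written as a Python class whose docstring carries the guideline text. The class name `Attack`, its argument roles, the guideline wording, the `Event` base class named in the rendered text, and the `render_schema` helper are all illustrative assumptions, not the paper's actual prompts or code.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Attack:
    """Hypothetical guideline: a violent physical act causing harm or damage."""
    mention: str                                        # trigger span in the text
    attacker: List[str] = field(default_factory=list)   # who carried out the act
    target: List[str] = field(default_factory=list)     # who or what was harmed


def render_schema(event_cls) -> str:
    """Render an event class as code-style prompt text, guideline included."""
    lines = [
        f"class {event_cls.__name__}(Event):",   # `Event` is an assumed base name
        f'    """{event_cls.__doc__}"""',        # the guideline goes in the docstring
    ]
    for f in fields(event_cls):
        type_name = "str" if f.type is str else "List[str]"
        lines.append(f"    {f.name}: {type_name}")
    return "\n".join(lines)


# Schema text that would be placed in the instruction (assumed layout).
prompt_schema = render_schema(Attack)

# Target output for a sentence such as "Rebels bombed the bridge."
expected_output = [Attack(mention="bombed", attacker=["rebels"], target=["bridge"])]

print(prompt_schema)
print(expected_output)
```

In this sketch, dropping the docstring line from `render_schema` gives a no-guideline variant of the same instruction, which is the kind of contrast the abstract describes.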

Paper Structure

This paper contains 36 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of our exploration of automatically generating annotation guidelines to augment code-format instruction tuning for EE. Prompt template for Guideline-PN and the example outputs are shown.
  • Figure 2: Error categorization: CA (Context Ambiguity), PE (Parsing Errors), MAE (Missing Arguments/Events), AE (Argument Errors), TTE (Type/Trigger Errors), and LN (Label Noise).
  • Figure 3: Impact of guidelines on AC scores per ET, sorted by frequency in the full training set. A smaller index indicates a higher frequency. Green/red bars indicate improvements/declines. Dashed/solid lines denote average AC scores without/with guidelines.