GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction

Oscar Sainz, Iker García-Ferrero, Rodrigo Agerri, Oier Lopez de Lacalle, German Rigau, Eneko Agirre

TL;DR

GoLLIE tackles the challenge of zero-shot Information Extraction by fine-tuning a Code-LLaMA-based LLM to follow annotation guidelines. It introduces a code-based input-output representation and embeds guidelines within the prompt, reinforced by training-time regularization to ensure guideline adherence. Empirical results show GoLLIE outperforms prior zero-shot IE approaches (e.g., Instruct-UIE, zs4ie) across diverse domains, with an ablation study highlighting the critical role of guideline details and representative candidates. The work suggests that guideline-driven, instruction-tuned LLMs can substantially reduce reliance on extensive labeled data while generalizing to unseen schemas, with implications for scalable, cross-domain IE.
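
To make the idea of a code-based representation with embedded guidelines more concrete, here is a minimal sketch. It assumes labels are written as Python classes whose docstrings carry the annotation guidelines, and that the model answers with a Python list of instances. The class names, guideline wording, prompt layout, and the helpers `build_prompt` and `parse_output` are all hypothetical illustrations, not the exact format or code used by GoLLIE.

```python
import ast
from dataclasses import dataclass


# Hypothetical schema: names and guideline text are illustrative assumptions.
@dataclass
class Entity:
    span: str


@dataclass
class Person(Entity):
    """Names of people, including fictional characters. Exclude titles such as 'Dr.'."""


@dataclass
class Organization(Entity):
    """Companies, institutions, and agencies. Exclude generic mentions such as 'the firm'."""


def build_prompt(label_classes, text):
    """Render each label as a class definition with its guideline as a docstring,
    followed by the input text and the start of the expected answer."""
    parts = []
    for cls in label_classes:
        parts.append(
            f"@dataclass\nclass {cls.__name__}(Entity):\n"
            f'    """{cls.__doc__}"""\n'
        )
    parts.append(f"text = {text!r}")
    parts.append("result =")
    return "\n".join(parts)


def parse_output(completion, label_classes):
    """Parse a completion such as "[Person(span='...')]" into instances.
    Uses the ast module rather than eval, and skips labels outside the schema."""
    known = {cls.__name__: cls for cls in label_classes}
    tree = ast.parse(completion.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of annotations")
    results = []
    for node in tree.body.elts:
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in known
        ):
            # Only keyword arguments with literal values are supported in this sketch.
            kwargs = {
                kw.arg: ast.literal_eval(kw.value)
                for kw in node.keywords
                if kw.arg is not None
            }
            results.append(known[node.func.id](**kwargs))
    return results


prompt = build_prompt([Person, Organization], "Marie Curie worked at the University of Paris.")
annotations = parse_output(
    "[Person(span='Marie Curie'), Organization(span='University of Paris')]",
    [Person, Organization],
)
```

In this sketch, changing a guideline only means editing a docstring, which is one way a schema unseen during training could be described to the model at inference time.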

Abstract

Large Language Models (LLMs) combined with instruction tuning have made significant progress when generalizing to unseen tasks. However, they have been less successful in Information Extraction (IE), lagging behind task-specific models. Typically, IE tasks are characterized by complex annotation guidelines that describe the task and give examples to humans. Previous attempts to leverage such information have failed, even with the largest models, as they are not able to follow the guidelines out of the box. In this paper, we propose GoLLIE (Guideline-following Large Language Model for IE), a model able to improve zero-shot results on unseen IE tasks by virtue of being fine-tuned to comply with annotation guidelines. Comprehensive evaluation empirically demonstrates that GoLLIE is able to generalize to and follow unseen guidelines, outperforming previous attempts at zero-shot information extraction. The ablation study shows that detailed guidelines are key for good results.

Paper Structure (37 sections, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Out-of-domain zero-shot NER results. GPT results are not available for all domains.
  • Figure 2: Example of the input and output of the model.
  • Figure 3: Example of the input representation. (left) An example of an event definition without guideline information. (right) The same example with guideline information added as Python comments.
  • Figure 4: Seen vs unseen label zero-shot performance, results aggregated from all datasets.
  • Figure 5: Example of generalization to custom tasks defined by the user.
  • ...and 3 more figures