EventRL: Enhancing Event Extraction with Outcome Supervision for Large Language Models

Jun Gao; Huan Zhao; Wei Wang; Changlong Yu; Ruifeng Xu

TL;DR

EventRL introduces outcome supervision for event extraction (EE) with LLMs, addressing instruction-following failures and hallucination by shaping policy updates with rewards based on Trigger-F1 and Argument-F1. The framework initializes from supervised fine-tuning, then applies reinforcement learning with an Arg-F1, AVG-F1, or Prod-F1 reward, paired with stabilization techniques such as Teacher-Force Threshold and Advantage Clipping to keep training stable. Experiments on ACE05 across LLaMa and CodeLLaMa show EventRL consistently outperforms Few-Shot Prompting and standard SFT, with notable gains on unseen event types and when the base model was pretrained on code data. The work highlights the importance of reward design and data modality, shows that larger models generalize better only up to a point, and finds that outcome supervision can yield robust EE with better-formed event structures and fewer undefined event types, albeit at higher computational cost and with dataset-quality considerations.
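The three reward variants named above can be illustrated with a minimal sketch. It assumes, based only on the names, that AVG-F1 is the mean of Trigger-F1 and Argument-F1 and that Prod-F1 is their product; the paper's exact definitions may differ. The helper names `f1` and `rewards`, the tuple representation of triggers and arguments, and the convention that two empty sets score 1.0 are all illustrative assumptions.

```python
from collections import Counter


def f1(predicted, gold):
    """F1 between two lists of hashable items, using multiset overlap."""
    if not predicted and not gold:
        # Assumed convention: nothing to extract and nothing extracted is a perfect score.
        return 1.0
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def rewards(trigger_f1, argument_f1):
    """Candidate scalar rewards for one sampled response (assumed forms)."""
    return {
        "Arg-F1": argument_f1,
        "AVG-F1": (trigger_f1 + argument_f1) / 2,
        "Prod-F1": trigger_f1 * argument_f1,
    }


# Hypothetical example: the trigger is right, but one of two arguments is missing.
gold_triggers = [("Attack", "fired")]
pred_triggers = [("Attack", "fired")]
gold_args = [("Attack", "attacker", "troops"), ("Attack", "target", "convoy")]
pred_args = [("Attack", "attacker", "troops")]

scores = rewards(f1(pred_triggers, gold_triggers), f1(pred_args, gold_args))
```

Under these assumptions, the product form drops to zero whenever either the trigger or the arguments are entirely wrong, while the average still gives partial credit, which is one way to see why the choice of reward matters.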

Abstract

In this study, we present EventRL, a reinforcement learning approach developed to enhance event extraction for large language models (LLMs). EventRL utilizes outcome supervision with specific reward functions to tackle prevalent challenges in LLMs, such as instruction following and hallucination, which manifest as mismatched event structures and the generation of undefined event types. We evaluate EventRL against existing methods such as Few-Shot Prompting (FSP, based on GPT-4) and Supervised Fine-Tuning (SFT) across various LLMs, including GPT-4, LLaMa, and CodeLLaMa models. Our findings show that EventRL significantly outperforms these conventional approaches at identifying and structuring events, particularly when handling novel event types. The study emphasizes the critical role of reward function selection and demonstrates the benefits of incorporating code data for better event extraction. While increasing model size leads to higher accuracy, maintaining the ability to generalize is essential to avoid overfitting.

Paper Structure (46 sections, 3 equations, 12 figures, 5 tables)

Introduction
Related Work
Event Extraction
Large Language Models and Outcome Supervision
EventRL
Overview
Initialization
Input and Output Format
Supervised Fine-tuning
Outcome Supervision with RL
Problem Formulation
Reward Function
Advantage Calculation
Stabilization Strategies in EventRL
Teacher-Force Threshold
...and 31 more sections

Figures (12)

Figure 1: Examples of common errors in LLM-Based event extraction. The left side depicts an error of generating an undefined event type, specifically an unexpected "Vote" event not included in the guidelines. The right side shows a structural mismatch error within an "Attack" event, incorporating an "Entity" argument that deviates from the pre-defined event schema.
Figure 2: The EventRL framework architecture, demonstrating the process from initialization with an SFT Model $M_0$, through iterative model updates $M_t$ to $M_{t+1}$ via Outcome Supervision. This includes using reinforcement learning with reward functions based on Trigger-F1 and Argument-F1 scores, which guide policy gradient updates for enhanced event extraction from text.
Figure 3: Illustration of the input-output format in EventRL for event extraction. The input includes Event Definitions in Python dataclass format and a natural language instruction. The output showcases a Python list of dataclass instances as the Response, representing extracted events from the given text. The complete event definitions can be found in a figure in the Appendix.
Figure 4: This chart quantifies the error counts for undefined event types and structural mismatches in event extraction on the LLaMa-7B model, comparing SFT with three EventRL training methods: Arg-F1, AVG-F1, and Prod-F1.
Figure 5: A comparison of event extraction results between LLaMa-7B + SFT and LLaMa-7B + EventRL (Prod-F1). Note that the EventRL (Prod-F1) results shown here are fully correct.
...and 7 more figures
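The dataclass-based format in Figure 3 and the two error types in Figure 1 can be sketched together as follows. This is a hedged illustration, not the paper's schema: the `Attack` field names, the `SCHEMA` mapping, and the `classify` helper are hypothetical, and the paper's complete event definitions are in its appendix.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Attack:
    # Hypothetical fields for illustration only.
    mention: str
    attacker: List[str] = field(default_factory=list)
    target: List[str] = field(default_factory=list)


# The event types the (hypothetical) guidelines allow.
SCHEMA = {"Attack": Attack}


def classify(type_name, arguments, schema=SCHEMA):
    """Label a predicted event as ok, an undefined type, or a structural mismatch."""
    if type_name not in schema:
        return "undefined_event_type"
    allowed = {f.name for f in fields(schema[type_name])}
    if not set(arguments) <= allowed:
        return "structural_mismatch"
    return "ok"


# A well-formed response is a Python list of dataclass instances.
response = [Attack(mention="fired", target=["convoy"])]

# The two error types from Figure 1, expressed as (type name, arguments) pairs.
vote_error = classify("Vote", {"mention": "voted"})
entity_error = classify("Attack", {"mention": "fired", "entity": ["troops"]})
```

Here an unexpected "Vote" event is flagged as an undefined event type, and an "Attack" event carrying an "Entity" argument is flagged as a structural mismatch because that field is not part of the assumed schema.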
