GenEARL: A Training-Free Generative Framework for Multimodal Event Argument Role Labeling

Hritik Bansal, Po-Nien Kung, P. Jeffrey Brantingham, Kai-Wei Chang, Nanyun Peng

TL;DR

GenEARL presents a training-free, two-stage framework for multimodal EARL that uses a frozen GVLM to generate object-centric event descriptions and a frozen LLM to assign event argument roles, enabling strong zero- and few-shot generalization to unseen events without event-annotated training data. The approach outperforms zero-shot CLIP baselines on M2E2 and SWiG and shows competitive performance with few-shot LLM prompting, while ablations confirm the importance of the LLM in interpreting GVLM outputs and the benefits of rich conditioning in prompts. Human evaluation indicates room for improvement in the quality of GVLM-generated descriptions, suggesting future gains from improved prompt design and GVLM capabilities. Overall, GenEARL demonstrates a practical, training-free alternative for flexible and generalizable multimodal EARL with potential for rapid adaptation to new domains and event types. The work highlights the growing utility of generative models for structured vision-language tasks without requiring costly event-annotated data, offering a scalable path for real-world multimodal event understanding.

Abstract

Multimodal event argument role labeling (EARL), a task that assigns a role to each event participant (object) in an image, is a complex challenge. It requires reasoning over the entire image, the depicted event, and the interactions between the various objects participating in the event. Existing models rely heavily on high-quality event-annotated training data to understand event semantics and structures, and they fail to generalize to new event types and domains. In this paper, we propose GenEARL, a training-free generative framework that harnesses the power of modern generative models to understand event task descriptions given image contexts and to perform the EARL task. Specifically, GenEARL comprises two stages of generative prompting with a frozen vision-language model (VLM) and a frozen large language model (LLM). First, a generative VLM learns the semantics of the event argument roles and generates event-centric object descriptions based on the image. Subsequently, an LLM is prompted with the generated object descriptions, embedded in a predefined template, to perform EARL (i.e., to assign an event argument role to an object). We show that GenEARL outperforms the contrastive pretraining (CLIP) baseline by 9.4% and 14.2% accuracy for zero-shot EARL on the M2E2 and SWiG datasets, respectively. In addition, we outperform CLIP-Event by 22% precision on the M2E2 dataset. The framework also allows flexible adaptation and generalization to unseen domains.
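The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `gen_earl` and the `Describer` and `Labeler` interfaces are hypothetical, and each frozen model is assumed to be wrapped as a plain callable that hides its own prompt construction.

```python
from typing import Callable, Dict, Sequence

# Hypothetical interfaces (not from the paper): each frozen model is wrapped
# as a plain callable, so no weights are updated anywhere in the pipeline.
# Describer: (image, object, event type, roles) -> object role description
Describer = Callable[[str, str, str, Sequence[str]], str]
# Labeler: (description, event type, roles) -> one argument role label
Labeler = Callable[[str, str, Sequence[str]], str]


def gen_earl(
    image: str,
    objects: Sequence[str],
    event_type: str,
    roles: Sequence[str],
    describe: Describer,
    assign: Labeler,
) -> Dict[str, str]:
    """Label every participant object with an event argument role."""
    predictions: Dict[str, str] = {}
    for obj in objects:
        # Stage 1: the generative VLM writes an event-centric description
        # of this object, conditioned on the image and the candidate roles.
        description = describe(image, obj, event_type, roles)
        # Stage 2: the LLM reads only text and picks one role label.
        predictions[obj] = assign(description, event_type, roles)
    return predictions
```

Because the second stage sees only text, either model can be swapped for another without retraining, which is the property the abstract refers to as training-free.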

Paper Structure (32 sections, 15 figures, 8 tables)


Figures (15)

  • Figure 1: Overview of the multimodal event argument role labeling task. The input consists of an image that depicts an Arrest event, a list of possible event argument roles (Agent, Person, Instrument), and three participant objects marked by 'red', 'blue', and 'yellow' bounding boxes. The task is to assign an event argument role to each object based on its role in the depicted event. Here, the object in the 'blue' bounding box plays the role of the Person who gets arrested, whereas the objects in the 'red' and 'yellow' bounding boxes play the role of the Agent performing the arrest.
  • Figure 2: Overview of the GenEARL framework for multimodal event argument role labeling. It comprises two stages of generative prompting. In the first stage, a generative vision-language model like LLaVA is prompted with the multimodal input prompt to generate an event-centric object role description. In the second stage, we extract the argument role label based on the generated object description using a large language model like ChatGPT. A: The raw input instance data is converted into a multimodal prompt for GVLM. B: The input prompt is fed to the GVLM for object role description generation. C: The generated object role description is embedded in the template used to prompt the LLM. GVLM is also used to generate a caption for the image that depicts the event. D: The event details including the possible argument role labels are added in the LLM prompt.
  • Figure 3: Human assessment of the quality of the generated object role descriptions from the LLaVA model.
  • Figure 4: Multimodal input template used to prompt the GVLM for event argument role labeling. Here, the model is provided with the image, the event, and the possible argument role labels (and their definitions). We add 'Other' to the list of possible argument role labels in case the model does not prefer any of the existing event argument role labels. We get 256 visual tokens for the image or the object by projecting the raw input into the vision embedding space using the visual processing module of the GVLM.
  • Figure 5: Qualitative example of predictions by the GenEARL framework in the three-shot setting.
  • ...and 10 more figures
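
The prompt assembly outlined in the captions of Figures 2 and 4 might look roughly like the sketch below. The exact template wording is not given in this excerpt, so every string here is an assumption, as are the function names `build_gvlm_prompt` and `build_llm_prompt` and the role definitions in the usage note. Only two details come from the captions: 'Other' is appended to the role options shown to the GVLM, and the LLM prompt combines the image caption, the generated object description, and the event details with the possible role labels.

```python
from typing import Mapping, Sequence

OTHER_LABEL = "Other"  # fallback option named in the Figure 4 caption


def build_gvlm_prompt(event_type: str, role_definitions: Mapping[str, str]) -> str:
    """Assemble the text part of the GVLM prompt (wording is hypothetical)."""
    lines = [f"Event: {event_type}", "Possible argument roles:"]
    for role, definition in role_definitions.items():
        lines.append(f"- {role}: {definition}")
    # Give the model an explicit way out when no listed role fits.
    lines.append(f"- {OTHER_LABEL}: none of the roles above fits the object")
    lines.append("Describe the role the highlighted object plays in the event.")
    return "\n".join(lines)


def build_llm_prompt(
    caption: str, description: str, event_type: str, roles: Sequence[str]
) -> str:
    """Embed the GVLM outputs and event details in a text-only LLM prompt."""
    options = ", ".join(roles)
    return (
        f"Image caption: {caption}\n"
        f"Object description: {description}\n"
        f"Event: {event_type}\n"
        f"Possible argument roles: {options}\n"
        "Answer with exactly one role."
    )
```

For the Arrest example of Figure 1, calling `build_gvlm_prompt("Arrest", {"Agent": "the one making the arrest", "Person": "the one being arrested", "Instrument": "the tool used"})` yields a five-option list ending in 'Other'; the three definitions are illustrative placeholders, not the ones used in the paper.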