Grounding Partially-Defined Events in Multimodal Data

Kate Sanders, Reno Kriz, David Etter, Hannah Recknor, Alexander Martin, Cameron Carpenter, Jingyang Lin, Benjamin Van Durme

TL;DR

This work introduces a multimodal formulation for partially-defined events, casts the extraction of these events as a three-stage span retrieval task, and proposes a collection of LLM-driven approaches to multimodal event analysis.

Abstract

How are we able to learn about complex current events just from short snippets of video? While natural language enables straightforward ways to represent under-specified, partially observable events, visual data does not facilitate analogous methods and, consequently, introduces unique challenges in event understanding. With the growing prevalence of vision-capable AI agents, these systems must be able to model events from collections of unstructured video data. To tackle robust event modeling in multimodal settings, we introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task. We propose a corresponding benchmark for this task, MultiVENT-G, which consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities. We propose a collection of LLM-driven approaches to the task of multimodal event analysis, and evaluate them on MultiVENT-G. Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.

Paper Structure (51 sections, 3 equations, 13 figures, 10 tables)

Figures (13)

  • Figure 1: In MultiVENT-G every video-text pair is labeled with (a) an event template that guides the annotations, and annotations identifying entities that help fill these template roles at the (b) text, (c) temporal (video time stamps), and (d) spatial levels. Spatial annotations are also paired with (e) a natural language description of the visual content (or transcription of the readable text), a flag for whether the entity is natural language within the video, and confidence for how sure the annotator is that the entity relates to the event role in question.
  • Figure 2: An illustration demonstrating the relationship between a pre-specified event and a collection of videos. Each role can simultaneously be defined by a set of event roles and by a set of sub-events. Some subset of the sub-events characterize the role fillers, and some subset of the events depicted in a given video depict sub-events of the event.
  • Figure 3: Example video-description pairs depicting the Notre Dame Cathedral fire pulled from MultiVENT-G. These videos illustrate how a single event can be described differently by different video clips: (A) provides the same semantic and temporal information as the reference video, but from a different perspective. (B) shows different semantic details, such as the fire trucks pictured at the bottom of the video frame. (C) was taken later than the reference video, showing a separate temporal snippet from the event in question with changed event semantics.
  • Figure 4: The distribution, using a logarithmic scale, of spatial entity labels (post-data cleaning) in MultiVENT-G. The labels reflect the domain of annotated content, e.g., as many videos depict emergencies, "police" and "fire" labels are common.
  • Figure 5: The complete partially-defined event understanding task, broken down into three stages. Stage 1 retrieves relevant text spans, stage 2 retrieves relevant temporal spans, and stage 3 retrieves relevant spatial spans. The output of computing these sub-tasks in sequence is a filled event template using both textual and spatio-temporal evidence from the video-language data.
  • ...and 8 more figures
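
The annotation layers in the Figure 1 caption and the three stages in the Figure 5 caption can be pictured as a small data model. The following is a minimal sketch, not the authors' implementation: every class, field, and function name is hypothetical, the bounding-box layout and the numeric confidence scale are assumptions, and feeding the stage 2 spans into stage 3 is one assumed reading of "in sequence".

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class TextSpan:
    role: str    # event template role this span helps fill
    start: int   # character offsets into the text document (assumed)
    end: int


@dataclass
class TemporalSpan:
    role: str
    start_s: float  # video time stamps, in seconds (assumed unit)
    end_s: float


@dataclass
class SpatialSpan:
    role: str
    timestamp_s: float
    box: Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max (assumed)
    description: str        # description of visual content, or transcribed text
    is_text_in_video: bool  # flag: the entity is natural language within the video
    confidence: float       # annotator confidence; the scale is an assumption


@dataclass
class FilledTemplate:
    text: List[TextSpan] = field(default_factory=list)
    temporal: List[TemporalSpan] = field(default_factory=list)
    spatial: List[SpatialSpan] = field(default_factory=list)


def fill_template(
    roles: List[str],
    document: str,
    video: Any,
    text_stage: Callable[[List[str], str], List[TextSpan]],
    temporal_stage: Callable[[List[str], Any], List[TemporalSpan]],
    spatial_stage: Callable[[List[TemporalSpan], Any], List[SpatialSpan]],
) -> FilledTemplate:
    """Run the three retrieval stages in order and collect their spans."""
    text_spans = text_stage(roles, document)         # stage 1: text spans
    temporal_spans = temporal_stage(roles, video)    # stage 2: temporal spans
    # Assumption: stage 3 searches inside the spans returned by stage 2.
    spatial_spans = spatial_stage(temporal_spans, video)
    return FilledTemplate(text_spans, temporal_spans, spatial_spans)
```

Each stage is passed in as a callable so that a rule-based baseline and an LLM-driven retriever can be swapped without changing the template structure.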