Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time

Mihai Masala, Marius Leordeanu

TL;DR

The paper tackles zero-shot, explainable video description by introducing the Graph of Events in Space and Time (GEST) as an explicit, grounded intermediate representation. It builds a GEST from frame-level vision cues using pre-trained detectors, converts the graph into a proto-language, and then uses large language models to refine that proto-language into fluent natural language, making the process two-stage. The approach emphasizes explainability through explicit spatio-temporal relations and event-level reasoning. It is evaluated on five diverse datasets using both standard captioning metrics and LLM-based ranking (LLM-as-a-Jury), including a combined GEST+VidIL setup that often achieves top performance. The results demonstrate strong zero-shot capabilities, particularly on complex, multi-actor datasets like Videos-to-Paragraphs, and suggest broad applicability to grounded video understanding tasks such as surveillance.

Abstract

In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, and action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time, built in an explainable and programmatic way, to connect learning-based state-of-the-art vision and language models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g., BLEU, ROUGE) and the modern LLM-as-a-Jury approach.

Paper Structure (14 sections, 6 figures, 6 tables)

This paper contains 14 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: An overview of our approach. Starting from a raw video, we perform object detection and tracking, action detection, semantic segmentation and depth estimation. We aggregate this information to build the corresponding Graph of Events in Space and Time. By reasoning over this graph (e.g., temporally and spatially sorting the graph, describing the events), we build an intermediate representation in the form of a proto-language. We prompt existing LLMs to take this proto-language and transform it into a fully fledged natural, rich and accurate textual description. Furthermore, trusting LLMs with enough power to alter certain parts of the events (e.g., a misidentified object) and learning from this process allows us to update the graph in order to obtain a more context-aware and accurate representation.
  • Figure 2: A complete example of our proposed pipeline. Starting from the video, we automatically build the associated GEST. From this graph, we build the proto-language that is then fed to an LLM that generates the final textual description.
  • Figure 3: On the left, an example of extracted events in space and time, with start and end frames. On the right, a high-level representation of the algorithm used for building the proto-language.
  • Figure 4: The prompt used for generating the final text description.
  • Figure 5: Video duration statistics per dataset.
  • ...and 1 more figure
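
The captions for Figures 1 and 3 describe events with start and end frames that are sorted temporally and then described to form the proto-language. The following is a minimal sketch of that idea, not the authors' algorithm: the `Event` schema, the `to_proto_language` helper, and the "then"/"meanwhile" connectives are all hypothetical, chosen only to illustrate how frame intervals could drive the ordering and the temporal wording.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    """One event node grounded in a frame interval (hypothetical schema)."""
    actor: str
    action: str
    start_frame: int
    end_frame: int
    obj: Optional[str] = None


def temporal_relation(prev: Event, curr: Event) -> str:
    """Choose a connective from two frame intervals (assumed wording)."""
    if curr.start_frame > prev.end_frame:
        return "then"       # curr starts after prev has ended
    return "meanwhile"      # the two intervals overlap


def to_proto_language(events: List[Event]) -> str:
    """Sort events by start frame and emit one terse clause per event."""
    ordered = sorted(events, key=lambda e: (e.start_frame, e.end_frame))
    clauses = []
    prev = None
    for ev in ordered:
        clause = f"{ev.actor} {ev.action}"
        if ev.obj:
            clause += f" {ev.obj}"
        if prev is not None:
            clause = f"{temporal_relation(prev, ev)} {clause}"
        clauses.append(clause + ".")
        prev = ev
    return " ".join(clauses)


events = [
    Event("person_1", "picks up", 40, 70, "cup"),
    Event("person_1", "walks", 0, 35),
    Event("person_2", "sits", 60, 120),
]
proto = to_proto_language(events)
print(proto)
```

In a pipeline like the one described, a string of this kind would be passed to an LLM prompt for rewriting into fluent text. The sketch covers only temporal ordering; spatial relations and the graph edges themselves are left out.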