Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time
Mihai Masala, Marius Leordeanu
TL;DR
The paper tackles zero-shot, explainable video description by introducing the Graph of Events in Space and Time (GEST) as an explicit, grounded intermediate representation. It builds a GEST from frame-level vision cues using pre-trained detectors, converts the graph into a proto-language, and then refines that proto-language into fluent natural language with large language models in a two-stage process. Explainability comes from the explicit spatio-temporal relations and event-level reasoning encoded in the graph. The approach is evaluated on five diverse datasets using both standard captioning metrics and LLM-based ranking (LLM-as-a-Jury), including a combined GEST+VidIL setup that often achieves top performance. The results show strong zero-shot capabilities, particularly on complex, multi-actor datasets such as Videos-to-Paragraphs, and suggest applicability to grounded video understanding tasks such as surveillance.
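To make the graph-to-proto-language step concrete, the following is a minimal sketch, not the authors' implementation. It assumes events have already been extracted by upstream detectors, and every name in it (`Event`, `temporal_relation`, `build_edges`, `to_proto_language`), along with the two relation labels and the connector phrases, is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical event node: who did what, to what, and over which frames."""
    actor: str
    action: str
    start: int  # first frame of the event
    end: int    # last frame of the event
    obj: Optional[str] = None

    def phrase(self) -> str:
        parts = [self.actor, self.action] + ([self.obj] if self.obj else [])
        return " ".join(parts)


def temporal_relation(a: Event, b: Event) -> str:
    """Label the edge from a to b, assuming a starts no later than b."""
    return "same time" if b.start <= a.end else "next"


def build_edges(events: List[Event]) -> List[Tuple[Event, str, Event]]:
    """Connect consecutive events (ordered by start frame) with temporal edges."""
    ordered = sorted(events, key=lambda e: (e.start, e.end))
    return [(a, temporal_relation(a, b), b) for a, b in zip(ordered, ordered[1:])]


def to_proto_language(events: List[Event]) -> str:
    """Walk the graph in temporal order and emit a rough, ungrammatical description."""
    ordered = sorted(events, key=lambda e: (e.start, e.end))
    if not ordered:
        return ""
    sentences = [ordered[0].phrase() + "."]
    for _, relation, b in build_edges(ordered):
        connector = "At the same time, " if relation == "same time" else "Then, "
        sentences.append(connector + b.phrase() + ".")
    return " ".join(sentences)


events = [
    Event("person1", "opens", 0, 10, "door"),
    Event("person2", "sits on", 5, 20, "chair"),
    Event("person1", "leaves", 25, 30),
]
proto = to_proto_language(events)
print(proto)
```

In the pipeline described above, text like `proto` would then be passed to a large language model for refinement into fluent prose; that step is omitted here because it depends on the model and prompts used.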
Abstract
In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, and action and object recognition, among many others. While these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time. This common ground is explainable and programmatic, connects learning-based state-of-the-art vision and language models, and provides a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions of videos collected from a variety of datasets, using both standard metrics (e.g., BLEU, ROUGE) and the modern LLM-as-a-Jury approach.
