VideoGEM: Training-free Action Grounding in Videos

Felix Vogel, Walid Bousselham, Anna Kukleva, Nina Shvetsova, Hilde Kuehne

TL;DR

VideoGEM addresses the challenge of zero-shot spatial action grounding in videos by leveraging training-free vision-language backbones through a video-adapted GEM framework. It introduces layer weighting to emphasize higher-level action concepts and prompt decomposition to reduce object bias, combining verb, object, and action prompts to produce robust localization heatmaps. The method, evaluated on CLIP, OpenCLIP, and ViCLIP across four datasets, consistently outperforms trained state-of-the-art approaches and demonstrates the benefit of both static and dynamic layer weighting as well as prompt decomposition. This work enables practical, training-free action grounding with broad backbone compatibility, highlighting the potential of layer-aware self-attention and modular prompts for complex video understanding.

Abstract

Vision-language foundation models have shown impressive capabilities across various zero-shot tasks, including training-free localization and grounding, primarily focusing on localizing objects in images. However, leveraging those capabilities to localize actions and events in videos is challenging, as actions have less physical outline and are usually described by higher-level concepts. In this work, we propose VideoGEM, the first training-free spatial action grounding method based on pretrained image- and video-language backbones. Namely, we adapt the self-self attention formulation of GEM to spatial activity grounding. We observe that high-level semantic concepts, such as actions, usually emerge in the higher layers of the image- and video-language models. We, therefore, propose a layer weighting in the self-attention path to prioritize higher layers. Additionally, we introduce a dynamic weighting method to automatically tune layer weights to capture each layer's relevance to a specific prompt. Finally, we introduce a prompt decomposition, processing action, verb, and object prompts separately, resulting in a better spatial localization of actions. We evaluate the proposed approach on three image- and video-language backbones, CLIP, OpenCLIP, and ViCLIP, and on four video grounding datasets, V-HICO, DALY, YouCook-Interactions, and GroundingYouTube, showing that the proposed training-free approach is able to outperform current trained state-of-the-art approaches for spatial video grounding.
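
The layer weighting mentioned in the abstract can be pictured as a weighted sum over per-layer outputs of the self-attention path. The snippet below is only a minimal sketch of that idea, not the authors' implementation: the function name `weighted_layer_sum`, the toy two-dimensional "token" vectors, and the specific weight values are all assumptions chosen for illustration, and the dynamic (prompt-dependent) weights are left out because the abstract does not specify how they are computed.

```python
def weighted_layer_sum(layer_outputs, weights):
    """Combine per-layer feature vectors into one vector via a weighted sum.

    layer_outputs: list of equal-length feature vectors, ordered from the
                   lowest layer to the highest layer (hypothetical layout).
    weights:       one scalar weight per layer (hypothetical values).
    """
    if len(layer_outputs) != len(weights):
        raise ValueError("need exactly one weight per layer output")
    combined = [0.0] * len(layer_outputs[0])
    for output, weight in zip(layer_outputs, weights):
        for i, value in enumerate(output):
            combined[i] += weight * value
    return combined


# Toy example: three layers, each producing a 2-dimensional feature vector.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Assumed static weights that grow with depth to prioritize higher layers.
prioritized = weighted_layer_sum(outputs, [0.5, 1.0, 2.0])

# Setting every weight to 1 gives a plain, unweighted sum.
unweighted = weighted_layer_sum(outputs, [1.0, 1.0, 1.0])
```

With larger weights on the later entries, the contribution of the higher layers dominates the combined representation, which is the effect the abstract attributes to layer weighting.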

Paper Structure

This paper contains 23 sections, 13 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Prompt decomposition and combination. First, we decompose the action query into verb, object, and action prompts. For each component, we then predict locations corresponding to the highest values on the heatmaps (red-high, blue-low). To determine the final location (white star), we calculate the center point of these individual predictions. The red, blue, and yellow stars represent the predicted locations for the action, object, and verb prompts, respectively, while the dark green bounding box represents the annotated ground truth.
  • Figure 2: Left: VideoGEM pipeline. VideoGEM takes a video and its corresponding narration as input. Our Weighted GEM processes the input video alongside the vision transformer to generate the representative patch tokens. The input narration is decomposed into a verb prompt, an object prompt, and an action prompt (see the method section on prompt decomposition for details), which are passed through the text encoder to obtain three [EOS] tokens. Three heatmaps are then calculated as the similarity between the patch tokens and the respective [EOS] tokens. We aggregate the heatmaps into one final prediction by taking the center of the individual predicted locations. Right: Layer weighting. In our Weighted GEM architecture, we apply a combination of static and dynamic weights. Dynamic weights are applied to the last $D$ layers, while static weights are applied to the last $K$ layers, with $K > D$. Additionally, the attention map $X^{L-K}$ is weighted by a corresponding static weight $w_s^{L-K}$. All weighted outputs from the self-attention blocks are then summed with the weighted $X^{L-K}$ attention map to produce representative patch tokens. The output patch tokens of Weighted GEM are used for the similarity calculation with the text, resulting in an attention heatmap.
  • Figure 3: Importance of GEM layers. Accuracy of GEM when a single layer is removed. The x-axis gives the index of the removed layer, from $1$ (the final GEM layer) down to $8$ (the initial self-attention input to GEM).
  • Figure 4: Influence of the number of GEM layers. Up to seven layers are added to GEM, starting with a self-self attention layer for the final Transformer block. With zero layers, the output equals the output of the backbone without GEM.
  • Figure 5: Comparison of GEM and our proposed weighting mechanism. The proposed weighting mechanism is illustrated on the left. Static weights and dynamic weights can be applied independently of each other. While static weights can be set heuristically or via hyperparameter search based on the general pipeline performance, dynamic weights adapt to the importance of the different transformer layers individually and with respect to each prompt, as described in the method section on layer weighting. Standard GEM does not use any weights, which is equivalent to setting all weights to $1$ in our formulation.
  • ...and 2 more figures
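
The combination step described in the Figure 1 caption (take the highest-valued location on each prompt's heatmap, then use the center point of those locations) can be sketched as follows. This is a hedged illustration under stated assumptions rather than the paper's code: the helper names `peak_location` and `combine_predictions` are hypothetical, heatmaps are represented as plain nested lists, ties are broken by taking the first maximum, and the toy 4x4 heatmaps are made up.

```python
def peak_location(heatmap):
    """Return (row, col) of the highest value in a 2D heatmap (nested lists).

    Ties are resolved in favor of the first maximum in row-major order.
    """
    best = None
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if best is None or value > best[0]:
                best = (value, r, c)
    return best[1], best[2]


def combine_predictions(heatmaps):
    """Average the per-prompt peak locations into one center point."""
    peaks = [peak_location(h) for h in heatmaps]
    n = len(peaks)
    return (sum(p[0] for p in peaks) / n, sum(p[1] for p in peaks) / n)


def toy_heatmap(size, peak_row, peak_col):
    """Build a made-up heatmap that is zero everywhere except one peak."""
    grid = [[0.0] * size for _ in range(size)]
    grid[peak_row][peak_col] = 1.0
    return grid


# Toy heatmaps standing in for the verb, object, and action prompts.
verb_map = toy_heatmap(4, 0, 0)
object_map = toy_heatmap(4, 3, 0)
action_map = toy_heatmap(4, 3, 3)

final_point = combine_predictions([verb_map, object_map, action_map])
```

Because the final point is a mean of the individual peaks, a single prompt whose heatmap peaks on the wrong region shifts the prediction only partially, which is one plausible reading of why the caption combines three prompts instead of relying on the action prompt alone.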