Open-Vocabulary Action Localization with Iterative Visual Prompting

Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

TL;DR

This work tackles open-vocabulary video action localization without any training data by extending a prompting-based framework (PIVOT) to temporal localization. It introduces Temporal PIVOT (T-PIVOT), which iteratively samples frames, tiles them with frame indices into a single image, and uses a vision-language model to identify frames near action boundaries before progressively narrowing the search window. The approach enables free-text queries to specify actions and demonstrates competitive zero-shot performance on cooking and benchmark datasets, along with thorough ablations on iterations, grid tiling, and prompting strategies. The method offers a practical, adaptable tool for rapid video labeling and robotics training, albeit with latency and sensitivity to VLM quality and prompting choices that warrant further optimization.

Abstract

Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/
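
The iterative narrowing described in the abstract can be sketched as a short loop. This is a minimal illustration, not the authors' released code: the function names, the `select_frame` callback (a stand-in for the VLM query), and the defaults for `n_frames`, `n_iters`, and `shrink` are all assumptions made for the example.

```python
from typing import Callable, List


def sample_times(start: float, end: float, n: int) -> List[float]:
    """Return n evenly spaced timestamps covering [start, end]."""
    if n < 2:
        return [(start + end) / 2.0]
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


def localize_boundary(
    duration: float,
    select_frame: Callable[[List[float]], int],
    n_frames: int = 16,   # assumed number of frames per tiled image
    n_iters: int = 4,     # assumed number of refinement rounds
    shrink: float = 0.5,  # assumed factor by which the window narrows
) -> float:
    """Estimate one action boundary (start or end) in seconds.

    select_frame is a hypothetical callback: given the sampled
    timestamps, it returns the index of the frame judged closest to
    the boundary. In the paper this role is played by a VLM looking
    at the tiled image.
    """
    start, end = 0.0, duration  # first iteration covers the whole video
    estimate = duration / 2.0
    for _ in range(n_iters):
        times = sample_times(start, end, n_frames)
        estimate = times[select_frame(times)]
        # Re-center a narrower window on the selected frame.
        half = (end - start) * shrink / 2.0
        start = max(0.0, estimate - half)
        end = min(duration, estimate + half)
    return estimate
```

Running `localize_boundary` once with a start-of-action query and once with an end-of-action query would give the two boundaries. The achievable precision is limited by the final sampling interval, which depends on the assumed `n_frames`, `n_iters`, and `shrink`.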

Paper Structure

This paper contains 18 sections, 1 equation, 5 figures, 6 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Open-vocabulary video action localization aims to find the start and end times of an action specified by open-vocabulary free-text queries. We propose a training-free approach that leverages an off-the-shelf vision-language model (e.g., OpenAI's GPT-4o).
  • Figure 2: The proposed pipeline for open-vocabulary video action localization using a VLM consists of the following steps: (a) Frames are sampled at regular intervals from a time window, covering the entire video during the first iteration. (b) The sampled frames are then tiled in an image with annotations indicating the time order of the frames. (c) This image is then fed into a VLM to identify the frames closest to a specific timing of an action (e.g., the start time of an action). (d) The sampling window is updated by centering on the selected frame with a narrower sampling interval. Bottom panel (1) For general action localization, the start time of the action in the video is determined by iterating steps (a) to (d). Bottom panel (2) By estimating the end time of the action in the same manner, the action is localized in the video.
  • Figure 3: Qualitative results of open-vocabulary video action localization on a cooking video.
  • Figure 4: Different types of visual prompting styles tested in this study.
  • Figure 5: Action localization performance plotted against the number of action steps and the video length. (a) Breakfast Dataset and (b) Fine-grained Breakfast Dataset.
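
Step (b) of Figure 2, tiling the sampled frames into one image with time-order annotations, comes down to a grid layout. The sketch below computes only the layout (label and top-left pixel offset for each tile) and leaves the actual drawing to an imaging library. The near-square grid, the row-major ordering, and the 1-based labels are assumptions for illustration. The paper studies grid tiling in its ablations, and those settings may differ.

```python
import math
from typing import List, Tuple


def grid_shape(n_frames: int) -> Tuple[int, int]:
    """Return (rows, cols) of a near-square grid that fits n_frames tiles."""
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    cols = math.ceil(math.sqrt(n_frames))
    rows = math.ceil(n_frames / cols)
    return rows, cols


def tile_layout(n_frames: int, tile_w: int, tile_h: int) -> List[Tuple[int, int, int]]:
    """Return (label, x, y) for each frame in time order.

    Frames fill the grid row by row, so reading the labels left to
    right and top to bottom follows the temporal order. Labels start
    at 1 (an assumption; any consistent index scheme would do).
    """
    _, cols = grid_shape(n_frames)
    layout = []
    for i in range(n_frames):
        row, col = divmod(i, cols)
        layout.append((i + 1, col * tile_w, row * tile_h))
    return layout
```

Each `(label, x, y)` entry says where to paste a resized frame and which index to draw on it. The index the VLM returns can then be mapped back to a timestamp through the sampling order.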