Open-Vocabulary Action Localization with Iterative Visual Prompting
Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
TL;DR
This work tackles open-vocabulary video action localization without any training data by extending a prompting-based framework (PIVOT) to temporal localization. It introduces Temporal PIVOT (T-PIVOT), which iteratively samples frames, tiles them into a single image with frame-index labels, and uses a vision-language model (VLM) to identify the frames closest to the action boundaries before progressively narrowing the search window. The approach lets users specify actions with free-text queries and demonstrates competitive zero-shot performance on cooking and benchmark datasets, supported by ablations on the number of iterations, grid tiling, and prompting strategies. The method offers a practical, adaptable tool for rapid video labeling and robotics training, although its latency and its sensitivity to VLM quality and prompting choices warrant further optimization.
Abstract
Video action localization aims to find when specific actions occur in a long video. Although existing learning-based approaches have been successful, they require annotated videos, which come at a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization methods. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/
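The iterative narrowing described above can be illustrated with a minimal sketch. This is not the released implementation: the function names (`sample_indices`, `localize_boundary`), the callable `pick_frame` that stands in for the VLM query, and the default values for `n_frames`, `n_iters`, and `shrink` are all hypothetical choices made for illustration. Tiling the sampled frames into one labeled image and calling a real VLM are omitted; `pick_frame` is assumed to receive the sampled frame indices and return the one the model selects.

```python
from typing import Callable, List


def sample_indices(center: float, width: float, n_frames: int, total: int) -> List[int]:
    """Evenly sample frame indices from a window centered on `center`.

    The window spans `width` frames and is clipped to [0, total - 1].
    """
    lo = max(0.0, center - width / 2)
    hi = min(total - 1.0, center + width / 2)
    if n_frames == 1:
        return [round((lo + hi) / 2)]
    step = (hi - lo) / (n_frames - 1)
    return [round(lo + i * step) for i in range(n_frames)]


def localize_boundary(
    total: int,
    pick_frame: Callable[[List[int]], int],
    n_frames: int = 16,    # assumed number of frames tiled per query
    n_iters: int = 4,      # assumed number of narrowing iterations
    shrink: float = 0.5,   # assumed window shrink factor per iteration
) -> int:
    """Estimate one action boundary (start or end) by iterative narrowing.

    `pick_frame` is a placeholder for the VLM: given candidate frame
    indices, it returns the index judged closest to the boundary.
    """
    center = (total - 1) / 2      # start with the whole video in view
    width = float(total - 1)
    best = round(center)
    for _ in range(n_iters):
        candidates = sample_indices(center, width, n_frames, total)
        best = pick_frame(candidates)   # in practice: tile, prompt, parse reply
        center = float(best)            # re-center on the selected frame
        width *= shrink                 # narrow the sampling window
    return best
```

Under these assumptions, the procedure would be run twice per query, once with a picker that asks for the start of the action and once with a picker that asks for its end. Each iteration shrinks the spacing between sampled frames, so the temporal resolution of the estimate improves even though the VLM sees only `n_frames` frames per query.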
