Training-Free Action Recognition and Goal Inference with Dynamic Frame Selection

Ee Yeo Keat, Zhang Hao, Alexander Matyasko, Basura Fernando

TL;DR

VidTFS presents a training-free, open-vocabulary framework for video goal inference and action recognition by coupling frozen vision models (BLIP-2, CLIP) with an open-vocabulary LLM (Vicuna) in a four-stage See–Guess–Select–Infer pipeline. A novel dynamic frame selection module (evidence selector) uses CLIP to align hypothesized steps with visual frames, restricting processing to a small, informative subset (M ≤ 16). The method achieves competitive to state-of-the-art results across four datasets (CrossTask, COIN, UCF101, ActivityNet) without task-specific training, and ablations validate the effectiveness of frame selection, hypothesis expansion, and CLIP-based evidence matching. While promising in training-free settings, VidTFS inherits LLM drawbacks such as potential hallucinations and limited explainability, suggesting directions for improved controllability and interpretability in open-vocabulary video reasoning.
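The evidence-selection idea in the summary above can be illustrated with a small sketch. It assumes that frame and step embeddings (for example, from CLIP's image and text encoders) have already been computed and are passed in as plain lists of floats. The scoring rule (each frame keeps its best cosine similarity over all hypothesized steps, and the top-scoring frames are retained) and the names `cosine` and `select_frames` are illustrative assumptions, not the paper's confirmed procedure.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_frames(
    frame_embs: Sequence[Sequence[float]],
    step_embs: Sequence[Sequence[float]],
    max_frames: int = 16,  # the summary states M <= 16
) -> List[int]:
    """Return indices of at most `max_frames` frames, in temporal order.

    Assumed rule (hypothetical): a frame's score is its highest cosine
    similarity to any hypothesized step embedding.
    """
    if not frame_embs or not step_embs:
        return []
    scores = [max(cosine(f, s) for s in step_embs) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so later stages see frames in sequence.
    return sorted(ranked[:max_frames])


# Toy usage with 2-D stand-ins for real embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
steps = [[1.0, 0.0]]
print(select_frames(frames, steps, max_frames=2))
```

Returning the indices in temporal order, rather than in score order, is a design choice of this sketch: it keeps the selected subset usable as a shortened video for the later stages.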

Abstract

We introduce VidTFS, a Training-free, open-vocabulary video goal and action inference framework that combines a frozen vision foundation model (VFM) and a large language model (LLM) with a novel dynamic Frame Selection module. Our experiments demonstrate that the proposed frame selection module significantly improves the performance of the framework. We validate VidTFS on four widely used video datasets (CrossTask, COIN, UCF101, and ActivityNet), covering goal inference and action recognition tasks under open-vocabulary settings without any training or fine-tuning. The results show that VidTFS outperforms pretrained and instruction-tuned multimodal language models that directly stack an LLM and a VFM for downstream video inference tasks. The adaptability of VidTFS suggests potential for generalizing to new training-free video inference tasks.

Paper Structure (41 sections, 5 equations, 7 figures, 17 tables)


Figures (7)

  • Figure 1: VidTFS contains four stages: See, Guess, Select, and Infer. (1) See: a visual descriptor (i.e., BLIP-2) translates visual frames into dense textual descriptions. (2) Guess: an LLM generates hypotheses ($\mathcal{H}$) and their corresponding sub-events (steps). (3) Select: CLIP-based frame selection filters out irrelevant frames. (4) Infer: the final answer is produced by running the "see" and "guess" process again on the selected frames. Best viewed full screen on a computer.
  • Figure 2: Qualitative example of goal inference on a CrossTask video. More qualitative examples are provided in the supplementary material.
  • Figure 3: Prompt for Llama3 to judge correctness between the generated inferences and ground truth.
  • Figure 4: Qualitative example of goal inference by the VidTFS (V13B) framework on a CrossTask video ($\rho$ = 50%). It shows how the evidence selector's frame selection leads to better hypotheses and a better final inference: "Cooking Steaks on a Grill" versus the ground truth "Grill Steak" (an SBERT score of 86.3). After frame selection, the selected frames are more relevant to the grill with charcoal and the steak.
  • Figure 5: Qualitative example of goal inference by the VidTFS (V13B) framework on a CrossTask video ($\rho$ = 50%). The initially sampled frames showing a bearded man are filtered out by frame selection because they are not relevant to the goal. The inference also shifts from salad alone to taco salad once the frames are matched with the hypothesized steps, which include taco- or nacho-related steps.
  • ...and 2 more figures
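
The four stages named in the Figure 1 caption can be read as a simple control flow. The sketch below is only an outline under stated assumptions: the function `see_guess_select_infer` and its callable parameters (`describe`, `guess_steps`, `select`, `infer`) are hypothetical placeholders standing in for the visual descriptor, the LLM, and the CLIP-based selector, and do not reproduce the authors' implementation or prompts.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def see_guess_select_infer(
    frames: Sequence[Frame],
    describe: Callable[[Frame], str],                       # "See" (e.g. a captioner)
    guess_steps: Callable[[List[str]], List[str]],          # "Guess" (e.g. an LLM)
    select: Callable[[Sequence[Frame], List[str]], List[int]],  # "Select"
    infer: Callable[[List[str], List[str]], str],           # "Infer" (e.g. an LLM)
) -> str:
    """Run the four stages once; every callable is a hypothetical stand-in."""
    # See: turn every sampled frame into a textual description.
    descriptions = [describe(frame) for frame in frames]
    # Guess: hypothesize steps (sub-events) from the descriptions.
    steps = guess_steps(descriptions)
    # Select: keep only the frames that match the hypothesized steps.
    kept_frames = [frames[i] for i in select(frames, steps)]
    # Infer: repeat "see" and "guess" on the selected frames, then answer.
    kept_descriptions = [describe(frame) for frame in kept_frames]
    refined_steps = guess_steps(kept_descriptions)
    return infer(kept_descriptions, refined_steps)


# Toy usage with string "frames" and trivial stand-in callables.
answer = see_guess_select_infer(
    frames=["grill", "beard", "steak"],
    describe=lambda frame: f"a photo of {frame}",
    guess_steps=lambda descs: [d.replace("a photo of ", "use ") for d in descs],
    select=lambda frs, steps: [i for i, f in enumerate(frs) if f != "beard"],
    infer=lambda descs, steps: " + ".join(steps),
)
print(answer)
```

Passing the stages in as callables keeps the outline independent of any particular model, which matches the training-free framing: frozen components are swapped in without changing the control flow.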