TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification

Rui Yan, Jin Wang, Hongyu Qu, Xiaoyu Du, Dong Zhang, Jinhui Tang, Tieniu Tan

TL;DR

TEST-V addresses the modality gap in zero-shot video classification by integrating test-time, learnable support-set tuning. It combines Multi-prompting Support-set Dilation (MSD) to create semantically diverse support videos with Temporal-aware Support-set Erosion (TSE) to dynamically weight frames and scales, all in a training-free setup. The approach achieves state-of-the-art results on four benchmarks and demonstrates strong generalization across VLM backbones, with ablations showing the complementary gains from diversity and temporal refinement. This framework offers practical, interpretable mechanisms for adapting pre-trained VLMs to unseen video classes in real-world scenarios.

Abstract

Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification, either by tuning the class embedding with a few prompts (Test-time Prompt Tuning, TPT) or by replacing class names with generated visual samples (a support-set), has shown promising results. However, TPT cannot avoid the semantic gap between modalities, while the support-set cannot be tuned. To this end, we combine the strengths of both and propose a novel framework, TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class using multiple prompts obtained by querying LLMs, which enriches the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to temporal prediction consistency, in a self-supervised manner, to mine pivotal supporting cues for each class. TEST-V achieves state-of-the-art results across four benchmarks and offers good interpretability for the support-set dilation and erosion.
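The support-set idea in the abstract can be illustrated with a minimal sketch. It assumes a support set of K = m × n videos per class (m prompts, n generated videos per prompt, as in the Figure 3 caption) and classifies a test video by its mean cosine similarity to each class's support features. The function names, the toy features built around one-hot class directions, and the temperature value are all hypothetical stand-ins: in the paper the features come from a VLM visual encoder applied to videos generated by a text-to-video model, and the actual aggregation rule may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def build_support_set(num_classes, m, n, dim, rng, noise=0.1):
    """Toy stand-in for MSD: m prompts per class, n videos per prompt.

    Returns features of shape (C, K, D) with K = m * n. Real features would
    come from a VLM encoder; here each class is a noisy one-hot direction.
    """
    centers = np.eye(num_classes, dim)
    feats = centers[:, None, :] + noise * rng.normal(size=(num_classes, m * n, dim))
    return l2_normalize(feats)

def classify(test_feat, support, temperature=0.01):
    """Class probabilities from mean cosine similarity to each class's support videos."""
    sims = support @ l2_normalize(test_feat)   # (C, K) cosine similarities
    logits = sims.mean(axis=1) / temperature   # (C,) one score per class
    logits = logits - logits.max()             # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
support = build_support_set(num_classes=5, m=4, n=3, dim=16, rng=rng)
# Hypothetical test video whose feature lies near class 2's direction.
test_feat = np.eye(5, 16)[2] + 0.1 * rng.normal(size=16)
probs = classify(test_feat, support)
print(support.shape, int(np.argmax(probs)))
```

Because the class score is computed from visual support features rather than text embeddings, the test video is compared with samples from its own modality, which is the motivation the abstract gives for using a support-set.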

Paper Structure

This paper contains 32 sections, 9 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Innovation of the zero-shot activity recognition framework. a) Tuning the text input via the given test video in a self-supervised manner. b) Aligning the test video with the label based on the support set, using feature similarity or predicted-distribution similarity. c) This work combines both ideas: it constructs a diverse support set and tunes it in a self-supervised manner to mine high-quality support samples.
  • Figure 2: Overview of the proposed framework TEST-V, which first dilates and then erodes the support set for zero-shot video classification. i) Multi-prompting Support-set Dilation (MSD): It builds diversified motion descriptions for each class name via the LLM and then generates video samples from these elaborate descriptions via the text-to-video generation model to construct a diverse support set. ii) Temporal-aware Support-set Erosion (TSE): Based on the visual feature of the given test video $\bm f$ and the support set $\bm F$, it applies factorized weights $\bm r_{*}$ to mine critical supporting cues from the support set and tunes the weights with prediction consistency at multiple temporal scales.
  • Figure 3: Effect of support hyper-parameters $K$ and $n$ ($K = m \times n$, as defined in the MSD equation) with a single prompt (SuS-X) and multiple prompts (our MSD) on HMDB-51 and UCF-101. Top-1 zero-shot recognition accuracy is reported.
  • Figure 4: Feature distribution of supporting samples generated with multiple prompts (MSD) and single prompt (SuS-X) on different benchmarks. Multi-prompting and single-prompting samples are shown in color and grey, respectively.
  • Figure 5: Multi-prompting Support-set Dilation (MSD).
  • ...and 5 more figures
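
The TSE step described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the factorized weights $\bm r_{*}$ are modeled as one logit vector over support videos and one over frames, the temporal scales are modeled as frame strides, the consistency objective is taken to be the entropy of the scale-averaged prediction (a TPT-style choice), and the weights are updated with finite-difference gradients so the example needs only NumPy. All names, shapes, and hyper-parameters are hypothetical.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def class_prototypes(support, r_video, r_frame):
    """Pool support features (C, K, T, D) with a factorized video x frame weight map."""
    w = np.outer(softmax(r_video), softmax(r_frame))   # (K, T), entries sum to 1
    return l2_normalize(np.einsum("kt,cktd->cd", w, support))

def multi_scale_predictions(test_frames, protos, strides=(1, 2, 4), temperature=0.1):
    """One class distribution per temporal scale (frame stride is an assumed proxy)."""
    preds = []
    for s in strides:
        clip = l2_normalize(test_frames[::s].mean(axis=0))   # (D,) pooled clip feature
        preds.append(softmax(protos @ clip / temperature))
    return np.stack(preds)                                    # (S, C)

def consistency_loss(preds, eps=1e-12):
    """Entropy of the scale-averaged prediction (assumed objective)."""
    avg = preds.mean(axis=0)
    return float(-(avg * np.log(avg + eps)).sum())

def tune(support, test_frames, steps=10, lr=1.0, fd_eps=1e-4):
    """Tune the weight logits; a step is kept only if it lowers the loss."""
    K, T = support.shape[1], support.shape[2]
    params = np.zeros(K + T)   # zero logits correspond to uniform weights

    def loss_fn(p):
        protos = class_prototypes(support, p[:K], p[K:])
        return consistency_loss(multi_scale_predictions(test_frames, protos))

    history = [loss_fn(params)]
    for _ in range(steps):
        grad = np.zeros_like(params)
        for i in range(params.size):   # central finite differences
            d = np.zeros_like(params)
            d[i] = fd_eps
            grad[i] = (loss_fn(params + d) - loss_fn(params - d)) / (2 * fd_eps)
        candidate = params - lr * grad
        if loss_fn(candidate) < history[-1]:
            params = candidate
        else:
            lr *= 0.5                  # reject the step and shrink the step size
        history.append(loss_fn(params))
    return params[:K], params[K:], history

rng = np.random.default_rng(0)
C, K, T, D = 4, 6, 8, 16
support = l2_normalize(rng.normal(size=(C, K, T, D)))   # random stand-in features
test_frames = l2_normalize(rng.normal(size=(T, D)))
r_video, r_frame, history = tune(support, test_frames)
print(history[0], history[-1])
```

In practice the gradient would come from automatic differentiation rather than finite differences. The point of the sketch is only that the tuned parameters live on the support set (K + T logits here) and are updated per test video without labels.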