VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Wencheng Zhu, Yuexin Wang, Hongxuan Li, Pengfei Zhu, Qinghua Hu

TL;DR

VTD-CLIP tackles the challenge of temporal modeling in vision-language video understanding by discretizing visual streams into text-aligned tokens, using a frozen CLIP text encoder as a semantic codebook. Each frame is quantized to its nearest text prototype $\boldsymbol{c}_k$ by hard assignment, and learnable prompts update the codebook, yielding a discrete video representation; a confidence-aware fusion then integrates the discrete and frame-level features for robust recognition. The approach preserves cross-modal generalization and zero-shot capabilities while remaining computationally efficient: on HMDB-51, UCF-101, SSv2, and Kinetics-400 it reports competitive results with lower GFLOPs than many baselines. The work highlights interpretable, text-guided discretization as a viable alternative to heavy temporal modeling in video-language tasks and provides code for reproducibility.

Abstract

Vision-language models bridge visual and linguistic understanding and have proven powerful for video recognition. Existing approaches rely primarily on parameter-efficient fine-tuning of image-text pre-trained models, yet they often suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these issues, we propose a simple yet effective video-to-text discretization framework. Our method repurposes the frozen text encoder to construct a visual codebook from video class labels, exploiting the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. This codebook transforms temporal visual data into textual tokens via feature lookups and offers interpretable video representations through explicit video modeling. To enhance robustness against irrelevant or noisy frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by assessing their semantic relevance via the codebook. Furthermore, our method incorporates learnable text prompts to update the codebook adaptively. Extensive experiments on HMDB-51, UCF-101, SSv2, and Kinetics-400 validate our approach, which achieves competitive improvements over state-of-the-art methods. The code will be publicly available at https://github.com/isxinxin/VTD-CLIP.
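
The codebook lookup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`l2_normalize`, `cosine_sim`, `discretize`), the use of cosine similarity, and the toy 3-dimensional vectors standing in for CLIP embeddings are all assumptions made for illustration. It shows only the hard-assignment step, in which each frame feature is mapped to the index of its most similar text prototype.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def discretize(frame_feats, codebook):
    """Hard-assign each frame to its nearest text prototype.

    Returns the prototype index chosen for each frame and the
    similarity score of that choice.
    """
    indices, scores = [], []
    for f in frame_feats:
        sims = [cosine_sim(f, c) for c in codebook]
        k = max(range(len(sims)), key=sims.__getitem__)
        indices.append(k)
        scores.append(sims[k])
    return indices, scores

# Toy stand-ins (assumed): two class-label embeddings and three frame features.
labels = ["archery", "bowling"]
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.2, 0.9, 0.0]]

indices, scores = discretize(frames, codebook)
tokens = [labels[k] for k in indices]
print(tokens)
```

In this toy setting the video becomes a sequence of label tokens, one per frame, which is what makes the representation readable. In the actual method the prototypes would come from the frozen CLIP text encoder, with learnable prompts adapting them.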

Paper Structure

This paper contains 12 sections, 10 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparisons of CLIP-based approaches. The efficacy of temporal modeling in CLIP-based video methods remains debated as a simple average pooling of frame features achieves superior accuracy. This suggests that naive temporal aggregation suffices when video semantics are frame-dominant. Our method replaces temporal modeling with codebook-based discretization, where we recognize actions not by averaging frames but by chunking visual streams into discrete events. Moreover, our method can mitigate the impact of erroneous and noisy frames through frame scoring.
  • Figure 2: The architecture of VTD-CLIP. We first extract frame and text embeddings using pre-trained CLIP encoders and employ text embeddings to construct a visual codebook. Then, we obtain the discrete feature by discretizing visual embeddings through video-to-text discretization. Finally, we produce video features with confidence fusion.
  • Figure 3: Illustration of confidence-aware fusion. The similarity score is obtained from the video-to-text discretization module.
  • Figure 4: An example of GPT-generated descriptions.
  • Figure 5: Visualization results of VTD-CLIP. We randomly sample seven frames per video to visualize their frame-level discrete labels and fusion confidence scores. LP (learnable prompts): "w/o LP" denotes the VTD-CLIP framework with a non-adaptive static codebook, while our method uses an adaptive dynamic codebook with updates. Red box: VTD-CLIP with a dynamic codebook assigns higher confidence to keyframes, which are easily recognized and carry little redundant information; Blue box: with a dynamic codebook, misclassified frames receive lower confidence scores; Green box: our method with a dynamic codebook produces frame-wise features that are more consistent with the ground-truth annotations.
  • ...and 1 more figure
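
The captions of Figures 3 and 5 describe per-frame confidence scores derived from the discretization similarities. A minimal sketch of one way such a fusion could work is given below. The exact formulation is not stated in this summary, so the softmax weighting, the `temperature` value, the additive combination with the discrete feature, and the names `softmax` and `confidence_fusion` are all assumptions.

```python
import math

def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_fusion(frame_feats, scores, discrete_feat, temperature=0.1):
    """Weight frames by confidence, pool them, and add the discrete feature.

    `scores` are per-frame similarities to the assigned prototype;
    `discrete_feat` is the prototype chosen for the video (assumed).
    """
    weights = softmax(scores, temperature)
    dim = len(discrete_feat)
    pooled = [sum(w * f[d] for w, f in zip(weights, frame_feats))
              for d in range(dim)]
    fused = [p + c for p, c in zip(pooled, discrete_feat)]
    return fused, weights

# Toy example (assumed values): the third frame is noisy and scores low.
frames = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
scores = [0.95, 0.90, 0.20]
discrete_feat = [1.0, 0.0]

fused, weights = confidence_fusion(frames, scores, discrete_feat)
print(weights)
print(fused)
```

Under these assumptions the low-scoring frame receives a weight close to zero, so it contributes little to the pooled video feature. That matches the stated goal of reducing the impact of irrelevant or noisy frames.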