VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Wencheng Zhu; Yuexin Wang; Hongxuan Li; Pengfei Zhu; Qinghua Hu

TL;DR

VTD-CLIP tackles temporal modeling in vision-language video understanding by discretizing visual streams into text-aligned tokens, using a frozen CLIP text encoder as a semantic codebook. Each frame is quantized by hard assignment to its nearest text prototype $\boldsymbol{c}_k$, and learnable prompts drive updates to the codebook, yielding a discrete video representation; a confidence-aware fusion then integrates the discrete and frame-level features for robust recognition. The approach preserves cross-modal generalization and zero-shot capabilities while remaining computationally efficient: on HMDB-51, UCF-101, SSv2, and Kinetics-400 it reports competitive results with lower GFLOPs than many baselines. The work highlights interpretable, text-guided discretization as a viable alternative to heavy temporal modeling in video-language tasks and provides code for reproducibility.

Abstract

Vision-language models bridge visual and linguistic understanding and have proven powerful for video recognition tasks. Existing approaches primarily rely on parameter-efficient fine-tuning of image-text pre-trained models, yet they often suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these issues, we propose a simple yet effective video-to-text discretization framework. Our method repurposes the frozen text encoder to construct a visual codebook from video class labels, which is possible because of the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. This codebook transforms temporal visual data into textual tokens via feature lookups and offers interpretable video representations through explicit video modeling. To enhance robustness against irrelevant or noisy frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by assessing their semantic relevance via the codebook. Furthermore, our method incorporates learnable text prompts to update the codebook adaptively. Extensive experiments on HMDB-51, UCF-101, SSv2, and Kinetics-400 validate our approach, which achieves competitive improvements over state-of-the-art methods. The code will be publicly available at https://github.com/isxinxin/VTD-CLIP.
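The codebook lookup and confidence weighting described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax `temperature`, the additive fusion rule, and the toy embeddings are all assumptions made for illustration. In practice the frame features and text prototypes would come from CLIP's image and text encoders.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def discretize_frames(frame_feats, codebook):
    """Hard-assign every frame to its most similar text prototype.

    frame_feats: (T, D) frame embeddings (assumed to come from an image encoder).
    codebook:    (K, D) text prototypes c_k (assumed: one per class label).
    Returns token indices (T,), per-frame confidence (T,), quantized features (T, D).
    """
    f = l2_normalize(frame_feats)
    c = l2_normalize(codebook)
    sim = f @ c.T              # (T, K) cosine similarities
    idx = sim.argmax(axis=1)   # index of the nearest prototype per frame
    conf = sim.max(axis=1)     # similarity to that prototype
    return idx, conf, c[idx]


def confidence_weights(conf, temperature=0.1):
    """Softmax over frames. `temperature` is a hypothetical knob, not from the paper."""
    z = (conf - conf.max()) / temperature  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


def fuse(frame_feats, quantized, weights):
    """Assumed fusion rule: confidence-weighted sum of frame plus quantized features."""
    return weights @ (l2_normalize(frame_feats) + quantized)


# Toy data: three prototypes in a 3-D embedding space and three frames.
codebook = np.eye(3)
frames = np.array([[0.9, 0.1, 0.0],
                   [0.8, 0.2, 0.0],
                   [0.0, 0.1, 0.9]])

idx, conf, quantized = discretize_frames(frames, codebook)
weights = confidence_weights(conf)
video_feat = fuse(frames, quantized, weights)
```

Here `idx` plays the role of the discrete, text-aligned representation of the video, while `weights` down-weights frames that match their nearest prototype less closely. The learnable prompt-driven codebook update is omitted from this sketch.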
