VicTR: Video-conditioned Text Representations for Activity Recognition

Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

TL;DR

The paper tackles video activity recognition by shifting emphasis from purely visual temporal modeling to language-grounded temporal reasoning. It introduces VicTR, which learns video-conditioned text embeddings through token-boosting, cross-modal and temporal attention, and affinity re-weighting, optionally guided by visually-grounded auxiliary text (e.g. object or scene information). Empirical results across HMDB-51, UCF-101, Kinetics-400, and Charades show strong performance gains in few-shot, zero-shot, and long-form settings, with ablations highlighting the importance of updating text embeddings and the benefit of affinity-based logits. The work demonstrates that leveraging language representations for temporal reasoning yields practical improvements with modest compute overhead, and opens avenues for extending text-conditioned video understanding to other tasks such as video VQA.

Abstract

Vision-Language models (VLMs) have excelled in the image domain, especially in zero-shot settings, thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), often keeping text embeddings unchanged or even discarding them. In this paper, we argue the contrary: that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more-flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g. object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.

Paper Structure (40 sections, 11 equations, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Video-conditioned Text Representations: Pretrained image-VLMs can generate reasonable visual embeddings for videos (e.g. by temporally pooling frame embeddings), together with paired text embeddings. However, these text embeddings usually do not depend on visual information, meaning they are the same for every video. Such representations lack the flexibility to align properly in a shared vision-language latent space when optimized with a contrastive similarity (i.e., affinity) w.r.t. all videos. In contrast, Video-conditioned Text Representations, which specialize uniquely for each video, give text embeddings more freedom to move in the latent space and adapt to different scenarios (e.g. more-challenging recognition tasks).
  • Figure 2: Overview of VicTR: First, we extract image (i.e., frame) and text tokens using a pretrained image-VLM. Next, these tokens go through a joint video-text encoder, generating video tokens and video-conditioned text tokens, from which we compute affinity-based logits for classification. Optionally, any semantic concept (given as auxiliary text) can be processed in the same way to help guide the classifier. This is motivated by the co-occurrence of semantics (e.g. rope, gym, one-person) and categories of interest, i.e., activity classes in our setting (e.g. rope climbing). Here, the color change of text tokens represents the idea of video-conditioning.
  • Figure 3: Detailed view of VicTR compared to prior art: There exist multiple closely-related works on adapting pretrained image-VLMs to video, such as CLIP4Clip [luo2022clip4clip], ActionCLIP [wang2021actionclip], CLIP Hitchhiker's [bain2022cliphitchhiker], EVL [lin2022evl] and X-CLIP [ma2022xclip]. All of these follow a common framework (top-left). Text prompts and video frames are first encoded using two separate encoders, and then fed into a video head to enable temporal reasoning. Using text tokens within the video head is optional. Often, text information is kept unchanged [luo2022clip4clip, wang2021actionclip], or even discarded [lin2022evl] (bottom-left). CLIP Hitchhiker's [bain2022cliphitchhiker], however, uses text as conditioning to generate text-conditioned video embeddings. X-CLIP [ma2022xclip], which is the closest to our method, jointly optimizes visual and text tokens. However, it provides limited information for text to contrast against (only temporally-aggregated visual embeddings), and shows marginal gains from updating text. In contrast, VicTR allows text to contrast against both fine-grained visual and other text information, while also jointly optimizing both modalities. We generate video-conditioned text representations, i.e., text uniquely specialized for each video (refer to Fig. 1). Our video head consists of three key operations: (1) Token-boosting, (2) Cross-modal attention, and (3) Affinity (re-)weighting (right). Token-boosting creates dedicated text tokens per video and per timestep, weighted by the per-frame affinities of a given video. These enable us to model variations of semantics (represented as text) over time. Cross-modal attention enables message passing between both visual-textual and textual-textual modes, creating a better contrastive representation. Affinity (re-)weighting highlights or down-plays each text class, grounded on visual information. Such affinity weights are similar to the ones in the CLIP [radford2021clip] training objective, making the optimization more consistent. Also, optionally, VicTR can make use of auxiliary semantics (e.g. object, scene, human-subjects) given as visually-grounded text (refer to Fig. 2). Such auxiliary semantics help align our video-conditioned text embeddings in the latent space.
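
The three video-head operations named in the Figure 3 caption can be illustrated with a small NumPy sketch. This is only one plausible reading of the caption, not the authors' implementation: the function names, the softmax normalization of affinities, the single-head attention without learned projections, and the mean pooling over time are all assumptions made here for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_boost(frames, text):
    """frames: (T, D) frame embeddings, text: (C, D) class text embeddings.
    Returns per-timestep text tokens (T, C, D) and per-frame affinities (T, C)."""
    affinity = l2_normalize(frames) @ l2_normalize(text).T   # cosine similarity
    weights = softmax(affinity, axis=-1)                     # assumed normalization
    boosted = weights[:, :, None] * text[None, :, :]         # one text token per frame
    return boosted, affinity

def cross_modal_attention(frames, boosted):
    """Assumed simplification: at each timestep, the frame token and the C text
    tokens attend to each other (visual-textual and textual-textual)."""
    dim = boosted.shape[-1]
    tokens = np.concatenate([frames[:, None, :], boosted], axis=1)  # (T, C+1, D)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(dim)      # (T, C+1, C+1)
    attended = softmax(scores, axis=-1) @ tokens                    # (T, C+1, D)
    return attended[:, 0, :], attended[:, 1:, :]                    # video, text

def affinity_reweight(video_tokens, text_tokens):
    """Scale each text token by its affinity to the video token of its timestep."""
    affinity = np.einsum("td,tcd->tc",
                         l2_normalize(video_tokens), l2_normalize(text_tokens))
    return affinity[:, :, None] * text_tokens

def affinity_logits(video_tokens, text_tokens):
    """Mean-pool over time (assumption), then take cosine similarity per class."""
    video = l2_normalize(video_tokens.mean(axis=0))   # (D,)
    text = l2_normalize(text_tokens.mean(axis=0))     # (C, D)
    return text @ video                               # (C,)

def victr_head_sketch(frames, text):
    boosted, _ = token_boost(frames, text)
    video_tokens, text_tokens = cross_modal_attention(frames, boosted)
    text_tokens = affinity_reweight(video_tokens, text_tokens)
    return affinity_logits(video_tokens, text_tokens)
```

Calling `victr_head_sketch(frames, text)` with `frames` of shape `(T, D)` and `text` of shape `(C, D)` returns one logit per class. The point of the sketch is that the text tokens entering the final affinity are different for every video, unlike the fixed text embeddings of the common framework described above.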