VicTR: Video-conditioned Text Representations for Activity Recognition

Kumara Kahatapitiya; Anurag Arnab; Arsha Nagrani; Michael S. Ryoo

TL;DR

The paper tackles video activity recognition by shifting emphasis from purely visual temporal modeling to language-grounded temporal reasoning. It introduces VicTR, which learns video-conditioned text embeddings through token-boosting, cross-modal and temporal attention, and affinity re-weighting, optionally guided by auxiliary semantically-grounded text. Empirical results across HMDB-51, UCF-101, Kinetics-400, and Charades show strong performance gains in few-shot, zero-shot, and long-form settings, with ablations highlighting the importance of updating text embeddings and the benefit of affinity-based logits. The work demonstrates that leveraging language representations for temporal reasoning yields practical improvements with modest compute overhead, and opens avenues for extending text-conditioned video understanding to other tasks such as video VQA.

Abstract

Vision-Language models (VLMs) have excelled in the image domain, especially in zero-shot settings, thanks to the availability of vast pretraining data (i.e., paired image-text samples). For videos, however, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), often keeping text embeddings unchanged or even discarding them. In this paper, we argue the contrary: that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g. object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.

Paper Structure (40 sections, 11 equations, 3 figures, 8 tables)

Introduction
Related Work
Video understanding
Vision-Language Models (VLMs)
Adapting image-text models to video
Background: image-VLMs to video
Video-conditioned Text Representations
Token-boosting
Cross-modal and Temporal attention
Affinity (re-)weighting
Classifier
Discussion on design decisions
Auxiliary semantic information
Alternative weighting schemes
Visual-only or Text-only classifiers
...and 25 more sections

Figures (3)

Figure 1: Video-conditioned Text Representations: Pretrained image-VLMs can generate reasonable visual embeddings for videos (e.g. by temporally pooling frame embeddings), together with paired text embeddings. Usually, however, these text embeddings do not depend on visual information, meaning they are shared across all videos. Such representations lack the flexibility to align properly in a shared vision-language latent space when optimized based on a contrastive similarity (i.e., Affinity) w.r.t. all videos. With Video-conditioned Text Representations that specialize uniquely for each video, we grant text embeddings more freedom to move in the latent space and adapt to different scenarios (e.g. more challenging recognition tasks).
Figure 2: Overview of VicTR: First, we extract image (i.e., frame) and text tokens using a pretrained image-VLM. Next, these tokens go through a joint video-text encoder, generating video tokens and video-conditioned text tokens, from which we compute affinity-based logits for classification. Optionally, any semantic concept (given as auxiliary text) can be processed in the same way to help guide the classifier. This is motivated by the co-occurrence of semantics (e.g. rope, gym, one-person) and categories-of-interest, i.e., activity classes in our setting (e.g. rope climbing). Here, the color change of text tokens represents the idea of video-conditioning.
Figure 3: Detailed view of VicTR compared to prior art: Several closely related works adapt pretrained image-VLMs to video, such as CLIP4Clip [luo2022clip4clip], ActionCLIP [wang2021actionclip], CLIP Hitchhiker's [bain2022cliphitchhiker], EVL [lin2022evl] and X-CLIP [ma2022xclip]. All of these follow a common framework (top-left). Text prompts and video frames are first encoded using two separate encoders, and then fed into a video head to enable temporal reasoning. Using text tokens within the video head is optional. Often, text information is kept unchanged [luo2022clip4clip, wang2021actionclip], or even discarded [lin2022evl] (bottom-left). CLIP Hitchhiker's [bain2022cliphitchhiker], however, uses text as conditioning to generate text-conditioned video embeddings. X-CLIP [ma2022xclip], which is the closest to our method, jointly optimizes visual and text tokens. However, it gives text limited information to contrast against (only temporally-aggregated visual embeddings), and shows marginal gains from updating text. In contrast, VicTR allows text to contrast against both fine-grained visual and other text information, while also jointly optimizing both modalities. We generate video-conditioned text representations, i.e., text uniquely specialized for each video (refer to Fig. 1). Our video head consists of three key operations: (1) Token-boosting, (2) Cross-modal attention, and (3) Affinity (re-)weighting (right). Token-boosting creates dedicated text tokens per video and per timestep, weighted by the per-frame affinities of a given video. These enable us to model variations of semantics (represented as text) over time. Affinity (re-)weighting highlights or downplays each text class, grounded on visual information. Such affinity weights are similar to the ones in the CLIP [radford2021clip] training objective, making the optimization more consistent. Cross-modal attention enables message passing across both visual-textual and textual-textual pairs, creating a better contrastive representation. Optionally, VicTR can also make use of auxiliary semantics (e.g. object, scene, human-subjects) given as visually-grounded text (refer to Fig. 2). Such auxiliary semantics help align our video-conditioned text embeddings in the latent space.
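
The token-boosting step described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the use of cosine similarity as the per-frame affinity, the mean-pooling over time, and the random inputs are all assumptions made for illustration, and the cross-modal and temporal attention layers are omitted entirely.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def frame_text_affinity(frame_emb, text_emb):
    """Cosine similarity between every frame and every class prompt.

    frame_emb: (T, D) per-frame visual embeddings
    text_emb:  (C, D) per-class text embeddings
    returns:   (T, C) affinities (assumed form; the paper may differ)
    """
    return l2_normalize(frame_emb) @ l2_normalize(text_emb).T


def token_boost(frame_emb, text_emb):
    """Create one text token per timestep and per class.

    Each class embedding is copied T times and scaled by that frame's
    affinity to the class, giving a (T, C, D) tensor of text tokens.
    """
    aff = frame_text_affinity(frame_emb, text_emb)   # (T, C)
    return aff[:, :, None] * text_emb[None, :, :]    # (T, C, D)


def affinity_logits(frame_emb, boosted_text):
    """Hypothetical classifier: pooled video vs. pooled text tokens.

    Stands in for the affinity-based logits; the attention layers that
    would update both token sets before this step are left out.
    """
    video = l2_normalize(frame_emb.mean(axis=0))     # (D,)
    text = l2_normalize(boosted_text.mean(axis=0))   # (C, D)
    return text @ video                              # (C,)


# Random stand-ins for the outputs of a pretrained image-VLM.
rng = np.random.default_rng(0)
T, C, D = 8, 5, 16                                   # frames, classes, dim
frames = rng.normal(size=(T, D))
prompts = rng.normal(size=(C, D))

boosted = token_boost(frames, prompts)
logits = affinity_logits(frames, boosted)
print(boosted.shape, logits.shape)                   # (8, 5, 16) (5,)
```

In this sketch the text tokens differ from video to video only through the affinity scaling. In the method described above, the subsequent attention layers would further update these tokens against the visual tokens and against each other.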
