
VidLA: Video-Language Alignment at Scale

Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Z. Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi

TL;DR

VidLA tackles video-language alignment at scale by combining a simple two-tower architecture with hierarchical temporal attention to capture multi-scale temporal dependencies, while initializing from pretrained image-text encoders. It builds a large, semantically grounded training corpus (YT-VidLA-800M) using LLM-driven caption generation and subtitle summarization across clips of varying durations, enabling effective training with long-range semantics. Empirically, VidLA achieves state-of-the-art retrieval on multiple benchmarks, with pronounced gains for longer videos, and shows competitive classification results, demonstrating the utility of scalable data curation and a lightweight yet expressive temporal model. The work highlights the practical impact of leveraging LLM-based data augmentation and hierarchical attention to scale video-language alignment.

Abstract

In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome it, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets which only contain short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.

Paper Structure

This paper contains 20 sections, 7 equations, 5 figures, 17 tables.

Figures (5)

  • Figure 1: Recall@1 performance on retrieval benchmarks compared to previous SoTA with ViT-B scale models.
  • Figure 2: Overview of our video-language alignment training approach. In a two-tower architecture, the text encoder and the video encoder (with hierarchical temporal attention) are trained with InfoNCE losses to align video representations with subtitle and caption text representations simultaneously. We generate the captions using a multimodal LLM and use an LLM to summarize the caption and subtitle texts.
  • Figure 3: Summary of the tokens and the attention mechanisms used to update them in our proposed Hierarchical Temporal Attention. This toy example uses $N=4$ patches, $T=4$ frames, $U=2$ levels of temporal hierarchy, $V=1$ [mst] token per level, and temporal scale $r=2$. Hierarchical temporal attention can be factorized into two parts. Spatially Local Temporal Attention (left): each patch token attends only to its neighbors across time. For instance, the first patch token of the first frame is updated by attending only to the first patch token of all the other frames. Global Spatio-temporal Attention (right): to capture global spatio-temporal semantics efficiently, we update the patch tokens by attending to other patch tokens from the same frame as well as all the [mst] tokens. The third and fourth columns depict the hierarchical [mst] token update mechanism. In particular, the third column shows that [mst]-0 is updated by attending to all the patch tokens and other [mst] tokens of lower temporal resolution. The fourth column demonstrates the multi-scale [mst] attention mechanism, where the second [mst] token, [mst]-1, attends only to patch tokens from a subset of frames with a higher stride. The [cls] token acts as an aggregator and attentively pulls information from both [mst] and patch tokens.
  • Figure 4: Retrieval performance on MSR-VTT compared to other attention mechanisms. Left: R@1 for validation videos grouped into 3 bins by duration. VidLA consistently improves over baselines for all video durations. Right: Scaling up the pretraining dataset improves performance. Our architecture improves over other attention mechanisms at all data scales.
  • Figure 5: Effect of image-language pretraining on MSR-VTT.
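
The Figure 2 caption describes aligning video representations with both subtitle and caption text representations via InfoNCE losses. The following is a minimal sketch of such an objective in plain Python, not the authors' implementation. The function names, the temperature of 0.07, the symmetric (video-to-text plus text-to-video) form, and the unweighted sum of the subtitle and caption losses are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    video_emb[i] and text_emb[i] are treated as the positive pair;
    every other pairing in the batch serves as a negative.
    The temperature value is an illustrative assumption.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    # Cosine similarity matrix, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def alignment_loss(video_emb, subtitle_emb, caption_emb, temperature=0.07):
    """Hypothetical combined objective: video-subtitle plus video-caption InfoNCE.

    The equal weighting of the two terms is an assumption, not taken from the paper.
    """
    return (info_nce(video_emb, subtitle_emb, temperature)
            + info_nce(video_emb, caption_emb, temperature))


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    subtitles = [[0.9, 0.1], [0.2, 0.8]]
    captions = [[1.0, 0.1], [0.0, 1.0]]
    print(alignment_loss(videos, subtitles, captions))
```

In this sketch, the loss shrinks as each video embedding becomes more similar to its own subtitle and caption embeddings than to those of the other videos in the batch.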