AdaVid: Adaptive Video-Language Pretraining

Chaitanya Patel, Juan Carlos Niebles, Ehsan Adeli

TL;DR

AdaVid tackles the high compute cost of video-language pretraining and long-form video processing by introducing an adaptive transformer architecture whose embedding dimension can be varied per layer during inference. The approach centers on a Modular Adaptive Transformer Layer inspired by Matryoshka Representation Learning, enabling a single model to operate across a range of compute budgets. The authors instantiate AdaVid-EgoVLP for short videos and AdaVid-Agg for long videos, matching or improving on EgoVLP and HierVL under various compute conditions, and showing that the approach scales to more frames and longer videos. This work supports edge and wearable deployments by offering a compute-accuracy trade-off mechanism and a hierarchical aggregation strategy that maintains strong performance with reduced resources. Overall, AdaVid advances practical, flexible video-language pretraining by letting capacity be chosen adaptively across model layers and temporal scales.

Abstract

Contrastive video-language pretraining has demonstrated great success in learning rich and robust video representations. However, deploying such video encoders on compute-constrained edge devices remains challenging due to their high computational demands. Additionally, existing models are typically trained to process only short video clips, often limited to 4 to 64 frames. In this paper, we introduce AdaVid, a flexible architectural framework designed to learn efficient video encoders that can dynamically adapt their computational footprint based on available resources. At the heart of AdaVid is an adaptive transformer block, inspired by Matryoshka Representation Learning, which allows the model to adjust its hidden embedding dimension at inference time. We show that AdaVid-EgoVLP, trained on video-narration pairs from the large-scale Ego4D dataset, matches the performance of the standard EgoVLP on short video-language benchmarks using only half the compute, and even outperforms EgoVLP when given equal computational resources. We further explore the trade-off between frame count and compute on the challenging Diving48 classification benchmark, showing that AdaVid enables the use of more frames without exceeding computational limits. To handle longer videos, we also propose a lightweight hierarchical network that aggregates short clip features, achieving a strong balance between compute efficiency and accuracy across several long video benchmarks.

Paper Structure

This paper contains 18 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: A single AdaVid-trained video model facilitates inference with controllable computational footprint without any postprocessing. It allows one model to adjust its computational demands dynamically according to the requirements, thereby eliminating the need to train multiple distinct models.
  • Figure 2: AdaVid Framework is designed to train video encoders that facilitate adaptive compute-efficient inference. (a) Key component of AdaVid is the Adaptive Transformer Layer, which is designed to handle input tokens of varying dimension sizes up to $D$. During each training iteration, each layer processes the input tokens with a randomly selected dimension size, enforcing a coarse-to-fine structure in the model's weights and activations. This allows an AdaVid-trained model to perform inference with a controllable compute footprint. (b) The feedforward layer $W_2\;\sigma(W_1x+b_1)+b_2$ of the transformer can be modified to accommodate input tokens of size $D/2$ by appropriately slicing the weight and bias parameters. This approach is also applicable to the affine transformation of layer normalization. (c) In multi-head attention, input tokens of size $D/2$ are processed using half the number of heads, rather than reducing the dimension of each head.
  • Figure 3: AdaVid-EgoVLP on two EgoMCQ benchmarks: AdaVid-EgoVLP-dec, trained with decreasing dimensions for deeper layers, performs better than AdaVid-EgoVLP-inc, which was trained with increasing dimensions. AdaVid-EgoVLP-dec performs better than baselines while using maximum compute resources. The same model also retains high accuracy when evaluated with the low-compute evaluation configurations from the evaluation configurations table (label tab:eval_configs).
  • Figure 4: Results on Diving-48: We evaluate AdaVid using various evaluation configurations from the evaluation configurations table (label tab:eval_configs) with 64 and 128 frames. With adaptive compute, AdaVid can process more frames efficiently, outperforming vanilla-trained baselines.
  • Figure 5: Results on SummaryMCQ: AdaVid-Agg achieves comparable performance to HierVL baselines with full embedding dimensions, while also demonstrating robust performance with significantly reduced computational resources as needed.
  • ...and 4 more figures
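
The slicing described in the Figure 2 caption, parts (b) and (c), can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`sliced_ffn`, `sliced_attention`), the weight layout (rows are output features), the tanh-approximated GELU standing in for $\sigma$, and the choice to shrink the feedforward hidden width in proportion to the token width are all assumptions made for illustration; the caption only states that weights and biases are sliced and that narrower tokens use fewer heads of unchanged size.

```python
import numpy as np

def gelu(x):
    # Assumption: tanh-approximated GELU stands in for the generic sigma in the caption.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sliced_ffn(x, W1, b1, W2, b2):
    """Compute W2 sigma(W1 x + b1) + b2 for tokens of width d <= D by slicing.

    x: (n, d); W1: (H, D); b1: (H,); W2: (D, H); b2: (D,).
    """
    d = x.shape[-1]
    H, D = W1.shape
    h = H * d // D  # assumption: hidden width shrinks in proportion to d
    hidden = gelu(x @ W1[:h, :d].T + b1[:h])
    return hidden @ W2[:d, :h].T + b2[:d]

def sliced_attention(x, Wq, Wk, Wv, Wo, head_dim):
    """Self-attention for tokens of width d using the first d // head_dim heads.

    Each head keeps its full size head_dim; only the head count changes.
    x: (n, d); Wq, Wk, Wv, Wo: (D, D).
    """
    n, d = x.shape
    heads = d // head_dim

    def split(W):
        # Project with the top-left (d, d) slice, then reshape to (heads, n, head_dim).
        return (x @ W[:d, :d].T).reshape(n, heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(Wq), split(Wk), split(Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    merged = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ Wo[:d, :d].T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, H, head_dim, n = 8, 32, 2, 5
    W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
    W2, b2 = rng.normal(size=(D, H)), rng.normal(size=D)
    Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) for _ in range(4))
    half = rng.normal(size=(n, D // 2))  # tokens of width D/2
    print(sliced_ffn(half, W1, b1, W2, b2).shape)
    print(sliced_attention(half, Wq, Wk, Wv, Wo, head_dim).shape)
```

Because only the leading rows and columns of each parameter are read, the same stored weights serve every width, which is what lets one trained model run at several compute budgets. The affine parameters of layer normalization could be sliced the same way (first `d` entries of the scale and shift vectors).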