Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

Jinhui Yi, Syed Talal Wasim, Yanan Luo, Muzammal Naseer, Juergen Gall

TL;DR

Video-Panda is an encoder-free approach to video-language understanding built around the Spatio-Temporal Alignment Block (STAB), which processes videos directly and aligns them with a language model through a dedicated training pipeline. STAB decomposes video representation into Local Spatio-Temporal Encoding (LSTE), Local Spatial Downsampling (LSD), and frame-level and video-level aggregators (FSRA and GSTRA) to produce frame-specific context tokens that are fused with LLM embeddings. The model is trained in three stages (initial alignment with a frozen LLM on WebVid data, end-to-end visual-language integration, and instruction tuning with video-centric data) and employs a distillation loss against a teacher (LanguageBind). Empirically, Video-Panda achieves competitive open-ended and fine-grained VideoQA results across MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA while using only $45\mathrm{M}$ visual parameters. It delivers 3-4$\times$ speedups and at least a $6.5\times$ parameter reduction compared to encoder-based baselines, demonstrating the practicality of encoder-free video-language modeling for efficient deployment and longer videos.

Abstract

We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders while using only 45M parameters for visual processing, at least a 6.5$\times$ reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention, and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model's effectiveness, outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects like correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach while achieving 3-4$\times$ faster processing speeds than previous methods. Code is available at https://jh-yi.github.io/Video-Panda.
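
As a rough check on the parameter figures quoted in the abstract, the short sketch below divides each quoted encoder size by the 45M visual parameters of STAB. The abstract does not say which baseline the 6.5$\times$ figure is measured against, so pairing it with the smallest quoted encoder (300M) is an assumption, and `reduction_factor` is a hypothetical helper name.

```python
def reduction_factor(baseline_params: float, ours_params: float) -> float:
    """Return how many times smaller `ours_params` is than `baseline_params`."""
    return baseline_params / ours_params


STAB_PARAMS = 45e6  # visual parameters quoted in the abstract

# Encoder sizes quoted in the abstract (lower and upper ends of each range).
QUOTED_ENCODERS = {
    "image encoder (low)": 300e6,
    "image encoder (high)": 1.1e9,
    "video encoder (low)": 1.0e9,
    "video encoder (high)": 1.4e9,
}

for name, params in QUOTED_ENCODERS.items():
    print(f"{name}: {reduction_factor(params, STAB_PARAMS):.1f}x")
```

Under this assumption, even the smallest quoted encoder is more than 6.5 times larger than the 45M-parameter visual component, which is consistent with the "at least" wording.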

Paper Structure

This paper contains 20 sections, 13 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Model performance on MSVD-QA versus the model size of the visual component in logarithmic scale. The bubble size indicates the amount of finetuning data (in thousands). Models using the same training dataset as ours (100K samples) are shown in dark green, while those using different datasets are in blue.
  • Figure 2: Existing video-language model architectures. From left to right: early approaches use image encoders for both image and video inputs, and the alignment module aligns the embeddings of the visual modality with the language modality. The integration of Q-Former improved this alignment. Instead of a single encoder, dual-encoder approaches have separate encoders for images and videos, and the alignment block consists of projection layers. The additional encoders, however, make these models very heavy: the alignment module and encoders have at least 300M and sometimes over 1B parameters. In contrast, our encoder-free design (rightmost) directly processes video inputs through a novel spatio-temporal alignment block (STAB). It eliminates the need for heavyweight pretrained encoders and requires fewer than 50M parameters.
  • Figure 3: Detailed architecture of our Spatio-Temporal Alignment Block (STAB): The input video is first converted into patches. The Local Spatio-Temporal Encoding (LSTE) uses 3D convolutions to model spatio-temporal relations and adds a 3D convolution dynamic position encoding (DPE) to encode position with respect to the local spatio-temporal window. As a result, we obtain per-frame tokens with positional encoding. The tokens are then processed in two ways. While the Global Spatio-Temporal Relationship Aggregator (GSTRA) at the top captures video-level context, the Frame-wise Spatial Relationship Aggregator (FSRA) at the bottom captures spatial context within each frame. To reduce the cost, we perform a Local Spatial Downsampling (LSD) to reduce the spatial dimension for each token. The video-level context tokens and the frame-wise spatial tokens are then linearly combined through learnable weighted fusion ($\alpha$), producing a frame-specific context token. These context tokens are then prepended to their respective frame's flattened spatial tokens, with $\texttt{<row>}$ split tokens inserted to demarcate row boundaries in the spatial layout. This combination of global context and preserved spatial structure enables effective video understanding while maintaining computational efficiency.
  • Figure 4: Qualitative examples showing the impact of removing Frame-wise Spatial Relationship Aggregator (FSRA) and Global Spatio-Temporal Relationship Aggregator (GSTRA).
  • Figure 5: Qualitative comparisons of different design choices of Video-Panda. The figure presents eight video examples with ground truth (GT) annotations and model predictions under different training configurations. The top row demonstrates the effect of 702K training samples in stage 1 (left) and the impact of performing Local Spatial Downsampling (LSD) before Local Spatio-Temporal Encoding (LSTE) (right). The second row and the left of the third row show results from removing LSD while using average pooling (second row left), half-resolution (second row right), and perceiver resampler (third row left). The right of the third row and the fourth row illustrate the effects of different teacher models for knowledge distillation: CLIP (third row right), Intern-Video (fourth row left), and DINOv2 (fourth row right). Each example includes the original model prediction (yellow) and an ablated version (purple), highlighting how architectural and training choices affect Video-Panda's ability to interpret dynamic visual scenes and answer questions. The qualitative examples are from the MSVD-QA dataset.
  • ...and 3 more figures
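
The token layout described in the Figure 3 caption (weighted fusion of the video-level and frame-level context, followed by prepending the result to the frame's row-delimited spatial tokens) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The caption only says the two context tokens are "linearly combined" through $\alpha$, so the convex combination used here is an assumption, as is placing one `<row>` token at the end of every row. The function names `fuse_context` and `layout_frame_tokens` are hypothetical.

```python
from typing import List

Vector = List[float]


def fuse_context(global_ctx: Vector, frame_ctx: Vector, alpha: float) -> Vector:
    """Combine video-level and frame-level context into one context token.

    Assumption: alpha * global + (1 - alpha) * frame. In the model, alpha is
    learnable; here it is a plain constant for illustration.
    """
    return [alpha * g + (1.0 - alpha) * f for g, f in zip(global_ctx, frame_ctx)]


def layout_frame_tokens(
    context: Vector,
    spatial: List[List[Vector]],  # H rows, each holding W token vectors
    row_token: Vector,            # stand-in embedding for the <row> split token
) -> List[Vector]:
    """Build one frame's sequence: [context, row_0..., <row>, row_1..., <row>, ...]."""
    tokens: List[Vector] = [context]
    for row in spatial:
        tokens.extend(row)        # flattened spatial tokens of this row
        tokens.append(row_token)  # assumed: one <row> marker closes each row
    return tokens


# Toy frame: H=2 rows, W=3 tokens per row, embedding size 2.
spatial = [[[float(r), float(c)] for c in range(3)] for r in range(2)]
context = fuse_context([1.0, 0.0], [0.0, 1.0], alpha=0.25)
sequence = layout_frame_tokens(context, spatial, row_token=[-1.0, -1.0])
print(len(sequence))  # 1 context token + H * (W + 1) tokens
```

Under these assumptions, each frame contributes $1 + H(W+1)$ tokens to the language model input, so reducing $H$ and $W$ through downsampling directly shortens the sequence.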