
CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu

TL;DR

CoPE-VideoLM addresses the inefficiency of dense RGB-frame processing in Video Language Models by exploiting compressed-domain codec primitives (motion vectors and residuals) to tokenize videos more sparsely. It introduces a Delta-Encoder that produces compact tokens from P-frames; these tokens are aligned with image embeddings via two-stage pre-training and end-to-end fine-tuning with a VideoLM. The approach reduces time-to-first-token by up to $86\%$ and token usage by up to $93\%$, while maintaining or surpassing state-of-the-art open-source performance across 14 benchmarks spanning general QA, temporal reasoning, long-form understanding, and spatial scene understanding. These results indicate that codec-aware tokenization enables long-context video understanding at a fraction of the computational cost, with a flexible trade-off between keyframe density and codec-primitive density. The work has practical implications for real-time video understanding and motivates extensions to B-frames and richer codec primitives.

Abstract

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
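
To make the token-usage argument concrete, the following is a minimal sketch of how the visual token count changes when only I-frames receive full image tokens and P-frames receive a few compact tokens. The function names, the fixed GOP size, and every number in the example (200 tokens per I-frame, 10 tokens per P-frame, one I-frame every 8 frames) are illustrative assumptions, not values taken from the paper.

```python
def dense_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Token count when every frame is encoded as a full RGB image."""
    return num_frames * tokens_per_frame


def codec_aware_tokens(num_frames: int, gop_size: int,
                       tokens_per_iframe: int, tokens_per_pframe: int) -> int:
    """Token count when one frame per GOP is an I-frame and the rest are P-frames.

    Assumes a fixed GOP size (hypothetical simplification): the first frame
    of each group is the I-frame, all remaining frames are P-frames.
    """
    num_iframes = -(-num_frames // gop_size)  # ceiling division
    num_pframes = num_frames - num_iframes
    return num_iframes * tokens_per_iframe + num_pframes * tokens_per_pframe


def token_reduction(num_frames: int, gop_size: int,
                    tokens_per_iframe: int, tokens_per_pframe: int) -> float:
    """Fraction of tokens saved relative to dense per-frame encoding."""
    dense = dense_tokens(num_frames, tokens_per_iframe)
    sparse = codec_aware_tokens(num_frames, gop_size,
                                tokens_per_iframe, tokens_per_pframe)
    return 1.0 - sparse / dense


if __name__ == "__main__":
    # Illustrative numbers only: 64 frames, one I-frame every 8 frames.
    print(codec_aware_tokens(64, 8, 200, 10))
    print(round(token_reduction(64, 8, 200, 10), 5))
```

Under these assumed numbers, the saving grows as the GOP gets longer or as P-frames get fewer tokens, which is the trade-off between keyframe density and codec-primitive density that the abstract describes.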

Paper Structure

This paper contains 26 sections, 10 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: CoPE-VideoLM is a codec-aware tokenization framework for Video Language Models that replaces dense RGB frame encoding with lightweight structured representations derived from codec primitives. Instead of treating every frame as a full image, the model processes only sparse I-frames through a vision encoder, while P-frames are converted into compact tokens using their motion vectors and residuals. By leveraging the inherent sparsity and structure of standard video codecs, CoPE-VideoLM avoids redundant RGB processing, reduces visual token usage by up to $93\%$, and cuts time-to-first-token by up to $86\%$, compared to standard dense VideoLMs.
  • Figure 2: Overview of our pipeline. Given a video in its raw codec representation, our framework leverages the GOP structure for efficient, codec-aware tokenization. I-frames are processed by a standard frozen vision encoder ($\phi_{\text{RGB}}$) to produce dense RGB tokens. P-frames, however, bypass full RGB decoding. Their raw components, motion vectors and residuals, are instead fed into our lightweight $\Delta$-Encoder ($\phi_{\Delta}$) to generate a small set of highly compact $\Delta$-tokens. The final token stream, an interleaved sequence of I-frame tokens and $\Delta$-tokens, is consumed by the LLM, enabling dense temporal coverage at a fraction of the standard token count and runtime.
  • Figure 3: The $\Delta$-Encoder processes motion vectors and residuals through two lightweight branches designed to extract and compress codec-domain information. The resulting motion and residual tokens are concatenated to form the $\Delta$-tokens used for P-frames, an efficient representation that is projected to the RGB token space during pre-training.
  • Figure 4: Video length vs. token budget. Theoretical scaling plot showing token efficiency across configurations. The x-axis is logarithmic in token budget, and vertical dashed lines indicate evaluated budgets. Our $\Delta$-token representation enables scaling to significantly longer videos without exceeding context limits.
  • Figure 5: Codec primer. We visualize from left to right: the previous frame, the motion vectors and residuals between previous and current frame, the intermediate reconstruction after motion compensation, and the final result after adding the residuals.
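
The reconstruction described in the Figure 5 caption (motion compensation of the previous frame, then adding the residuals) can be sketched in a few lines. This is a simplified illustration rather than the paper's code or a real decoder: it assumes a single reference frame, a fixed square block size, integer-pixel motion vectors stored as `(dy, dx)` offsets into the previous frame, and frame dimensions divisible by the block size. The names `motion_compensate` and `reconstruct_p_frame` are hypothetical.

```python
def motion_compensate(prev, mvs, block=4):
    """Build the intermediate prediction by copying blocks from the previous frame.

    prev: 2D list of pixel values (previous frame).
    mvs:  2D list with one (dy, dx) integer offset per block (assumed layout).
    """
    h, w = len(prev), len(prev[0])
    pred = [[0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = mvs[by // block][bx // block]
            # Clamp the source block so it stays inside the previous frame.
            sy = min(max(by + dy, 0), h - block)
            sx = min(max(bx + dx, 0), w - block)
            for y in range(block):
                for x in range(block):
                    pred[by + y][bx + x] = prev[sy + y][sx + x]
    return pred


def reconstruct_p_frame(prev, mvs, residual, block=4):
    """Final frame = motion-compensated prediction + residual."""
    pred = motion_compensate(prev, mvs, block)
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(pred, residual)]


if __name__ == "__main__":
    prev = [[y * 8 + x for x in range(8)] for y in range(8)]
    # The top-right block is predicted from the top-left block of prev.
    mvs = [[(0, 0), (0, -4)], [(0, 0), (0, 0)]]
    residual = [[1] * 8 for _ in range(8)]
    current = reconstruct_p_frame(prev, mvs, residual)
    print(current[0])
```

Real codecs add variable block sizes, sub-pixel motion vectors, and multiple reference frames, but the two inputs stay the same: a motion field and a residual, which are the primitives the $\Delta$-Encoder consumes for P-frames.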