CoPE-VideoLM: Codec Primitives For Efficient Video Language Models
Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
TL;DR
CoPE-VideoLM addresses the inefficiency of dense RGB-frame processing in Video Language Models by exploiting compressed-domain codec primitives (motion vectors and residuals) to tokenize videos more sparsely. It introduces a Delta-Encoder that produces compact tokens from P-frames, aligned with image embeddings via two-stage pre-training and end-to-end fine-tuning with a VideoLM. The approach reduces time-to-first-token by up to $86\%$ and token usage by up to $93\%$, while maintaining or surpassing state-of-the-art open-source performance across 14 benchmarks spanning general QA, temporal reasoning, long-form understanding, and spatial scene understanding. These results demonstrate that codec-aware tokenization enables long-context video understanding at a fraction of the computational cost, with a flexible trade-off between keyframe density and codec-primitive density. The work has practical implications for real-time video understanding and motivates extensions to handle B-frames and richer codec primitives.
Abstract
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
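To make the token-usage argument concrete, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It assumes a video is split into fixed-size groups of pictures, that the first frame of each group is encoded as a full image, and that every remaining frame is represented by a small number of codec-primitive tokens. The function name `token_budget` and all numeric values (196 tokens per keyframe, 8 tokens per codec-primitive frame, a group size of 8) are hypothetical placeholders chosen for illustration, not figures from the paper.

```python
def token_budget(num_frames, gop_size, tokens_per_keyframe, tokens_per_delta_frame):
    """Compare token counts for dense encoding vs. keyframe + codec-primitive encoding.

    Assumption (illustrative only): one fully encoded keyframe per group of
    `gop_size` frames; all other frames use `tokens_per_delta_frame` tokens.
    Returns (dense_tokens, sparse_tokens, fractional_saving).
    """
    if num_frames < 1 or gop_size < 1:
        raise ValueError("num_frames and gop_size must be positive")

    # Ceiling division: a partial trailing group still needs its own keyframe.
    num_keyframes = -(-num_frames // gop_size)
    num_delta_frames = num_frames - num_keyframes

    # Dense baseline: every frame is tokenized as a full image.
    dense_tokens = num_frames * tokens_per_keyframe

    # Sparse scheme: full tokens for keyframes, compact tokens elsewhere.
    sparse_tokens = (num_keyframes * tokens_per_keyframe
                     + num_delta_frames * tokens_per_delta_frame)

    saving = 1.0 - sparse_tokens / dense_tokens
    return dense_tokens, sparse_tokens, saving


if __name__ == "__main__":
    # Hypothetical settings, not values reported in the paper.
    dense, sparse, saving = token_budget(
        num_frames=64, gop_size=8,
        tokens_per_keyframe=196, tokens_per_delta_frame=8,
    )
    print(f"dense={dense} sparse={sparse} saving={saving:.1%}")
```

Under these assumptions, increasing `gop_size` or decreasing `tokens_per_delta_frame` lowers the token count, which mirrors the trade-off between keyframe density and codec-primitive density described above. The actual savings depend on the encoder configuration used in the paper.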
