OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

TL;DR

OneVision-Encoder reframes visual understanding as a predictive compression problem by aligning transformer-based vision with the intrinsic structure of video signals. It introduces Codec Patchification to selectively encode 3.1%–25% of patches based on HEVC-derived motion and residual cues, supported by a unified 3D-RoPE for irregular spatiotemporal layouts and a large-scale cluster-discrimination training objective. The approach is validated through two-stage pretraining on image, video, and OCR data, followed by extensive LMM probing and attentive probing, where OV-Encoder consistently surpasses strong baselines under fixed token budgets and shows state-of-the-art representation quality on multiple benchmarks. The work demonstrates that codec-aligned patch sparsity yields a scalable, high-performance foundation for universal multimodal visual intelligence, with practical implications for efficient video understanding and vision-language modeling.
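
The selection step can be illustrated with a minimal sketch. It assumes the codec cue is already available as a per-pixel residual map, and it uses the summed absolute residual per patch as a stand-in score. The function name `select_patches`, the 16-pixel patch size, and the scoring rule are hypothetical choices made for illustration; the section does not specify how OV-Encoder combines HEVC motion vectors and residuals.

```python
def select_patches(residual, patch=16, keep_ratio=0.25):
    """Return (row, col) grid indices of the patches with the largest residual energy.

    residual: list of rows (H x W) of per-pixel residual values; H and W must be
    multiples of `patch`. The energy proxy (sum of absolute residuals) is an
    assumption of this sketch, not the paper's stated selection rule.
    """
    h, w = len(residual), len(residual[0])
    gh, gw = h // patch, w // patch
    energy = {}
    for r in range(gh):
        for c in range(gw):
            energy[(r, c)] = sum(
                abs(residual[y][x])
                for y in range(r * patch, (r + 1) * patch)
                for x in range(c * patch, (c + 1) * patch)
            )
    k = max(1, round(keep_ratio * gh * gw))
    # Highest energy first; ties are broken by raster order for determinism.
    ranked = sorted(energy, key=lambda rc: (-energy[rc], rc))
    return ranked[:k]


# Toy P-frame residual: a 64x64 frame that is static except for one changed block.
frame = [[0.0] * 64 for _ in range(64)]
for y in range(16, 32):
    for x in range(32, 48):
        frame[y][x] = 1.0

kept = select_patches(frame, patch=16, keep_ratio=0.25)
```

Here the 4x4 patch grid is reduced to four tokens, and the only patch with non-zero residual, at grid position (1, 2), is ranked first. Lowering `keep_ratio` moves toward the sparse end of the 3.1%–25% range quoted above.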

Abstract

Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet modern vision architectures have strayed from these truths: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., codecs.

Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%–25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D-RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics.

Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, OV-Encoder consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and less pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.
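
As a rough illustration of the cluster discrimination objective, the sketch below scores each embedding against every center in a global bank and applies softmax cross-entropy with the assigned cluster as the target. The cosine similarity, the temperature of 0.07, and the function names are assumptions made for this example; the exact loss used to train OV-Encoder is not given in this section.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cluster_discrimination_loss(embeddings, centers, labels, temperature=0.07):
    """Mean softmax cross-entropy of each embedding against ALL class centers.

    Unlike batch-local contrastive learning, the negatives are the other
    centers in the bank, not the other samples in the batch.
    """
    centers = [normalize(c) for c in centers]
    total = 0.0
    for emb, label in zip(embeddings, labels):
        emb = normalize(emb)
        logits = [sum(e * c for e, c in zip(emb, ctr)) / temperature for ctr in centers]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[label]
    return total / len(labels)


# Toy bank of three centers; a real bank would hold over a million concepts.
centers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
loss_correct = cluster_discrimination_loss(embeddings, centers, [0, 2])
loss_wrong = cluster_discrimination_loss(embeddings, centers, [1, 1])
```

The loss is near zero when each embedding sits next to its assigned center and grows large when the assignments are wrong. A bank at the scale described above would need a sampled or sharded softmax in practice, which this sketch does not attempt.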

Paper Structure (57 sections, 9 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Visual intelligence as codec-aligned predictive compression. Visual intelligence is cast as a compression problem, in which scalable learning emerges from alignment with the predictive structure of the world. Video exemplifies this principle: most visual content is redundant and predictable, while meaningful information arises sparsely as motion and residual change. Video codecs make this structure explicit by decomposing visual signals into stable spatial context and sparse temporal updates. Grounded in this codec principle, OV-Encoder reframes visual modeling as predictive compression, serving as a scalable engine for universal multimodal intelligence that sees, updates, and reasons over time.
  • Figure 2: Overview of the OneVision-Encoder framework. Left: Input formulation. The framework integrates three Codec Patchification strategies: Dense Video-Codec Patchification, Chunk-wise Patchification, and (Single-Image/Frame) Spatial Patchification. All inputs are processed by a shared-parameter OneVision-Encoder. Right: Unified cluster discrimination objective. Image and video embeddings are aligned through contrastive learning against a global set of class centers, jointly optimizing object-level and action-level representations within a single encoder.
  • Figure 3: Contrastive learning vs. cluster discrimination. (a) Standard contrastive learning contrasts samples against batch-local negatives, constraining the view of the embedding space. (b) Cluster discrimination contrasts samples against a global concept bank of clustered centers at scale, yielding discriminative and structurally separated representations.
  • Figure 4: 3D-RoPE for Codec Patchification. A unified relative positional encoding scheme is adopted for Codec Patchification. (a) encodes full spatiotemporal offsets $(\Delta t,\Delta x,\Delta y)$ over I/P-frame sequences to preserve motion-driven inter-frame structure. (b) defines temporal offsets at the chunk level, enabling structured reasoning under non-uniform temporal sampling. (c) degenerates the formulation to purely spatial offsets $(0,\Delta x,\Delta y)$ for static inputs. 3D-RoPE preserves structural consistency, enabling coherent attention over sparse and irregular token layouts.
  • Figure 5: Visualization of I- and P-frame decomposition in HEVC. I-frames retain complete spatial structure, whereas P-frames encode motion-compensated residuals highlighting motion. Bright areas denote high residual magnitudes, while dark areas indicate static content.
  • ...and 6 more figures
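
The relative-offset behaviour described in the Figure 4 caption can be checked with a small sketch. It assumes the feature channels are split into three equal groups, one per axis (t, x, y), each rotated with the standard rotary formula. That split, the base of 10000, and the helper names are assumptions for illustration; the caption does not state how OV-Encoder allocates channels across axes.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def rope_3d(vec, t, x, y):
    """Apply rotary encoding with one third of the channels per axis (assumed split)."""
    n = len(vec) // 3  # len(vec) must be a multiple of 6
    return rope_1d(vec[:n], t) + rope_1d(vec[n:2 * n], x) + rope_1d(vec[2 * n:], y)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6, 0.2, 0.8, -0.3, 0.4]
k = [0.6, 0.1, -0.8, 1.0, 0.5, -0.2, 0.3, 0.7, -0.9, 0.4, 0.2, -0.5]

# Two token pairs at different absolute positions but the same offset
# (delta_t, delta_x, delta_y) = (1, 2, 0).
score_a = dot(rope_3d(q, 5, 3, 2), rope_3d(k, 4, 1, 2))
score_b = dot(rope_3d(q, 9, 7, 0), rope_3d(k, 8, 5, 0))

# Static-image case: t = 0 leaves the temporal channel group unrotated.
image_token = rope_3d(q, 0, 3, 2)
```

`score_a` and `score_b` agree up to floating-point error, so the attention logit depends only on the offset between tokens and not on where they sit in the sparse layout. With t fixed at 0, the temporal group passes through unchanged, which matches the purely spatial case (c) in the caption.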