Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang

TL;DR

Penguin-VL is a compact VLM whose vision encoder is initialized from a text-only LLM. Experiments show that this Penguin-Encoder is a superior alternative to contrastively pretrained vision encoders, yielding higher visual fidelity and data efficiency for multimodal understanding.

Abstract

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL

Paper Structure

This paper contains 64 sections, 5 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Benchmark comparison across tasks. Left (Image): Image benchmarks grouped by capability: OCR, Math, and Knowledge. Right (Video): Video benchmarks grouped into Long-form & Temporal and General Understanding. Each sector corresponds to a benchmark, and the radial bars indicate relative performance, where longer bars mean better results. Overall, Penguin-VL 2B achieves excellent performance across all modalities, highlighting clear advantages over existing state-of-the-art open-source models.
  • Figure 2: Comparison of Vision Encoder Training Paradigms for VLMs. We contrast three distinct approaches for training the vision encoder: (a) Contrastive Training, which relies on a contrastive loss between image and text encoders and then freezes the vision encoder during VLM training. While the training logic is simple, this paradigm requires massive datasets, suffers from training instability, and often yields insufficient fine-grained multimodal alignment. (b) Direct LLM Supervision, which initializes the vision encoder with contrastive pre-trained weights and aligns visual features to a frozen LLM (via language modeling loss). While this enables direct alignment with the LLM feature space, it is highly sensitive to data quality and prone to overfitting to the training image distribution. (c) Penguin-Encoder Training (Ours), which combines the advantages of the previous methods by initializing the vision encoder directly with the weights of a text LLM. This approach ensures the starting model distributions are close, which facilitates alignment, while equipping the vision model with rich initial linguistic knowledge and enabling simple, efficient scaling of vision parameters.
  • Figure 3: Method Overview. Penguin-VL adopts a unified architecture for vision understanding. Vision: The vision encoder is initialized from a text-only LLM (Qwen3-0.6B) and equipped with 2D-RoPE and bidirectional attention. To handle long video contexts efficiently, we employ a Temporal Redundancy-Aware (TRA) strategy that dynamically allocates token budgets among key frames and intermediate frames.
  • Figure 4: Multi-granularity video annotation. This figure illustrates the alignment between visual content and textual descriptions across three temporal scales: Dense time-level, Paragraph-level, and Video-level. Semantic entities are color-coded to distinguish Subjects, Actions, Objects & Details, Spatial Context, Mood & Tone, and Knowledge & Text.
  • Figure 5: Data formats for different data types. ❶ For an image sequence, we use "\n" to separate image tokens from different images; ❷ For a video sequence, we use "Time: xxs" to indicate the timestamp of each frame, "," to separate different frames, and "\n" to separate tokens from different videos; ❸ For a streaming video sequence, videos and texts are organized in an interleaved format.
  • ...and 9 more figures
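
The image and video sequence formats described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `<frame>` placeholder, the function names, and the placement of the timestamp directly before each frame's tokens are assumptions made for illustration. Only the separators ("\n" and ",") and the "Time: xxs" prefix come from the caption.

```python
from typing import List

# Hypothetical stand-in for the visual tokens of one image or frame;
# the caption does not specify the actual token strings.
FRAME_PLACEHOLDER = "<frame>"


def format_image_sequence(num_images: int) -> str:
    """Separate the tokens of different images with a newline."""
    return "\n".join(FRAME_PLACEHOLDER for _ in range(num_images))


def format_video(timestamps: List[float]) -> str:
    """Prefix each frame with 'Time: xxs' and separate frames with commas.

    Assumption: the timestamp comes immediately before the frame tokens.
    """
    return ",".join(f"Time: {t:g}s{FRAME_PLACEHOLDER}" for t in timestamps)


def format_video_sequence(videos: List[List[float]]) -> str:
    """Separate the tokens of different videos with a newline."""
    return "\n".join(format_video(ts) for ts in videos)


if __name__ == "__main__":
    # Two videos: one with frames at 0s and 2s, one with a single frame at 1.5s.
    print(format_video_sequence([[0.0, 2.0], [1.5]]))
```

The streaming format (❸) would interleave such video segments with text turns. The caption gives no further detail on that layout, so it is left out of the sketch.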