FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

Peisen Zhao; Xiaopeng Zhang; Mingxing Xu; Ruoyu Sun; Zewei Du; Dunzheng Wang; Guanghao Zheng; Haohang Xu; Zhibo Zhang; Yuhang Zhang; Yi Ai; Lin Liu; Qi Tian

Abstract

While Multimodal Large Language Models (MLLMs) have advanced rapidly, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks because low-resolution pretraining and reliance on noisy, coarse web-crawled image-text pairs cause a loss of visual detail. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm: first, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset, which comprises over 450 million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT can serve as a powerful new baseline for fine-grained visual perception.

Paper Structure (22 sections, 3 equations, 15 figures, 9 tables)

Introduction
Related Work
Visual Foundation Models
Vision Models Evolution in MLLMs
Methods
Model Architecture
Training Recipe
Training Data
Experiments
Experimental Setup
Main Results
Ablation Study
Conclusion
Data Statistics
Data Statistics of Stage II: 1.56B images
...and 7 more sections

Figures (15)

Figure 1: (a) Construction of rich QA training data from FineCap-450M, highlighting detailed local-level tasks. (b) FineViT performance boost in multimodal tasks enabled by FineCap-450M local-task pre-training.
Figure 2: Illustration of FineViT training stages. Stage I utilizes a Masked Image Modeling (MIM) loss to establish foundational visual perception. Stage II implements a SigLIP loss via large-scale image-text contrastive learning to bridge the gap between visual features and semantic concepts. Finally, Stage III integrates a Large Language Model (LLM) at high resolution to train the model on global and local Question-Answering (QA) tasks, ensuring a robust sensitivity to fine-grained visual details.
Figure 3: The pipeline of data curation and recaption.
Figure 4: Illustration of long-text zero-shot retrieval. Distinctive attributes within the long caption are highlighted in orange to help identify the correct image.
Figure 5: Performance distribution: unfrozen vs frozen FineViT.
...and 10 more figures
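
The Figure 2 caption names a SigLIP loss for Stage II image-text contrastive learning. As a rough illustration only, the sketch below computes the pairwise sigmoid loss in its generally published form: every image-text pair in a batch is scored independently, with matching pairs labelled +1 and all other pairs labelled -1. The function names, the toy embeddings, and the default `temperature` and `bias` values are assumptions chosen for illustration; nothing here is taken from the FineViT implementation, whose details are not given in this excerpt.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image and text embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    temperature and bias are fixed here (hypothetical defaults); in practice
    they are usually learnable scalars.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(label * logit)
    # Average the summed pair losses over the number of images in the batch.
    return -total / len(images)


# Toy batch: two orthogonal "image" embeddings and their "text" counterparts.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = sigmoid_pairwise_loss(images, aligned_texts)
swapped_loss = sigmoid_pairwise_loss(images, swapped_texts)
print(f"aligned: {aligned_loss:.4f}  swapped: {swapped_loss:.4f}")
```

On the toy batch, the correctly paired texts yield a much lower loss than the swapped ones, which is the behaviour a contrastive objective of this kind is meant to reward. Because each pair is scored independently, the loss needs no softmax normalization across the batch.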
