VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization

Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao, Le Tian, Xiao Zhou, Jie Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie

TL;DR

VersaViT addresses the observation that vision encoders in multimodal large language models excel at language-grounded tasks but lag on dense, pixel-level perception. It introduces a lightweight, multi-task post-training framework that augments a shared vision backbone with three task heads for VQA/captioning, monocular depth estimation, and image referring segmentation, enabling multi-granularity supervision without full retraining. Across extensive experiments, VersaViT improves VQA performance while substantially boosting dense-feature representations, demonstrating that dense-friendly optimization can be synergistic with language grounding when mediated by lightweight heads. The resulting versatile vision foundation model offers enhanced semantic alignment and pixel-level understanding, with practical benefits for retrieval, 3D reasoning, and cross-task transfer in vision-language systems.

Abstract

Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.

Paper Structure

This paper contains 31 sections, 8 equations, 4 figures, and 15 tables.

Figures (4)

  • Figure 1: Overcoming the dense feature limitation of vision backbones within MLLMs. The vision encoder within MLLMs (Top) yields strong VQA performance but suboptimal dense features. In contrast, our multi-task collaborative post-training (Bottom) is designed to overcome this limitation by comprehensively enhancing the vision encoder's capabilities.
  • Figure 2: Linear probing performance on depth estimation (NYUv2) and semantic segmentation (ADE20K) across different vision backbones. The results reveal that the Qwen-VL ViTs exhibit suboptimal performance on these vision-centric benchmarks.
  • Figure 3: Overview of the proposed multi-task collaborative training framework. The proposed framework jointly trains three distinct tasks: VQA and Image Captioning, Monocular Depth Estimation, and Image Referring Segmentation. By incorporating lightweight task heads, this collaborative training strategy is designed to enhance the representational capabilities of the underlying vision backbone.
  • Figure 4: Qualitative examples. Qualitative comparison of VersaViT against Qwen2-VL-ViT across three tasks. For VQA, we evaluate using Qwen3-8B. The results for semantic segmentation and depth estimation are obtained through linear probing. As shown, our method outperforms the baseline on all three tasks.
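
The collaborative training pattern described for Figure 3 (one shared backbone whose features feed several lightweight task heads, with the per-task losses combined into a single objective) can be illustrated with a toy sketch. This is not the authors' implementation: the functions `backbone`, `pooled_head`, `dense_head`, the `Targets` container, the squared-error losses, and the loss weights are all hypothetical stand-ins chosen so the example runs with the standard library only. In the actual framework the backbone is a vision transformer and the heads are neural modules.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Targets:
    """Toy supervision for one image (every field is illustrative)."""
    caption: float             # image-level target, stand-in for language supervision
    depth: List[float]         # one depth value per patch
    segmentation: List[float]  # one mask value per patch


def backbone(patches: Sequence[float], scale: float) -> List[float]:
    """Hypothetical shared backbone: a single parameter `scale` applied per patch."""
    return [scale * p for p in patches]


def pooled_head(features: Sequence[float]) -> float:
    """Image-level head (stand-in for the VQA/captioning branch): mean pooling."""
    return sum(features) / len(features)


def dense_head(features: Sequence[float]) -> List[float]:
    """Per-patch head (stand-in for depth or segmentation): identity mapping."""
    return list(features)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_task_loss(patches: Sequence[float], scale: float,
                    targets: Targets, weights: Dict[str, float]) -> float:
    """Weighted sum of per-task losses computed from one shared feature pass."""
    features = backbone(patches, scale)  # computed once, reused by every head
    losses = {
        "caption": (pooled_head(features) - targets.caption) ** 2,
        "depth": mse(dense_head(features), targets.depth),
        "segmentation": mse(dense_head(features), targets.segmentation),
    }
    return sum(weights[name] * value for name, value in losses.items())


example_targets = Targets(caption=2.0, depth=[1.0, 3.0], segmentation=[0.0, 3.0])
example_weights = {"caption": 1.0, "depth": 1.0, "segmentation": 1.0}
print(multi_task_loss([1.0, 3.0], 1.0, example_targets, example_weights))  # 0.5
```

Because all three losses depend on the same `features`, any update to the backbone parameter is driven by image-level and per-patch supervision at once, which is the sense in which the supervision is multi-granularity.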