VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization
Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao, Le Tian, Xiao Zhou, Jie Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie
TL;DR
VersaViT starts from the observation that vision encoders in multimodal large language models (MLLMs) excel at language-grounded tasks but lag on dense, pixel-level perception. It introduces a lightweight multi-task post-training framework that attaches three task heads to a shared vision backbone, covering VQA/captioning, monocular depth estimation, and referring image segmentation, so the backbone receives multi-granularity supervision without full retraining. In the reported experiments, VersaViT improves VQA performance while substantially strengthening dense feature representations, which suggests that optimizing for dense prediction can complement language grounding when mediated by lightweight heads. The resulting vision foundation model offers stronger semantic alignment and pixel-level understanding, with practical benefits for retrieval, 3D reasoning, and cross-task transfer in vision-language systems.
Abstract
Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
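To make the shared-backbone idea concrete, the following is a minimal toy sketch, not the authors' implementation. It uses plain Python lists in place of a real vision transformer, and every name, dimension, and weight in it (`forward`, `total_loss`, `NUM_PATCHES`, the linear stand-ins for the backbone and heads, and the weighted-sum loss) is a hypothetical choice made for illustration. The only thing it is meant to show is the data flow: one set of backbone features is computed once and routed to three lightweight heads that supervise it at different granularities.

```python
import random

random.seed(0)

NUM_PATCHES = 16  # hypothetical: a 4x4 grid of image patches
IN_DIM = 12       # hypothetical raw patch vector size
FEAT_DIM = 8      # hypothetical backbone feature width
LM_DIM = 6        # hypothetical language-model embedding width


def random_matrix(rows, cols):
    """Build a rows x cols matrix of small random values."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# One shared backbone and small per-task heads (all linear stand-ins).
backbone_w = random_matrix(FEAT_DIM, IN_DIM)
language_w = random_matrix(LM_DIM, FEAT_DIM)
depth_w = random_matrix(1, FEAT_DIM)


def forward(patches, text_query):
    """Compute shared features once, then route them to three task heads."""
    feats = [matvec(backbone_w, p) for p in patches]
    return {
        # VQA/captioning: project each token into the language model's space.
        "language": [matvec(language_w, f) for f in feats],
        # Monocular depth: one scalar prediction per patch.
        "depth": [matvec(depth_w, f)[0] for f in feats],
        # Referring segmentation: per-patch mask logit from similarity
        # to a text query embedding.
        "segmentation": [sum(a * b for a, b in zip(f, text_query)) for f in feats],
    }


def total_loss(losses, weights):
    """Weighted sum of per-task losses (an assumed combination rule)."""
    return sum(weights[name] * value for name, value in losses.items())


patches = random_matrix(NUM_PATCHES, IN_DIM)
text_query = [random.uniform(-1, 1) for _ in range(FEAT_DIM)]
outputs = forward(patches, text_query)
```

In a real training setup, each head's loss would be backpropagated through the shared features, so all three tasks update the same backbone. How the per-task losses are actually balanced is not stated in this section, so the weighted sum above should be read as a placeholder.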
