ParGo: Bridging Vision-Language with Partial and Global Views
An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Can Huang, Jingqun Tang, Wei-Shi Zheng
TL;DR
ParGo addresses vision-language alignment in Multimodal Large Language Models (MLLMs) with a Partial-Global projector that combines partial and global image views within a fixed visual token budget. The architecture comprises a Partial-Global Perception block and a Cascaded Partial Perception block, which capture both holistic context and inter-part relations, and it is trained with ParGoCap-1M-PT, a new dataset of detail-captioned images. On four MLLM benchmarks, ParGo substantially outperforms linear and attention-based projectors and achieves large gains on detail-perception tasks, including a 259.96-point improvement on MME over Q-Former. The results also show strong generalization across LLMs and support the importance of detail-rich training data, pointing to a practical path toward finer-grained multimodal understanding.
Abstract
This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, ParGo bridges the representation gap between separately pre-trained vision encoders and LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of ParGo and its superiority in aligning the vision and language modalities. Compared with the conventional Q-Former projector, ParGo achieves an improvement of 259.96 on the MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly on tasks that emphasize detail perception.
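To make the idea of mixing partial and global views under a fixed token budget concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a fixed set of query tokens that cross-attend to image patch features, where each "partial" token is masked to one contiguous slice of the flattened patch sequence and each "global" token sees every patch. The function names, the slicing scheme, and all sizes are hypothetical; the abstract does not specify how the Partial-Global Perception block or the Cascaded Partial Perception block is built.

```python
import numpy as np


def build_partial_global_mask(num_patches, num_partial, num_global):
    """Return a boolean mask of shape (num_partial + num_global, num_patches).

    True means the query token may attend to that patch.
    """
    if not 0 < num_partial <= num_patches:
        raise ValueError("need 0 < num_partial <= num_patches")
    mask = np.zeros((num_partial + num_global, num_patches), dtype=bool)
    # Assumption: each partial token sees one contiguous slice of patches.
    bounds = np.linspace(0, num_patches, num_partial + 1).astype(int)
    for i in range(num_partial):
        mask[i, bounds[i]:bounds[i + 1]] = True
    # Global tokens see the whole image.
    mask[num_partial:, :] = True
    return mask


def masked_cross_attention(queries, patches, mask):
    """Single-head cross-attention without learned projections (illustrative)."""
    scores = queries @ patches.T / np.sqrt(queries.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # block disallowed patches
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ patches, weights


# Toy sizes: 16 patches of dimension 8, budget of 4 partial + 2 global tokens.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
queries = rng.normal(size=(6, 8))
mask = build_partial_global_mask(16, num_partial=4, num_global=2)
tokens, weights = masked_cross_attention(queries, patches, mask)
print(tokens.shape)  # (6, 8)
```

The number of output tokens depends only on the number of queries, not on the number of patches, which is what keeps the visual token budget fixed in this sketch. A real projector would also use learned projections, multiple heads, and 2D windows instead of 1D slices.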
