ParGo: Bridging Vision-Language with Partial and Global Views

An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Can Huang, Jingqun Tang, Wei-Shi Zheng

TL;DR

ParGo tackles the challenge of aligning vision and language in Multimodal Large Language Models by introducing a Partial-Global projector that blends partial and global image views within a fixed visual token budget. The architecture includes a Partial-Global Perception block and a Cascaded Partial Perception block to capture both holistic context and inter-part relations, trained with a novel ParGoCap-1M-PT dataset of detail-captioned images. Empirical results on four MLLM benchmarks show that ParGo substantially outperforms linear and attention-based projectors and achieves large gains in detail-perception tasks, including a notable 259.96-point improvement on MME over Q-Former. The work demonstrates strong generalization across LLMs and supports the importance of detail-rich training data, offering a practical path to more nuanced multimodal understanding in real-world systems.

Abstract

This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, ParGo bridges the representation gap between separately pre-trained vision encoders and LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of ParGo in aligning the vision and language modalities. Compared to the conventional Q-Former projector, ParGo achieves an improvement of 259.96 on the MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly in tasks that emphasize detail perception ability.

Paper Structure (31 sections, 2 equations, 3 figures, 8 tables)


Figures (3)

  • Figure 1: Illustration of global and partial information. An image can be properly described by these two kinds of information. Globally, this image shows a landscape of an ancient stone fortress. Delving into the partial information, two individuals stand atop the fortress, the wooden gate at the bottom of the fortress is partially open, and so forth.
  • Figure 2: (a) The pipeline of an MLLM with our proposed ParGo as the vision-language projector. First, we use a frozen image encoder to extract image features. To better align the pre-trained visual encoder with the LLM, we propose a Partial-Global projector that projects the image features using two kinds of tokens, i.e., partial and global tokens. Finally, the output partial and global visual tokens, together with the tokenized text, are fed into the LLM to generate the text output in an auto-regressive manner. Specifically, each Partial-Global projector layer contains a Partial-Global Perception block that uses the two kinds of tokens to extract the image features. Additionally, to fully consider the relations between different partial regions in an image, a Cascaded Partial Perception block is incorporated to enable interactions between partial tokens in a cascaded manner. (b) A demonstration of the Partial-Global and Cascaded Partial Attention masks. Note that the Partial-Global Attention mask remains the same across layers, while the Cascaded Partial Attention mask changes from layer to layer.
  • Figure 3: Case study on the proposed Partial-Global projector (ParGo). In this figure, we select 6 examples to illustrate the superiority of our proposed ParGo in aligning vision and language modalities.
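
The two attention masks described in the Figure 2 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the equal-sized contiguous split of patches among partial tokens, and the rule that the cascaded window widens by one neighbor per layer are all assumptions made for illustration. The caption only states that partial and global tokens see the image differently, that the Partial-Global mask is fixed across layers, and that the Cascaded Partial mask changes from layer to layer.

```python
import numpy as np


def partial_global_mask(num_partial, num_global, num_patches):
    """Boolean mask of shape (num_partial + num_global, num_patches).

    True means the query token may attend to that image patch.
    Hypothetical layout: partial tokens come first, then global tokens.
    """
    mask = np.zeros((num_partial + num_global, num_patches), dtype=bool)
    # Assumption: partial token i sees the i-th contiguous chunk of patches.
    bounds = np.linspace(0, num_patches, num_partial + 1).astype(int)
    for i in range(num_partial):
        mask[i, bounds[i]:bounds[i + 1]] = True
    # Global tokens see every patch.
    mask[num_partial:, :] = True
    return mask


def cascaded_partial_mask(num_partial, layer):
    """Boolean mask of shape (num_partial, num_partial).

    Assumption: in layer `layer` (0-indexed), partial token i may attend
    to partial token j when |i - j| <= layer, so the window grows with depth.
    """
    idx = np.arange(num_partial)
    return np.abs(idx[:, None] - idx[None, :]) <= layer


# Example: 4 partial tokens, 2 global tokens, 16 image patches.
pg_mask = partial_global_mask(4, 2, 16)
layer_masks = [cascaded_partial_mask(4, layer) for layer in range(3)]
```

In this sketch `pg_mask` is built once and reused in every layer, while `layer_masks` holds a different mask per layer, mirroring the distinction the caption draws between the two masks.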