PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures

Tianxiang Wu, Minxin Nie, Ziqiang Cao

TL;DR

This work proposes PIP-MM, a framework that incorporates prompt information into the visual encoding process using existing modules of MLLMs, and that maintains excellent generation results even when the number of visual tokens is halved.

Abstract

Multimodal Large Language Models (MLLMs) have activated the capabilities of Large Language Models (LLMs) in solving visual-language tasks by integrating visual information. The prevailing approach in existing MLLMs employs an image encoder to extract visual features, converts these features into visual tokens via an adapter, and then feeds them together with the prompt into the LLM. However, because image encoding is prompt-agnostic, the extracted visual features provide only a coarse description of the image and cannot focus on the requirements of the prompt. On the one hand, image features can easily lack information about the prompt-specified objects, resulting in unsatisfactory responses. On the other hand, the visual features contain a large amount of irrelevant information, which not only increases the memory burden but also degrades generation quality. To address these issues, we propose PIP-MM, a framework that Pre-Integrates Prompt information into the visual encoding process using existing modules of MLLMs. Specifically, we use the frozen LLM in the MLLM to vectorize the input prompt, which summarizes the requirements of the prompt. We then pass the prompt vector through a trained Multi-Layer Perceptron (MLP) to align it with the visual input requirements, and use the result to replace the class embedding in the image encoder. Since our model only requires adding a trainable MLP, it can be applied to any MLLM. To validate the effectiveness of PIP-MM, we conducted experiments on multiple benchmarks. Automated evaluation metrics and manual assessments demonstrate the strong performance of PIP-MM. Particularly noteworthy is that our model maintains excellent generation results even when the number of visual tokens is reduced by half.
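The data flow described in the abstract (prompt vector, then MLP, then replacement of the class embedding) can be illustrated with a minimal, shape-level sketch. This is not the authors' implementation. It uses only the Python standard library, and every name, dimension, and the ReLU activation in it is a hypothetical choice made for illustration. The prompt vector is a random stand-in, since the abstract does not say how the frozen LLM produces it.

```python
import random


def init_linear(rng, in_dim, out_dim):
    """Create a randomly initialised linear layer (weight: out_dim x in_dim)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, layer):
    """Apply y = W x + b to a single vector x."""
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def prompt_to_cls(prompt_vec, mlp):
    """Map the LLM-side prompt vector into the image encoder's embedding space.

    Two layers with a ReLU in between: an assumption, not taken from the paper.
    """
    first, second = mlp
    hidden = [max(0.0, h) for h in linear(prompt_vec, first)]
    return linear(hidden, second)


def replace_class_embedding(encoder_input, prompt_cls):
    """Swap the class embedding (assumed to sit at index 0) for the prompt-derived one."""
    if len(prompt_cls) != len(encoder_input[0]):
        raise ValueError("prompt-derived embedding must match the encoder width")
    return [prompt_cls] + encoder_input[1:]


# Hypothetical toy sizes; real models use far larger dimensions.
LLM_DIM, HIDDEN_DIM, VIT_DIM, NUM_PATCHES = 8, 16, 4, 6
rng = random.Random(0)

# Stand-in for the frozen LLM's summary vector of the prompt.
prompt_vec = [rng.gauss(0.0, 1.0) for _ in range(LLM_DIM)]

# The only trainable component in this sketch: the alignment MLP.
mlp = (init_linear(rng, LLM_DIM, HIDDEN_DIM), init_linear(rng, HIDDEN_DIM, VIT_DIM))

# Stand-in for the image encoder's input: [class embedding, patch_1, ..., patch_N].
encoder_input = [[rng.gauss(0.0, 1.0) for _ in range(VIT_DIM)] for _ in range(1 + NUM_PATCHES)]

prompt_cls = prompt_to_cls(prompt_vec, mlp)
fused_input = replace_class_embedding(encoder_input, prompt_cls)
```

In this sketch the patch embeddings pass through unchanged and only the first position differs, which is why the approach needs no new module beyond the MLP. In a real encoder the replaced embedding would then interact with the patch tokens through self-attention.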

Paper Structure

This paper contains 18 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Performance of PIP-MM on multiple visual-language task benchmarks, compared with existing open-source SOTA models.
  • Figure 2: The performance of humans, some high-performing MLLMs, and PIP-MM under the confusion mode. Green highlights mark the parts where the image is correctly identified and the question is answered; red highlights mark the parts where the model answers incorrectly or fails to recognize the image.
  • Figure 3: The comparison of the mainstream approach for integrating prompts in current MLLMs and PIP-MM. Classic architectures, such as InstructBLIP, do not integrate textual information during the visual encoding process; instead, they incorporate it within an Adapter, which fails to address the issue of missing visual information. In contrast, PIP-MM employs the inner LLM and an MLP layer to summarize the query information into a vector that replaces the image encoder's CLS token, achieving an early fusion of text and image.
  • Figure 4: Ablation experiments assessing the impact of training data. (a), (b), and (c) correspond to the results on MM-Vet, MME, and MMMU, respectively.
  • Figure 5: Attention visualization. The red box in the original image marks the object mentioned in the prompt. The highlighted region in the attention map shows the visual tokens that the model focuses on.