Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models

Yue Zhang, Hehe Fan, Yi Yang

TL;DR

The paper addresses the inefficiency of prompt-agnostic adapters in multimodal LLMs, where visual tokens are generated without regard to the concrete objects highlighted by the prompt. It introduces a prompt-aware adapter that uses global and local attention to dynamically embed visual inputs according to the prompt, aligning visual cues with textual semantics. Across COCO-QA and the MME benchmark, the approach yields significant improvements in perception and cognition tasks, with ablations confirming the complementary roles of global and local attention. The method reduces the cognitive load on LLMs and enhances robust visual reasoning in complex scenes, offering a practical path toward more reliable multimodal understanding.

Abstract

To bridge the gap between vision and language modalities, Multimodal Large Language Models (MLLMs) usually learn an adapter that converts visual inputs to understandable tokens for Large Language Models (LLMs). However, most adapters generate consistent visual tokens, regardless of the specific objects of interest mentioned in the prompt. Since these adapters distribute equal attention to every detail in the image and focus on the entire scene, they may increase the cognitive load for LLMs, particularly when processing complex scenes. To alleviate this problem, we propose prompt-aware adapters. These adapters are designed with the capability to dynamically embed visual inputs based on the specific focus of the prompt. Specifically, prompt-aware adapters utilize both global and local textual features to capture the most relevant visual clues from the prompt at both coarse and fine granularity levels. This approach significantly enhances the ability of LLMs to understand and interpret visual content. Experiments on various visual question answering tasks, such as counting and position reasoning, demonstrate the effectiveness of prompt-aware adapters.

Paper Structure

This paper contains 21 sections, 6 equations, 26 figures, and 14 tables.

Figures (26)

  • Figure 1: Illustration comparing prompt-unaware and prompt-aware adapters. Left: A prompt-unaware adapter treats visual patches as a kind of word and directly converts these patches into "readable" tokens for LLMs, without considering the specific objects of interest. In this case, whether the question involves "pool" or "drinks," it consistently generates the same tokens and allocates equal attention to every detail in the scene, which may increase the cognitive load for LLMs. Right: A prompt-aware adapter leverages the prompt to collect the most relevant visual clues and generate adaptive tokens, thus enhancing the ability of LLMs to understand and interpret visual content.
  • Figure 2: Illustration comparing cross-attention adapters (a, b) with the proposed adapter (c). (a) Methods like VisionLLM (Wang et al., 2023) and Flamingo (Alayrac et al., 2022) utilize text features as queries and visual features as keys and values in cross-attention. This design assumes that each word in the prompt corresponds to specific regions. The number of converted visual tokens is equal to the number of text features. (b) InstructBLIP (Dai et al., 2024) first injects prompt information into learnable queries via self-attention, and then employs cross-attention. This design assumes that each learnable query corresponds to specific regions. The number of converted visual tokens is equal to the number of learnable queries. (c) Our adapter comprises global and local attention components. Due to the new attention calculation mechanism used in local attention, the number of converted visual tokens remains unchanged.
  • Figure 3: Illustration of the proposed prompt-aware adapter. The adapter consists of a global attention component and a local attention component. The global attention, integrated into the visual encoder, is designed to capture coarse-grained, prompt-related visual perceptions. Meanwhile, the local attention focuses on refining responses to specific, fine-grained regions of interest.
  • Figure 4: Visualization of prompt-aware global and local attention. Global attention spans the entire prompt content, while local attention concentrates predominantly on the specific object in question.
  • Figure 5: Qualitative results of the proposed method on diverse perception and cognition tasks.
  • ...and 21 more figures
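
The token-count behavior described in the Figure 2 caption can be checked with a small sketch. The NumPy code below is only an illustration under stated assumptions: it uses single-head scaled dot-product attention with no learned projections, random features, and made-up sizes (16 patches, 5 words, dimension 8). The first call mirrors setup (a), with text features as queries and visual features as keys and values. The second call is a hypothetical variant in which visual patches act as queries with a residual connection; it is not the paper's local attention, whose exact computation is not given here, and is included only to show one way the number of visual tokens can stay unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention, no learned projections (assumption).
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
num_patches, num_words, dim = 16, 5, 8            # illustrative sizes only
visual = rng.normal(size=(num_patches, dim))      # stand-in for visual patch features
text = rng.normal(size=(num_words, dim))          # stand-in for prompt word features

# Setup (a): text features as queries, visual features as keys and values.
# One output token per word, so the token count follows the prompt length.
tokens_a, weights_a = cross_attention(text, visual, visual)

# Hypothetical variant (NOT the paper's mechanism): visual patches as queries,
# text as keys and values, plus a residual so visual content is retained.
# One output token per patch, so the token count follows the image.
prompt_context, weights_alt = cross_attention(visual, text, text)
tokens_alt = visual + prompt_context

print(tokens_a.shape)    # (5, 8)
print(tokens_alt.shape)  # (16, 8)
```

In the first case the output length is tied to the prompt, so a short question yields few visual tokens regardless of image complexity; in the second it is tied to the number of patches, which matches the "remains unchanged" property the caption attributes to the proposed adapter.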