Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts

Honglin Li, Yuting Gao, Chenglu Zhu, Jingdong Chen, Ming Yang, Lin Yang

TL;DR

This work introduces Panther, an MLLM that closely adheres to user instructions and precisely locates targets of interest, with the finesse of a black panther. It comprises three integral components: Panther-VE, Panther-Bridge, and Panther-Decoder.

Abstract

Multimodal large language models (MLLMs) are rapidly closing the gap to human visual perception, yet they still lag behind in attending to subtle image details and in locating small objects precisely. Common schemes to tackle these issues include deploying multiple vision encoders or operating on the original high-resolution images. Few studies have used the textual instruction to improve the visual representation, so models lose focus in some vision-centric tasks, a phenomenon we term Amblyopia. In this work, we introduce Panther, an MLLM that closely adheres to user instructions and precisely locates targets of interest, with the finesse of a black panther. Specifically, Panther comprises three integral components: Panther-VE, Panther-Bridge, and Panther-Decoder. Panther-VE integrates user instruction information at the early stages of the vision encoder, thereby extracting the most relevant and useful visual representations. The Panther-Bridge module, equipped with powerful filtering capabilities, significantly reduces redundant visual information, leading to substantial savings in training costs. The Panther-Decoder is versatile and can be employed with any decoder-only LLM architecture. Experimental results, particularly on vision-centric benchmarks, demonstrate the effectiveness of Panther.

Paper Structure

This paper contains 33 sections, 13 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: A comparative analysis of visual feature heatmaps between LLaVA (center) and our advanced Panther (right) with deep text-instructed visual prompting. Panther distinctly enhances focused attention on the visual elements targeted by the instruction.
  • Figure 2: Comparison of MLLM architectures. (a) A unified MLLM (decoder-only) enables early fusion of instructions and images, preserving low-level image features before fusion, but lacks pretrained vision-language knowledge. (b) A typical MLLM (encoder-decoder) performs late fusion, leading to the Amblyopia issue, where important visual details may be filtered out. (c) Our Panther MLLM converts instructions into visual prompts that guide the visual encoder to extract instruction-aware features. Panther strikes a balance between enhancing instruction-specific visual features and retaining the knowledge acquired from pre-training.
  • Figure 3: The overall framework. (a) Panther MLLM instruction tuning on multi-turn visual QA data; the visual tokens are generated via Panther-VE and pruned via Panther-Bridge. (b) Panther-VE, which introduces instruction guidance on visual features, generates instruction-aware visual prompts to focus on specific image details. (c) Panther-Bridge for multi-turn visual token pruning; only the two-turn case is shown as an example.
  • Figure 4: Qualitative comparisons for representative scenarios. Purple marks the key objects that the user's instruction asks about, while blue denotes the correct answer.
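
The two-stage flow described for Figure 3 (instruction-aware visual prompts, followed by visual token pruning) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names, the linear projection from an instruction embedding to prompt tokens, and the cosine-similarity top-k pruning are hypothetical stand-ins for Panther-VE and Panther-Bridge, whose actual designs are not specified in this summary.

```python
import numpy as np


def make_visual_prompts(instr_emb, proj, num_prompts):
    """Map one instruction embedding to `num_prompts` prompt tokens.

    Hypothetical stand-in for the prompt generator in Panther-VE.
    instr_emb: (d_text,), proj: (d_text, num_prompts * d_vis).
    """
    d_vis = proj.shape[1] // num_prompts
    return (instr_emb @ proj).reshape(num_prompts, d_vis)


def prepend_prompts(patch_tokens, prompts):
    """Place prompt tokens in front of the image patch tokens so a
    vision encoder could attend to them from its early layers."""
    return np.concatenate([prompts, patch_tokens], axis=0)


def prune_tokens(tokens, query, keep):
    """Keep the `keep` tokens most similar (cosine) to `query`.

    Hypothetical stand-in for the filtering done by Panther-Bridge.
    Returns the kept tokens and their original indices, in order.
    """
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    scores = t @ q
    idx = np.sort(np.argsort(-scores)[:keep])  # restore original order
    return tokens[idx], idx
```

With 16 patch tokens and 2 prompt tokens, `prepend_prompts` yields an 18-token sequence; calling `prune_tokens(..., keep=4)` on the patch tokens then leaves 4 of them for the language model. The similarity criterion here is only one plausible choice for scoring relevance.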