Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

Songtao Jiang; Yan Zhang; Chenyi Zhou; Yeying Jin; Yang Feng; Jian Wu; Zuozhu Liu

TL;DR

This work introduces VTPrompt, a joint visual and text prompting framework designed to enhance object-centric perception in multimodal large language models. By extracting key concepts from questions, generating targeted visual prompts via a detector, and guiding answer synthesis with structured text prompts, VTPrompt improves object localization, spatial relations, and attribute understanding in VQA. Experiments on MME, MMB, and POPE show substantial gains for GPT-4V and Gemini Pro, achieving state-of-the-art performance on MMB and large improvements on MME. The results demonstrate that coordinated visual grounding and textual guidance can significantly bridge the gap toward human-level perception in multimodal reasoning, while also revealing areas for further robustness and grounding accuracy improvements.

Abstract

Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA), particularly in object-oriented perception tasks, which demand fine-grained understanding of object identities, locations, or attributes, as indicated by empirical findings. This is mainly due to their limited capability to effectively integrate complex visual cues with textual information and potential object hallucinations. In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception. VTPrompt merges visual and text prompts to extract key concepts from textual questions and employs a detection model to highlight relevant objects as visual prompts in images. The processed images alongside text prompts are subsequently fed into MLLMs to produce more accurate answers. Our experiments with GPT-4V and Gemini Pro on three benchmarks, i.e., MME, MMB, and POPE, demonstrate significant improvements. In particular, our method led to a score improvement of up to 183.5 for GPT-4V on MME and enhanced MMB performance by 8.17% for GPT-4V and 15.69% for Gemini Pro.
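The three stages named in the abstract (key concept extraction, visual prompt generation with a detector, and a text prompt for answer generation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the keyword-matching extractor, the `fake_detector` stub, and the prompt wording are hypothetical and are not taken from the paper. Visual prompts are kept as labeled boxes rather than drawn onto an image, and the final call to an MLLM is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class VisualPrompt:
    marker_id: int  # numeric tag that would be drawn next to the box
    label: str      # key concept the box belongs to
    box: Box


def extract_key_concepts(question: str, vocabulary: List[str]) -> List[str]:
    """Toy stand-in for key concept extraction (assumption: keyword matching).

    Returns the vocabulary entries that occur as words in the question,
    in vocabulary order.
    """
    words = {w.strip("?.,!").lower() for w in question.split()}
    return [v for v in vocabulary if v.lower() in words]


def generate_visual_prompts(
    concepts: List[str], detector: Callable[[str], List[Box]]
) -> List[VisualPrompt]:
    """Ask a detector for boxes per concept and assign sequential marker ids."""
    prompts: List[VisualPrompt] = []
    for concept in concepts:
        for box in detector(concept):
            prompts.append(VisualPrompt(len(prompts) + 1, concept, box))
    return prompts


def build_text_prompt(question: str, prompts: List[VisualPrompt]) -> str:
    """Compose a text prompt that points the model at the marked objects."""
    lines = [
        f"Marker {p.marker_id}: '{p.label}' inside box {p.box}." for p in prompts
    ]
    lines.append("Use the marked objects to answer the question.")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical detector stub: a lookup table instead of a real detection model.
_FAKE_DETECTIONS: Dict[str, List[Box]] = {
    "dog": [(10, 20, 110, 220)],
    "cat": [(300, 40, 420, 200)],
}


def fake_detector(concept: str) -> List[Box]:
    return _FAKE_DETECTIONS.get(concept, [])
```

In a real setting, the marked image and the string returned by `build_text_prompt` would be sent together to the MLLM; the stub detector would be swapped for an actual detection model.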

Paper Structure (14 sections, 8 equations, 13 figures, 5 tables)

Introduction
Method
VQA
Key Concepts Extraction
VPrompt Generation
TPrompt for Answer Generation
Experiment Setup
Main Results
Analysis
Related Work
VQA with MLLMs
Visual Perception With MLLMs
Conclusion
Example Appendix

Figures (13)

Figure 1: Performance of Gemini Pro \cite{geminiteam2023gemini} on MMB \cite{liu2023mmbench}. The inferior performance on the three object-oriented tasks (left-most) can be boosted with our VTPrompt. We also present the results based on GPT-4V in Appendix Figure \ref{fig:mmb4v}.
Figure 2: (a) Regular VQA with GPT-4V generating wrong answers. (b-c) Pipeline of our VTPrompt. Separate markers in the figure denote Key Concepts Extraction (Section \ref{sec:2.21}) and VPrompt Generation (Section \ref{sec:2.3}). The image with visual markers generated in (b) is processed in (c), which covers TPrompt for Answer Generation (Section \ref{sec:2.4}): the image enhanced with visual prompts and the text prompt are combined and fed into GPT-4V to produce the answers.
Figure 3: Performance of GPT-4V and Gemini Pro on Object-Oriented Perception Tasks in MMB and MME Benchmarks.
Figure 4: Distribution of wrong cases.
Figure 5: Wrong cases caused by incorrect information extraction from visual prompts. The figure shows errors where models fail to extract the correct information even though the objects are accurately marked. It compares results obtained with the original prompts against those obtained with the TPrompts from our method. Red marks where the models err, and green marks correct interpretations. The comparison highlights how our optimized prompts improve the models' ability to use visual prompts to reach correct answers.
...and 8 more figures
