MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

TL;DR

This work proposes the Mutually Reinforced Multimodal Large Language Model (MR-MLLM), a novel framework that synergistically enhances visual perception and multimodal comprehension, together with a perception-enhanced cross-modal integration method.

Abstract

In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding, and vision perception models usually suffer from open-world distribution shifts due to their limited model capacity. To overcome these challenges, we propose the Mutually Reinforced Multimodal Large Language Model (MR-MLLM), a novel framework that synergistically enhances visual perception and multimodal comprehension. First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models, enhancing multimodal comprehension and vision perception synergistically. Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs, like object detection bounding boxes, to capture subtle visual elements, thus enriching the understanding of both visual and textual data. In addition, an innovative perception-embedded prompt generation mechanism is proposed to embed perceptual information into the language model's prompts, aligning the responses contextually and perceptually for a more accurate multimodal interpretation. Extensive experiments demonstrate MR-MLLM's superior performance in various multimodal comprehension and vision perception tasks, particularly those requiring corner case vision perception and fine-grained language comprehension.
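
The perception-embedded prompt generation idea can be illustrated with a minimal sketch: detection outputs are rendered into a textual template and placed ahead of the question in the language model's prompt. Everything below is an assumption made for illustration, not the authors' implementation: the `Detection` type, the `build_perception_prompt` helper, the template wording, the normalized box format, and the confidence threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detector output (hypothetical structure for this sketch)."""
    label: str
    score: float
    # Assumed box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
    box: Tuple[float, float, float, float]


def build_perception_prompt(
    detections: List[Detection],
    question: str,
    score_threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> str:
    """Render detections into text and prepend them to the question."""
    # Keep confident detections only, most confident first.
    kept = [d for d in detections if d.score >= score_threshold]
    kept.sort(key=lambda d: d.score, reverse=True)

    if kept:
        lines = [
            f"- {d.label} at [{d.box[0]:.2f}, {d.box[1]:.2f}, "
            f"{d.box[2]:.2f}, {d.box[3]:.2f}] (confidence {d.score:.2f})"
            for d in kept
        ]
        perception = "Detected objects:\n" + "\n".join(lines)
    else:
        perception = "Detected objects: none"

    # The perceptual text precedes the question so the model can condition on it.
    return f"{perception}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    example = [
        Detection("dog", 0.90, (0.10, 0.20, 0.50, 0.80)),
        Detection("cat", 0.30, (0.00, 0.00, 1.00, 1.00)),
    ]
    print(build_perception_prompt(example, "What animal is on the left?"))
```

In a real system the resulting string would then be tokenized by the language model's tokenizer; this sketch stops at the text stage.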

Paper Structure (20 sections, 8 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Limitations of current MLLMs. (1) Limited interaction between two modalities. (2) Acquisition of instance-level descriptors. (3) The enhancement of visual tasks using linguistic knowledge remains under-explored. The caption above is generated by LLaMA-Adapter V2 [gao2023llamaadapter]. It fails to perceive fine-grained details within the images.
  • Figure 2: Limitations of perception models in corner cases. The detection outcomes above come from DETR [DETR]. It lacks generalization capabilities and world knowledge in scene interpretation.
  • Figure 3: Capabilities of MR-MLLM. Through unified-format visual instruction tuning, MR-MLLM is capable of a range of vision-language tasks like Image Captioning and Visual Question Answering (VQA). It refines the detection outcomes of object detection heads, leveraging the world knowledge and generalization capabilities of LLMs to assist vision tasks. Concurrently, this refinement process also enhances the model's fine-grained perception abilities.
  • Figure 4: Pipeline of MR-MLLM. (a) General Pipeline. We employ a detection head encoder and a pre-trained CLIP encoder to extract object descriptors and scene descriptors from images, respectively. Queries containing semantic information at different scales from these encoders are aligned with adapter queries via an MLP layer and then added to them, infusing visual modality information into LLaMA. During the training process, the original parameters of LLaMA are frozen. (b) Perception Forward Block. The fine-grained object information output by the transformer-based object detection head is converted into a textual template, which is then tokenized into queries by the tokenizer of LLaMA. (c) Visual Forward Block. We introduce learnable shared queries to bridge the gap between visual perception and multimodal comprehension. Shared queries are updated during training.
  • Figure 5: MR-MLLM vs GPT-4V [gpt4llm]. Owing to the instance-level object descriptors provided by the object detection head, MR-MLLM performs better than the mighty GPT-4V in some contexts involving spatial reasoning and fine-grained object perception.
  • ...and 3 more figures
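
The fusion step described in the Figure 4 caption (visual queries aligned to the adapter queries through an MLP, then added to them) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the paper's implementation: the two-layer MLP with ReLU, the dimensions, the random initialization, and the one-to-one pairing of visual and adapter queries are all hypothetical choices, and the helper names (`make_mlp`, `mlp`, `fuse_queries`) are invented for this example.

```python
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map; weight has shape (d_out, d_in)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def make_mlp(d_in: int, d_hidden: int, d_out: int,
             rng: random.Random) -> Dict[str, list]:
    """Randomly initialized two-layer MLP (depth and init are assumptions)."""
    def init(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return {
        "w1": init(d_hidden, d_in), "b1": [0.0] * d_hidden,
        "w2": init(d_out, d_hidden), "b2": [0.0] * d_out,
    }


def mlp(x: Vector, params: Dict[str, list]) -> Vector:
    hidden = relu(linear(x, params["w1"], params["b1"]))
    return linear(hidden, params["w2"], params["b2"])


def fuse_queries(adapter_queries: Matrix, visual_queries: Matrix,
                 params: Dict[str, list]) -> Matrix:
    """Project each visual query to the adapter width and add it element-wise."""
    if len(adapter_queries) != len(visual_queries):
        raise ValueError("expected one visual query per adapter query")
    fused = []
    for adapter_q, visual_q in zip(adapter_queries, visual_queries):
        projected = mlp(visual_q, params)  # alignment step
        fused.append([a + p for a, p in zip(adapter_q, projected)])  # addition step
    return fused


if __name__ == "__main__":
    rng = random.Random(0)
    params = make_mlp(d_in=4, d_hidden=8, d_out=6, rng=rng)
    adapter = [[0.0] * 6 for _ in range(3)]           # 3 adapter queries, width 6
    visual = [[1.0, 2.0, 3.0, 4.0] for _ in range(3)]  # 3 visual queries, width 4
    fused = fuse_queries(adapter, visual, params)
    print(len(fused), len(fused[0]))
```

In practice this would be written with a tensor library and trained end to end with the language model's original parameters frozen, as the caption states; the list-based version only shows the data flow.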