Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

Xin He; Longhui Wei; Lingxi Xie; Qi Tian

TL;DR

This work tackles the information loss challenge in multimodal large language models by introducing Incorporating Visual Experts (IVE), a mixture-of-experts framework that augments visual perception through three encoders (semantic, low-level, and document-related) and a structural knowledge enhancement module using visual tools. IVE integrates these visual experts into a three-stage training pipeline (pretraining, multi-task instruct tuning, and specific fine-tuning) and employs structural cues as prompts during training and inference to guide the LLM. Across diverse VQA, OCR, and document/chart benchmarks, IVE demonstrates improved visual understanding and competitive or superior performance compared with state-of-the-art approaches, with ablations confirming the contributions of each component and the benefits of training-time knowledge integration. The results suggest that combining multiple specialized visual representations with explicit structural knowledge yields more faithful visual reasoning and robust multimodal dialogue capabilities, with potential impact on broader real-world multimodal interfacing tasks.

Abstract

Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a plethora of noteworthy contributions in recent months. The prevailing trend is to adopt data-driven methodologies, in which diverse instruction-following datasets are collected. However, these approaches share a persistent challenge: limited visual perception ability, because they rely on CLIP-like encoders to extract visual information from inputs. Though these encoders are pre-trained on billions of image-text pairs, they still suffer from information loss, given that textual captions only partially capture the contents depicted in images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, we introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLM training and inference pipeline, aiming to provide a more comprehensive and accurate summarization of visual inputs. Extensive experiments validate its effectiveness in advancing MLLMs, showing improved visual perception through the integration of visual experts.
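As a rough illustration of how outputs from visual tools could be passed to a language model as a textual prior, the sketch below serializes detected objects and OCR spans into a prompt prefix. It is a minimal sketch under stated assumptions, not the paper's implementation: the `Detection` and `OcrSpan` types, the `build_structural_prompt` function, and the prompt layout are all hypothetical, and the detector and OCR outputs are assumed to be already available as labels, strings, and pixel boxes.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) pixel coordinates


@dataclass
class Detection:
    """Hypothetical container for one detected object instance."""
    label: str
    box: Box


@dataclass
class OcrSpan:
    """Hypothetical container for one recognized text span."""
    text: str
    box: Box


def build_structural_prompt(
    detections: List[Detection],
    ocr_spans: List[OcrSpan],
    question: str,
) -> str:
    """Serialize tool outputs into a text prefix placed before the question.

    The layout (an "Objects:" line, a "Text:" line, then the question) is an
    assumption made for illustration only.
    """
    lines = []
    if detections:
        objs = "; ".join(f"{d.label} at {list(d.box)}" for d in detections)
        lines.append(f"Objects: {objs}")
    if ocr_spans:
        texts = "; ".join(f'"{s.text}" at {list(s.box)}' for s in ocr_spans)
        lines.append(f"Text: {texts}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt = build_structural_prompt(
    [Detection("dog", (10, 20, 110, 220))],
    [OcrSpan("STOP", (5, 5, 50, 30))],
    "What does the sign say?",
)
print(prompt)
```

Empty sections are skipped, so an image with no detected instances or text yields only the question line.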

Paper Structure (29 sections, 6 equations, 5 figures, 8 tables)

Introduction
Related Work
Vision-and-Language Pre-training
Multimodal Large Language Models
Our Approach
Preliminaries
Incorporating Visual Experts into MLLMs
Multi-task Encoders.
Structural Knowledge Enhancement.
Training Pipeline
Stage 1: Pretraining.
Stage 2: Multi-task Instruct Tuning.
Stage 3: Specific Fine-Tuning.
Experiments
Datasets
...and 14 more sections

Figures (5)

Figure 1: Examples from public image-text pairs. (a) Examples from COCO Caption [chen2015microsoft]. (b) Examples from LLaVA-Instruct-150K [liu2023visual]. The short textual captions in (a) make it difficult to comprehensively describe the corresponding image. The captions in (b) are more informative but still cannot describe the entirety of the image. The orange boxes in the image indicate objects that are ignored in the captions.
Figure 2: An illustration of our proposed approach. Two modules, i.e., the multi-task encoders and structural knowledge enhancement, are specifically designed in our framework. The multi-task encoders integrate multiple types of complementary encoders to collaboratively capture the latent information within visual inputs, i.e., the semantic information encoder, the low-level information encoder, and the document-related information encoder. In the structural knowledge enhancement module, our work mainly utilizes visual tools (RAM [zhang2023recognize] + GroundingDINO [liu2023grounding] and EasyOCR [easyocr]) to detect the instances and textual information inside images as the prior knowledge fed into the large language model.
Figure 3: Qualitative analysis of how structural knowledge enhancement improves spatial awareness. A1 represents the result without integrating structural knowledge, A2 represents the result with structural knowledge integrated in both the training and inference stages, and GT represents the ground truth. The red lines represent the wrong answers and the green lines denote the correct answers.
Figure 4: Visualized analysis of the proposed modules in IVE. A1 represents the result of using the semantic information encoder only, A2 represents the result of using both the semantic information encoder and the low-level information encoder, A3 represents the result of using all three encoders, A4 denotes the result of further integrating the structured knowledge in the inference phase, and A5 denotes the result of integrating the structured knowledge in both the training and inference phases. GT represents the ground truth. The red lines represent the wrong answers and the green lines denote the correct answers.
Figure 5: Comparisons among mPLUG-Owl2 [ye2023mplug2], QWen-VL-Plus [bai2023qwen], and our method.
