LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie

TL;DR

This work tackles insufficient visual knowledge extraction in Multimodal Large Language Models by introducing LION, a model that injects dual-level visual knowledge: fine-grained spatial awareness through region-level VL tasks and a soft, high-level semantic cue via image tags. It addresses internal conflicts between image-level and region-level objectives with a stage-wise instruction-tuning strategy and a Mixture-of-Adapters router, coupled with a Vision Aggregator to fuse multi-level vision features. The approach yields state-of-the-art results across image-captioning, VQA, and visual grounding benchmarks, and demonstrates reduced object hallucination and stronger reasoning capabilities on robustness tests. The combination of progressive knowledge incorporation and soft semantic prompting offers a practical path to more reliable and capable MLLMs in real-world multimodal tasks.

Abstract

Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction of, and reasoning over, visual knowledge. To address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator that works together with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme lets the two kinds of VL tasks promote each other. 2) Soft prompting of high-level semantic visual evidence. We provide the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the influence of imperfect predicted tags, we propose a soft prompting method that embeds a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g., improvements of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, and 5% accuracy on RefCOCOg over Kosmos-2).
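
To make the mixture-of-adapters idea concrete, here is a minimal sketch in plain Python of how a task-dependent router could blend two adapters on top of a frozen layer. It is an illustration under stated assumptions, not the paper's implementation: the class name MixtureOfAdapters, the softmax gating, the per-task logits, and the toy adapters are all hypothetical. The abstract only states that a mixture-of-adapters is used to reduce the conflict between image-level and region-level tasks.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MixtureOfAdapters:
    """Hypothetical sketch: frozen layer output plus gated adapter outputs.

    ASSUMPTION: the gate is a softmax over per-task logits. The source does
    not specify the router's form, only that it depends on the task type.
    """

    def __init__(self, frozen_layer: Layer, adapters: List[Layer],
                 router_logits: Dict[str, List[float]]) -> None:
        self.frozen_layer = frozen_layer
        self.adapters = adapters
        self.router_logits = router_logits  # task type -> one logit per adapter

    def gates(self, task_type: str) -> List[float]:
        return softmax(self.router_logits[task_type])

    def forward(self, x: Vector, task_type: str) -> Vector:
        out = list(self.frozen_layer(x))
        for gate, adapter in zip(self.gates(task_type), self.adapters):
            delta = adapter(x)
            out = [o + gate * d for o, d in zip(out, delta)]
        return out


# Toy stand-ins (hypothetical): identity "frozen" layer and two linear adapters.
frozen = lambda x: list(x)
image_level_adapter = lambda x: [0.1 * v for v in x]
region_level_adapter = lambda x: [-0.2 * v for v in x]

moa = MixtureOfAdapters(
    frozen,
    [image_level_adapter, region_level_adapter],
    {"image_level": [2.0, 0.0], "region_level": [0.0, 2.0]},
)
print(moa.gates("image_level"))
print(moa.forward([1.0, 2.0], "region_level"))
```

With these toy logits, an image-level input leans mostly on the image-level adapter and a region-level input mostly on the region-level one, while both adapters still contribute. That is one way the two kinds of knowledge could be shared rather than kept fully separate.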

Paper Structure (25 sections, 5 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Comparison between existing MLLMs and LION. The existing MLLM generates a vague and inaccurate response, while LION provides a more precise and contextually accurate description by progressively incorporating spatial-aware knowledge and softly prompting semantic visual evidence.
  • Figure 2: Compared to recently proposed MLLMs, LION achieves state-of-the-art performance across a wide range of VL tasks.
  • Figure 3: Overview of the proposed LION. The model extracts holistic visual features from Q-Former and combines them with fine-grained spatial-aware visual features from the vision aggregator. The Mixture-of-Adapters with a router in the frozen LLM dynamically fuses visual knowledge learned from different visual branches and LLM adapters based on the task type (image-level or region-level).
  • Figure 4: The stage-wise instruction-tuning strategy. Stage 1: We instruction-tune Q-Former and the image-level adapter on image-level VL tasks. Stage 2: We instruction-tune the vision aggregator (VA), MLP, and the region-level adapter on region-level VL tasks. Stage 3: The Mixture-of-Adapters is devised to form a unified model for instruction-tuning on both kinds of VL tasks.
  • Figure 5: Instruction template with soft prompt. We use a well-designed instruction template with trainable soft prompts to inject the image tags generated by the RAM model into LION.
  • ...and 2 more figures
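
The stage-wise schedule described in the Figure 4 caption can be sketched as a small configuration table that records which components are tuned at each stage. This is a hedged illustration: the module names are hypothetical labels for the components named in the captions, and the trainable set for Stage 3 is an assumption, since the caption only says that the Mixture-of-Adapters forms a unified model tuned on both kinds of tasks. The LLM is kept frozen, as stated in the Figure 3 caption.

```python
from typing import Dict, List, Set, TypedDict


class Stage(TypedDict):
    tasks: List[str]
    trainable: Set[str]


# Hypothetical labels for the components named in the figure captions.
ALL_MODULES: Set[str] = {
    "llm",                   # frozen throughout (Figure 3 caption)
    "q_former",
    "image_level_adapter",
    "vision_aggregator",
    "mlp",
    "region_level_adapter",
    "router",
}

STAGES: Dict[int, Stage] = {
    # Stage 1: Q-Former and image-level adapter on image-level VL tasks.
    1: {"tasks": ["image_level"],
        "trainable": {"q_former", "image_level_adapter"}},
    # Stage 2: vision aggregator, MLP, region-level adapter on region-level tasks.
    2: {"tasks": ["region_level"],
        "trainable": {"vision_aggregator", "mlp", "region_level_adapter"}},
    # Stage 3: unified tuning on both task types.
    # ASSUMPTION: only the router is listed as trainable here; the caption
    # does not say which parameters are updated in this stage.
    3: {"tasks": ["image_level", "region_level"],
        "trainable": {"router"}},
}


def trainable_flags(stage: int) -> Dict[str, bool]:
    """Return a module -> trainable flag mapping for the given stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    trainable = STAGES[stage]["trainable"]
    return {name: name in trainable for name in sorted(ALL_MODULES)}


for stage_id in sorted(STAGES):
    tuned = [name for name, on in trainable_flags(stage_id).items() if on]
    print(f"stage {stage_id}: tasks={STAGES[stage_id]['tasks']} tuned={tuned}")
```

In a real training loop, a mapping like this would typically drive which parameters receive gradients at each stage. That wiring depends on the framework and is left out here.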