Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs

Yunxin Li; Zhenyu Liu; Baotian Hu; Wei Wang; Yuxin Ding; Xiaochun Cao; Min Zhang

TL;DR

This work introduces MKS2, a vision-enhanced framework that augments LLMs with open-world visual knowledge by integrating a Modular Visual Memory (MVM) into each transformer block and using a soft Mixture of Multimodal Experts (MoMEs) to fuse visual and textual knowledge during generation. The approach comprises two stages. In Visual Information Storage, the MVM is trained on image-caption and text-to-image retrieval objectives while the LLM remains frozen. In Multimodal Knowledge Collaboration, MoMEs combine a Visual Expert (MVM) and a Textual Expert (MLP) at the token level via LoRA-based adapters. Empirically, MKS2-Llama-2-13b achieves state-of-the-art zero-shot performance on seven natural language reasoning tasks and competitive results on image-text understanding, with ablations showing that both the visual memory and the mixture of experts contribute to language and multimodal tasks. The results suggest that storing and leveraging visual knowledge inside LLMs can improve reasoning and knowledge-based question answering without sacrificing textual capabilities, which highlights the practical potential of Vision-Enhanced LLMs for open-world reasoning and multimodal QA.

Abstract

Recent multimodal large language models (MLLMs) have demonstrated strong multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into the language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We term this approach LLMs for Vision, since it employs LLMs for visual understanding and reasoning. However, these MLLMs neglect the potential of using visual knowledge to enhance the overall capabilities of LLMs, a direction we call Vision Enhancing LLMs. In this paper, we propose MKS2, an approach that enhances LLMs by empowering Multimodal Knowledge Storage and Sharing within them. Specifically, we introduce Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs and designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially improves the reasoning capabilities of LLMs in contexts that require physical or commonsense knowledge. It also delivers competitive results on image-text understanding benchmarks. The code will be available at: https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing
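
To make the idea of a soft mixture concrete, the following is a minimal sketch of blending two experts for a single token. It is not the authors' implementation: the softmax-over-linear-scores gate, the `router_rows` parameter, and the toy `visual_expert` and `textual_expert` functions are all assumptions chosen for illustration, and the paper's own equations may differ.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_mixture(
    hidden: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    router_rows: Sequence[Vector],
) -> Vector:
    """Blend expert outputs for ONE token using softmax gate weights.

    `router_rows` holds one hypothetical routing vector per expert; the
    gate score of an expert is the dot product of its row with `hidden`.
    Every expert contributes (soft mixing), none is dropped.
    """
    gates = softmax([dot(row, hidden) for row in router_rows])
    outputs = [expert(hidden) for expert in experts]
    return [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(hidden))
    ]


# Toy stand-ins for the two experts (hypothetical, not the paper's modules).
def visual_expert(h: Vector) -> Vector:
    return [2.0 * x for x in h]


def textual_expert(h: Vector) -> Vector:
    return [x + 1.0 for x in h]


if __name__ == "__main__":
    token = [1.0, 0.0]
    # All-zero routing vectors give equal gate weights to both experts.
    router = [[0.0, 0.0], [0.0, 0.0]]
    print(soft_mixture(token, [visual_expert, textual_expert], router))
```

With equal gate weights the result is the plain average of the two expert outputs; making one routing vector align more strongly with the token's hidden state shifts the blend toward that expert without ever switching the other one off.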

Paper Structure (25 sections, 6 equations, 7 figures, 6 tables)

Introduction
Related Work
Large Language Model
Visual knowledge enhanced methods
LLMs for vision
Preliminaries
Supervised Fine-tuning
Multimodal Instruction-Following Tuning
Methodology
Visual Information Storage
Multimodal Knowledge Collaboration
Training and Data Recipes
Experiments
Datasets
Natural Language Processing Benchmarks
...and 10 more sections

Figures (7)

Figure 1: Comparisons between the proposed MKS2 and previous supervised fine-tuned (SFT) and multimodal LLMs. MKS2 focuses on improving LLMs with visual knowledge. VMN refers to the visual mapping network, transferring image encoding to the language space. MVM and MoMEs represent the proposed modular visual memory and the architecture of a soft mixture of multimodal experts in LLMs, respectively.
Figure 2: MKS2-Llama-2-13b achieves SOTA zero-shot performance on seven natural language reasoning tasks. This indicates that multimodal knowledge storage and sharing is effective for improving LLMs.
Figure 3: The overall workflow of MKS2. It realizes visual information storage and multimodal knowledge collaboration in LLMs. In the first stage, we introduce the modular visual memory (MVM) and train it through language-centric learning strategies on large-scale image-text pairs. We also present a soft mixture-of-multimodal experts (MoMEs) architecture to accomplish multimodal knowledge collaboration during text generation.
Figure 4: The detailed calculation process of the proposed soft mixture-of-multimodal experts (MoMEs) architecture. It aims to realize multimodal knowledge collaboration during text generation.
Figure 5: An illustration of cases generated by the pretrained MVM, where we add the generation loss as the supervision objective. Note that we assess image generation quality primarily to verify whether the added MVM effectively stores text-visual knowledge and establishes the connection between language and vision. The MVM is able to connect language to its corresponding visual imagination.
...and 2 more figures
