Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Yunxin Li, Xinyu Chen, Baotian Hu, Haoyuan Shi, Min Zhang

TL;DR

The Cognitive Visual-Language Mapper (CVLM) combines a pretrained Visual Knowledge Aligner (VKA) with a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage, and significantly improves the performance of LMMs on knowledge-based VQA.

Abstract

Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely used visual-language projection approaches (e.g., Q-former or MLP) focus on aligning images with text descriptions yet ignore visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, and helps improve the accuracy of answers to knowledge-based visual questions. In this paper, we explore improving LMMs with visual-language knowledge alignment, targeting challenging knowledge-based visual question answering (VQA) in particular. To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA around the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. The FKA distills the fine-grained visual knowledge of an image and injects it into Large Language Models (LLMs). Extensive experiments on knowledge-based VQA benchmarks show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (an average gain of 5.0%). Ablation studies further verify the effectiveness of both VKA and FKA. The code is available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper

Paper Structure (17 sections, 4 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Performance of LMMs on visual information-seeking questions. The bottom part shows the widely used architecture of open-source LMMs, where the visual mapping network is usually pretrained on massive image-text captioning data. All LMMs, including GPT-4V (date: 2023.11.17) and Gemini-Pro, make incorrect decisions.
  • Figure 2: An overview of the Cognitive Visual-Language Mapper (CVLM). From top to bottom, it shows 1) pretraining the visual knowledge aligner, where a pretrained small language model interacts with image features via a cross-attention module; 2) training the visual knowledge aligner with the LLM, where visual knowledge alignment between the vision encoder and the LLM is realized via learnable query tokens and a linear layer; 3) the overall architecture of CVLM, which adds the fine-grained visual knowledge adapter on top of the common visual projection (MLP) and the VKA.
  • Figure 3: The detailed calculation process of the fine-grained visual knowledge adapter, i.e., the FKA shown in Figure 2. "Visual + Knowledge" indicates the concatenation of the image representation $\mathbf{h}_{\text{IO}}$ and its relevant knowledge projection $\mathbf{h}_{\text{KO}}$.
  • Figure 4: Three cases comparing the performance of CVLM with that of other models. Red words mark the correct answer and purple words mark inaccurate responses.
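
The Figure 2 caption mentions learnable query tokens, a cross-attention module, and a linear layer that maps into the LLM space. The following is a minimal toy sketch of that general pattern in plain Python, not the authors' implementation. All names (`cross_attend`, `query_tokens`, `w_proj`, `w_vis`), the sizes, the single attention head, and the choice to concatenate along the token axis are assumptions made for illustration; random matrices stand in for trained weights and encoder outputs.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, image_feats):
    """Single-head cross-attention: each query token attends over all patches."""
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*image_feats)]  # transpose: d x n_patches
    scores = matmul(queries, keys_t)                   # n_queries x n_patches
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, image_feats), weights       # attended: n_queries x d

def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix, used here in place of trained parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_patches, n_queries, d_vis, d_llm = 6, 4, 8, 12   # toy sizes (hypothetical)

image_feats = rand_matrix(n_patches, d_vis, rng)   # stand-in for vision-encoder output
query_tokens = rand_matrix(n_queries, d_vis, rng)  # stand-in for learnable query tokens
w_proj = rand_matrix(d_vis, d_llm, rng)            # stand-in for the linear layer
w_vis = rand_matrix(d_vis, d_llm, rng)             # stand-in for the common visual projection

# Queries gather information from the image, then get projected to the LLM width.
attended, weights = cross_attend(query_tokens, image_feats)
knowledge_tokens = matmul(attended, w_proj)        # n_queries x d_llm

# "Visual + Knowledge": assumed here to be concatenation along the token axis.
visual_tokens = matmul(image_feats, w_vis)         # n_patches x d_llm
visual_plus_knowledge = visual_tokens + knowledge_tokens

print(len(knowledge_tokens), len(knowledge_tokens[0]))
print(len(visual_plus_knowledge), len(visual_plus_knowledge[0]))
```

In this sketch the number of knowledge tokens equals the number of query tokens regardless of how many image patches there are, which is the usual reason for using a fixed set of learnable queries: the LLM receives a constant-length knowledge prefix for every image.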