Towards Versatile and Efficient Visual Knowledge Integration into Pre-trained Language Models with Cross-Modal Adapters

Xinyun Zhang, Haochen Tan, Han Wu, Bei Yu

TL;DR

This work tackles the lack of visual knowledge in text-only pre-trained language models by introducing X-adapter, a plug-and-play module that fuses CLIP-derived image and text representations into PLMs through two sub-modules: V-expert for images and T-expert for text. Only the added adapter parameters are trained, enabling memory-efficient adaptation while leveraging pre-trained vision-language capabilities. Empirical results show large gains on zero-shot object color reasoning (approximately +32% for BERT-base and +24% for RoBERTa-base) and consistent improvements on GLUE-based NLU tasks, with careful ablations highlighting the importance of CLIP features, retrieval strategies, and insertion positions. The approach offers a versatile, scalable path to integrate multimodal knowledge into PLMs, improving visual commonsense reasoning and language understanding without full fine-tuning of large models.

Abstract

Humans learn language via multi-modal knowledge. However, because they are pre-trained on text alone, most existing pre-trained language models (PLMs) have no access to multi-modal information. To inject visual knowledge into PLMs, existing methods incorporate either the text or the image encoder of vision-language models (VLMs) to encode the visual information, and they update all the original parameters of the PLMs for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, to flexibly leverage the aligned visual and textual knowledge learned in pre-trained VLMs and efficiently inject it into PLMs. Specifically, we insert X-adapters into PLMs, and only the added parameters are updated during adaptation. To fully exploit the potential of VLMs, X-adapters consist of two sub-modules, V-expert and T-expert, which fuse the VLMs' image and text representations, respectively. Different sub-modules can be activated depending on the downstream task. Experimental results show that our method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
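
The adapter idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the class names `CrossModalExpert` and `XAdapter`, the single-head cross-attention with a residual connection, the sequential application of the two experts, and all dimensions are assumptions made for illustration. The PLM and VLM themselves are not modelled; their outputs are stand-in lists of numbers, and only the expert matrices would be updated during adaptation.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class CrossModalExpert:
    """Hypothetical expert: PLM token states attend over VLM features."""

    def __init__(self, d_plm, d_vlm, rng):
        self.d = d_plm
        # These three matrices are the only trainable parameters in this sketch.
        self.W_q = rand_matrix(d_plm, d_plm, rng)
        self.W_k = rand_matrix(d_plm, d_vlm, rng)
        self.W_v = rand_matrix(d_plm, d_vlm, rng)

    def __call__(self, hidden, vlm_feats):
        keys = [matvec(self.W_k, f) for f in vlm_feats]
        values = [matvec(self.W_v, f) for f in vlm_feats]
        out = []
        for h in hidden:
            q = matvec(self.W_q, h)
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(self.d) for k in keys]
            weights = softmax(scores)
            fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(self.d)]
            # Residual connection: the frozen PLM state passes through unchanged.
            out.append([hi + fi for hi, fi in zip(h, fused)])
        return out

    def num_parameters(self):
        return sum(len(W) * len(W[0]) for W in (self.W_q, self.W_k, self.W_v))


class XAdapter:
    """Hypothetical wrapper holding a V-expert and a T-expert."""

    def __init__(self, d_plm, d_vlm, rng):
        self.v_expert = CrossModalExpert(d_plm, d_vlm, rng)
        self.t_expert = CrossModalExpert(d_plm, d_vlm, rng)

    def __call__(self, hidden, image_feats=None, text_feats=None):
        # An expert is active only when its features are supplied.
        if image_feats is not None:
            hidden = self.v_expert(hidden, image_feats)
        if text_feats is not None:
            hidden = self.t_expert(hidden, text_feats)
        return hidden


rng = random.Random(0)
adapter = XAdapter(d_plm=4, d_vlm=6, rng=rng)
plm_states = [[0.1 * i] * 4 for i in range(3)]            # 3 tokens, stand-in PLM states
image_feats = [[rng.random() for _ in range(6)] for _ in range(5)]  # 5 stand-in image features
fused_states = adapter(plm_states, image_feats=image_feats)  # V-expert only
print(len(fused_states), len(fused_states[0]))  # prints: 3 4
```

In this sketch, passing only `image_feats` corresponds to activating the V-expert alone, and passing only `text_feats` to activating the T-expert alone. Which sub-module suits which task, and where the adapters are inserted in the PLM, are questions the paper addresses in its experiments and ablations rather than something this sketch settles.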

Paper Structure (32 sections, 6 equations, 6 figures, 13 tables, 1 algorithm)

Figures (6)

  • Figure 1: The main idea of X-adapters. For different downstream tasks we activate different sub-modules in X-adapters to fully exploit the VLMs. During adaptation, only X-adapters' parameters are updated.
  • Figure 2: (a): The main architecture of our proposed method; (b): The detailed architecture of V-expert; (c): The detailed architecture of T-expert.
  • Figure 3: Ablation study on the mask ratio. (a) Performance for T-expert; (b) Performance for V-expert.
  • Figure 4: Ablation study on the number of images retrieved for the input text.
  • Figure 5: Top-5 relevant images retrieved for the prompt "What is the color of the banana? It is [MASK]."
  • ...and 1 more figure