IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities

Bin Wang, Chunyu Xie, Dawei Leng, Yuhui Yin

TL;DR

The paper tackles the challenge of preserving NLP capabilities while equipping frozen large language models with robust multimodal understanding. It introduces the Inner-Adaptor Architecture (IAA), which inserts adaptable multimodal layers inside a frozen LLM and adds a dedicated embedding layer and LM head for multimodal inputs, enabling effective image-text interaction without fine-tuning the LLM. Through a carefully designed two-stage pre-training, instruction fine-tuning, and grounding fine-tuning regimen, IAA achieves state-of-the-art or competitive results on general multimodal benchmarks and visual grounding with substantially smaller data requirements, while maintaining strong text-only NLP performance. The deployment-friendly design supports dual workflows (multimodal and text-only) and shows improved memory efficiency, suggesting practical applicability and potential extension to additional modalities in future work.

Abstract

In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the fine-tuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models are available at https://github.com/360CVGroup/Inner-Adaptor-Architecture.
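The dual-workflow idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyIAA`, the helper `affine`, the use of elementwise affine maps in place of transformer layers, and the placement of each adaptor immediately before its frozen layer are all assumptions made for illustration. It shows only the structural point that the multimodal path routes through extra trainable components, while the text-only path touches nothing but the frozen model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def affine(scale: float, shift: float) -> Layer:
    """Toy stand-in for a layer: an elementwise affine map (hypothetical)."""
    return lambda h: [scale * x + shift for x in h]


@dataclass
class ToyIAA:
    frozen_layers: List[Layer]  # the M layers of the frozen LLM (never modified)
    text_embed: Layer           # frozen embedding layer
    text_head: Layer            # frozen LM head
    mm_embed: Layer             # trainable multimodal embedding layer
    mm_head: Layer              # trainable multimodal LM head
    # N <= M trainable insertion layers, keyed by the depth they attach to
    adaptors: Dict[int, Layer] = field(default_factory=dict)

    def forward(self, tokens: Vector, multimodal: bool) -> Vector:
        h = (self.mm_embed if multimodal else self.text_embed)(tokens)
        for depth, layer in enumerate(self.frozen_layers):
            if multimodal and depth in self.adaptors:
                # Assumed placement: adaptor runs before the frozen layer.
                h = self.adaptors[depth](h)
            h = layer(h)
        return (self.mm_head if multimodal else self.text_head)(h)


model = ToyIAA(
    frozen_layers=[affine(2.0, 0.0), affine(1.0, 1.0), affine(0.5, 0.0)],
    text_embed=affine(1.0, 0.0),
    text_head=affine(1.0, 0.0),
    mm_embed=affine(1.0, 0.0),
    mm_head=affine(1.0, 0.0),
    adaptors={0: affine(1.0, 0.0), 2: affine(1.0, 0.0)},
)

text_before = model.forward([1.0, 2.0], multimodal=False)
mm_before = model.forward([1.0, 2.0], multimodal=True)

# "Training" here just swaps in new adaptor and head weights;
# the frozen layers, embedding, and head are left untouched.
model.adaptors[0] = affine(3.0, 0.0)
model.mm_head = affine(1.0, 0.5)

text_after = model.forward([1.0, 2.0], multimodal=False)
mm_after = model.forward([1.0, 2.0], multimodal=True)

print(text_before == text_after)  # text-only path is unaffected
print(mm_before, mm_after)        # multimodal path changes
```

Because the text-only workflow never reads the adaptors or the multimodal embedding and head, updating them cannot change text-only outputs, which is the property the abstract relies on to preserve NLP performance.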

Paper Structure (33 sections, 2 equations, 8 figures, 8 tables)

This paper contains 33 sections, 2 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Results before and after training the LLaVA-1.5 architecture, based on the Qwen2 and Llama3 language models, on the text-only evaluation sets MMLU and C-Eval.
  • Figure 2: Overview of the proposed architecture, which mainly consists of two workflows: the Multimodal Workflow and the Text-only Workflow. The multimodal workflow, beyond the necessary image encoder and projector, integrates the Inner-Adaptor Architecture, including insertion layers, an embedding layer, and a language model head. Both workflows share the same large language model. The number of insertion layers is variable, where $N \leq M$. In this context, MM denotes MultiModal, EL stands for Embedding Layer, and LH represents the Language model Head.
  • Figure 3: Structural exploration of the Inner-Adaptor Architecture. Figure (a) is an architecture inspired by the ControlNet design; Figure (b) improves on Figure (a), mainly by removing the feature propagation between adaptors; Figure (c) is the final scheme.
  • Figure 4: Comparison on text-only question answering.
  • Figure 5: Samples of image comprehension and general knowledge question answering.
  • ...and 3 more figures