
Wings: Learning Multimodal LLMs without Text-only Forgetting

Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

TL;DR

Wings addresses text-only forgetting in multimodal LLMs by introducing parallel visual and textual learners built with Low-Rank Residual Attention (LoRRA) and an attention-based router that compensates for attention shifts around inserted visual tokens. The authors diagnose forgetting via the MLLM-Laws metric, showing cross-layer shifts between text segments before and after images correlate with performance drops, and use this insight to design Wings. Empirically, Wings improves text-only and multimodal QA, achieving state-of-the-art results on multiple benchmarks and excelling on the newly proposed Interleaved Image-Text (IIT) benchmark, while remaining efficient through low-rank adapters. The work demonstrates a general, resource-efficient strategy to retain language capabilities while enabling robust multimodal reasoning in mixed-input settings, offering practical impact for real-world vision-language systems.

Abstract

Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct extra modules that act as the boosted learner to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design the Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.
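
The abstract describes learners that sit in parallel with the main attention and a router that blends their outputs. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. It assumes that each LoRRA projection is an identity path plus a rank-r update, that the router reduces to a softmax over two logits, and that the learner outputs are added to the main attention output. The names `low_rank_residual`, `lorra`, `route`, and `zero_init` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def low_rank_residual(x, down, up):
    """Return x + x @ down @ up (assumed form: identity plus a rank-r update)."""
    delta = matmul(matmul(x, down), up)
    return [[a + b for a, b in zip(rx, rd)] for rx, rd in zip(x, delta)]


def lorra(hidden, feats, params):
    """Cross-attention: hidden states are queries, visual/textual feats are keys and values."""
    q = low_rank_residual(hidden, *params["q"])
    k = low_rank_residual(feats, *params["k"])
    v = low_rank_residual(feats, *params["v"])
    scale = math.sqrt(len(hidden[0]))
    keys_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def route(main_out, visual_out, textual_out, router_logits):
    """Blend learner outputs into the main attention output (assumed additive form)."""
    w_vis, w_txt = softmax(router_logits)
    return [[m + w_vis * a + w_txt * b for m, a, b in zip(rm, ra, rb)]
            for rm, ra, rb in zip(main_out, visual_out, textual_out)]


def zero_init(dim, rank):
    """Hypothetical LoRA-style init: non-zero down-projection, zero up-projection."""
    down = [[0.01 * (i + j + 1) for j in range(rank)] for i in range(dim)]
    up = [[0.0] * dim for _ in range(rank)]
    return down, up


params = {name: zero_init(dim=4, rank=2) for name in ("q", "k", "v")}
hidden = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.5]]  # two text positions
feats = [[1.0, 2.0, 3.0, 4.0]]                           # one visual token
learner_out = lorra(hidden, feats, params)
blended = route(hidden, learner_out, learner_out, router_logits=[0.0, 0.0])
```

With `zero_init`, every low-rank update is zero, so the learner starts out as plain cross-attention over `feats`. The rank-r factors are the only learner parameters in this sketch, which is where the efficiency of a low-rank design would come from.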

Paper Structure (17 sections, 6 equations, 8 figures, 2 tables)

This paper contains 17 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Examples of text-only and multimodal conversations. From left to right: interacting with an MLLM through text-only and interleaved instructions; performance radar charts for Wings, LLaVA-Next [liu2024llavanext], and DeepSeek-VL [lu2024deepseekvl] on text-only and multimodal QA tasks, with dark green marking Wings, which shows comprehensive performance across both; interacting with multimodal instructions.
  • Figure 2: Illustration of mixed visual-and-textual inputs and the Layer-level Attention Weights (Laws) with their properties. (a) The visual feature tokens from the visual encoder and projector are inserted into the textual feature sequence. (b) The proportion of attention weight on the textual tokens before the image, the image itself, and the textual tokens after the image, across layers. The red curve is from an MLLM with superior text-only performance, while the blue curve is from an inferior one. (c) Experiments on over $100$ MLLMs show a positive correlation between $\boldsymbol{\rho}$ for MLLM-Laws before and after the visual tokens ($x$-axis) and the text-only performance of the MLLM ($y$-axis).
  • Figure 3: The Wings model architecture. We introduce extra modules parallel to the main attention, serving as boosted learners to compensate for the attention shift. We first train the visual learners on one side, alleviating some of the shifted attention. Then, we collaboratively train the visual and textual learners, routing between them based on the shifted attention weights. They are like "wings" woven from light feathers.
  • Figure 4: Illustrations of the detailed Wings structure and training strategies. Wings is constructed from the Low-Rank Residual Attention (LoRRA) module, where the previous hidden state acts as the query and the visual/textual features serve as the key and value. Training starts with visual learners and projectors, followed by the dynamic attention-based routing.
  • Figure 5: Performance comparison on the newly constructed Interleaved Image-Text (IIT) benchmark across the LLaVA series, different learning rates, and different fine-tuned parts. The horizontal axis shows different multimodal question setups, e.g., (T, T, I) represents a visual question after two text-only QAs. The three subfigures represent different ablation settings, with violet representing our Wings.
  • ...and 3 more figures
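
The Figure 2 caption describes per-layer attention proportions on the before-image, image, and after-image segments, and a $\boldsymbol{\rho}$ relating the before and after curves. A minimal sketch of such a measurement follows. It assumes one attention-weight row per layer (for example, from the final query token), a single contiguous image span, and that $\boldsymbol{\rho}$ is a Pearson correlation across layers; none of this is confirmed by the captions, and the names `segment_shares`, `pearson`, and `laws_correlation` are hypothetical.

```python
import math


def segment_shares(attn_row, image_start, image_end):
    """Share of attention mass on text before the image, the image, and text after it."""
    total = sum(attn_row)
    before = sum(attn_row[:image_start]) / total
    image = sum(attn_row[image_start:image_end]) / total
    after = sum(attn_row[image_end:]) / total
    return before, image, after


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def laws_correlation(per_layer_attn, image_start, image_end):
    """Correlate the before-image and after-image attention shares across layers."""
    shares = [segment_shares(row, image_start, image_end) for row in per_layer_attn]
    before_curve = [s[0] for s in shares]
    after_curve = [s[2] for s in shares]
    return pearson(before_curve, after_curve)


# Toy input: 3 layers, 4 tokens, image tokens at positions 1 and 2.
toy_attn = [
    [0.1, 0.4, 0.4, 0.1],
    [0.2, 0.3, 0.3, 0.2],
    [0.3, 0.2, 0.2, 0.3],
]
rho = laws_correlation(toy_attn, image_start=1, image_end=3)
```

In this toy input the before-image and after-image shares move together across layers, so `rho` is close to 1. Curves that diverge across layers would give a lower value, which is the kind of shift the caption associates with weaker text-only performance.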