Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation

Jingjing Xie, Yuxin Zhang, Mingbao Lin, Liujuan Cao, Rongrong Ji

TL;DR

The paper tackles the high resource cost of vision-language (VL) instruction tuning for multimodal LLMs by introducing QSLAW, a quantization-aware scale learning method. QSLAW learns group-wise scale factors for the quantized LLM weights to mitigate the quantization error caused by activation outliers, and uses a multimodal warmup that progressively mixes linguistic and multimodal samples to prevent overfitting to multimodal data while preserving linguistic capabilities. Experiments on ScienceQA and a multimodal chatbot show that QSLAW matches or surpasses full-precision tuning while reducing VL tuning time and GPU consumption by up to 1.4 times, and that it outperforms QLoRA on multimodal tasks. The work offers a practical path toward affordable, scalable VL instruction tuning for large multimodal models.

Abstract

This paper presents the first study to explore the potential of parameter quantization for multimodal large language models to alleviate the significant resource constraint encountered during vision-language instruction tuning. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. This method is grounded in two key innovations: (1) The learning of group-wise scale factors for quantized LLM weights to mitigate the quantization error arising from activation outliers and achieve more effective vision-language instruction tuning; (2) The implementation of a multimodal warmup that progressively integrates linguistic and multimodal training samples, thereby preventing overfitting of the quantized model to multimodal data while ensuring stable adaptation of multimodal large language models to downstream vision-language tasks. Extensive experiments demonstrate that models quantized by QSLAW perform on par with, or even surpass, their full-precision counterparts, while facilitating up to 1.4 times reduction in VL tuning time and GPU consumption. Our code is released at https://github.com/xjjxmu/QSLAW.
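
To make the first innovation concrete, the following is a minimal sketch of group-wise weight quantization with one scale factor per group. It is an illustration under stated assumptions, not the authors' implementation: the uniform round-to-nearest grid, the function names `quantize_group` and `quantize_row`, and the way `scale` multiplies the step size are all hypothetical. In an actual training setup the scale would presumably be a trainable tensor updated by backpropagation; here it is a plain float so the effect is easy to inspect.

```python
# Sketch only: uniform asymmetric round-to-nearest quantization per group.
# ASSUMPTION: `scale` stands in for a learnable per-group factor; the exact
# parameterization used by QSLAW is not given in this summary.

def quantize_group(weights, bits=4, scale=1.0):
    """Quantize one group of weights and return the dequantized values."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Step size of the integer grid, stretched or shrunk by the scale factor.
    step = scale * (w_max - w_min) / levels
    if step == 0.0:
        return list(weights)  # constant group (or zero scale): leave unchanged
    dequantized = []
    for w in weights:
        q = round((w - w_min) / step)
        q = max(0, min(levels, q))  # clamp to the representable integer range
        dequantized.append(q * step + w_min)
    return dequantized


def quantize_row(row, group_size=4, bits=4, scales=None):
    """Split a weight row into groups and quantize each with its own scale."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    if scales is None:
        scales = [1.0] * len(groups)
    if len(scales) != len(groups):
        raise ValueError("need exactly one scale per group")
    out = []
    for group, s in zip(groups, scales):
        out.extend(quantize_group(group, bits=bits, scale=s))
    return out


if __name__ == "__main__":
    row = [0.0, 0.1, 0.2, 1.5, -0.3, 0.05, 0.0, 0.3]
    print(quantize_row(row, group_size=4, bits=4))
    # A scale below 1 narrows the grid, so the largest weight gets clipped.
    print(quantize_row(row, group_size=4, bits=4, scales=[0.5, 1.0]))
```

In this sketch a scale of 1.0 spans the full range of each group, while a smaller scale trades clipping of extreme values for finer resolution elsewhere. Which trade-off is best per group is what a learned scale would decide.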

Paper Structure (20 sections, 8 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Absolute magnitude of the input activation in one LLaVA-13B block. The left panel (image and text tokens) shows larger activation magnitudes than the right panel (text tokens only).
  • Figure 2: Loss and accuracy curves of scale training with different strategies on ScienceQA. Training the scales solely on multimodal data tends to make the LLM overfit to downstream tasks, as evidenced by a rapid decrease in loss while the accuracy remains mediocre.
  • Figure 3: Comparison among various VL instruction tuning paradigms, with examples from different multimodal instruction-following tasks including visual comprehension, image captioning, and multimodal reasoning. More detailed parts of the responses are marked in red, and misunderstandings in the responses are marked in sky blue.
  • Figure 4: GPT-4 scores for QSLAW and QLoRA. A higher score indicates higher quality; the reasons why QSLAW obtains a higher score are highlighted in red.
  • Figure 5: The training process with different strategies. With our multimodal warmup strategy, the training process exhibits faster and more stable fitting.
  • ...and 1 more figures
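
The multimodal warmup referred to in the captions of Figures 2 and 5 mixes linguistic and multimodal samples progressively during training. A minimal sketch of such a mixing schedule follows. Everything specific in it is an assumption for illustration: the linear ramp, the helper names `multimodal_fraction` and `batch_composition`, and the start and end ratios are hypothetical, and this summary does not state in which direction or at what rate the paper's schedule moves.

```python
# Sketch only: a linear schedule for the share of multimodal samples per batch.
# ASSUMPTION: the ramp shape and the start/end ratios are illustrative choices,
# not values taken from the paper.

def multimodal_fraction(step, warmup_steps, start, end):
    """Linearly interpolate the multimodal share of a batch during warmup."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return end
    progress = max(step, 0) / warmup_steps
    return start + (end - start) * progress


def batch_composition(step, batch_size, warmup_steps, start, end):
    """Return (num_multimodal, num_text) samples for the batch at `step`."""
    frac = multimodal_fraction(step, warmup_steps, start, end)
    num_multimodal = round(frac * batch_size)
    return num_multimodal, batch_size - num_multimodal


if __name__ == "__main__":
    # Arbitrary illustration values: ramp from 25% to 75% over 100 steps.
    for step in (0, 50, 100, 200):
        print(step, batch_composition(step, 8, 100, start=0.25, end=0.75))
```

Keeping `start` and `end` as explicit arguments leaves the direction of the ramp to the caller, so the same helper can express either a text-first or a multimodal-first schedule.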