MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

Lulu Hu, Wenhu Xiao, Xin Chen, Xinhua Xu, Bowen Xu, Kun Li, Yongliang Tao

TL;DR

Modality-Aware Smoothing Quantization (MASQuant) is a post-training quantization framework for multimodal large language models with two components: Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and Cross-Modal Compensation (CMC), which addresses Cross-Modal Computational Invariance.

Abstract

Post-training quantization (PTQ) with computational invariance has demonstrated remarkable advances for Large Language Models (LLMs); however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-Modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive with state-of-the-art PTQ algorithms. Source code: https://github.com/alibaba/EfficientAI.
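
To make the smoothing idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It uses the published SmoothQuant rule s_j = max|X_j|^alpha / max|W_j|^(1-alpha) and computes it separately for two modalities. The activation scales, the tensor shapes, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    """SmoothQuant-style per-channel factor: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    return act_absmax ** alpha / weight_absmax ** (1.0 - alpha)

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_in, d_out))

# Hypothetical calibration activations: vision tokens span a much wider range than text tokens.
X_text = rng.normal(scale=1.0, size=(16, d_in))
X_vision = rng.normal(scale=20.0, size=(16, d_in))

w_absmax = np.abs(W).max(axis=1)  # one value per input channel
s_text = smoothing_factors(np.abs(X_text).max(axis=0), w_absmax)
s_vision = smoothing_factors(np.abs(X_vision).max(axis=0), w_absmax)

# A single factor computed over the mixed calibration set follows the larger range per channel.
X_all = np.vstack([X_text, X_vision])
s_unified = smoothing_factors(np.abs(X_all).max(axis=0), w_absmax)

# Computational invariance: (X / s) @ (diag(s) W) equals X @ W before quantization.
y_ref = X_text @ W
y_smooth = (X_text / s_text) @ (s_text[:, None] * W)

# Each modality-specific factor pairs with a differently scaled weight matrix.
W_text = s_text[:, None] * W
W_vision = s_vision[:, None] * W
```

In this toy setup the unified factor equals the channel-wise maximum of the two modality-specific factors, so the wider-range modality determines it. The last two lines show why modality-specific factors alone are not enough: the invariance identity holds for each factor only with its own scaled weight, so a single stored weight cannot satisfy it for both factors at once.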

Paper Structure (27 sections, 2 theorems, 27 equations, 6 figures, 7 tables)

Key Result

Theorem 1

[SQNR Degradation under Smoothing Misalignment] Consider a layer processing multimodal inputs with dominant modality $m$ and non-dominant modality $m'$. Let $\boldsymbol{\alpha}^{m,m'}_i = R^m_i / R^{m'}_i$ denote the range ratio at channel $i$, where $R^m_i$ and $R^{m'}_i$ are the activation ranges
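
The theorem statement is cut off in this listing, so the sketch below only illustrates the two quantities it names: the per-channel range ratio and the signal-to-quantization-noise ratio (SQNR). It is a simplified stand-in rather than the paper's derivation. It assumes that "range" means max minus min per channel, and it models misalignment by quantizing the non-dominant modality with a step size set by the dominant one. The helper names and the synthetic activations are hypothetical.

```python
import numpy as np

def quantize(x, scale, n_bits=8):
    """Symmetric uniform fake quantization with a given step size."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def sqnr_db(x, x_q):
    """SQNR in decibels: 10 * log10(signal power / quantization noise power)."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_q) ** 2))

rng = np.random.default_rng(0)
X_m = rng.normal(scale=20.0, size=(64, 8))   # dominant modality m (assumed wider range)
X_mp = rng.normal(scale=1.0, size=(64, 8))   # non-dominant modality m'

# Range ratio per channel i, assuming R_i = max_i - min_i over calibration tokens.
R_m = X_m.max(axis=0) - X_m.min(axis=0)
R_mp = X_mp.max(axis=0) - X_mp.min(axis=0)
range_ratio = R_m / R_mp

qmax = 2 ** 7 - 1
scale_own = np.abs(X_mp).max() / qmax   # step size fitted to m' itself
scale_dom = np.abs(X_m).max() / qmax    # step size dictated by the dominant modality

sqnr_own = sqnr_db(X_mp, quantize(X_mp, scale_own))
sqnr_shared = sqnr_db(X_mp, quantize(X_mp, scale_dom))
print(f"range ratio per channel: {np.round(range_ratio, 1)}")
print(f"SQNR of m' with its own step: {sqnr_own:.1f} dB, with the dominant step: {sqnr_shared:.1f} dB")
```

The larger the range ratio, the coarser the step that the non-dominant modality inherits, and the lower its SQNR in this simplified model.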

Figures (6)

  • Figure 1: (a) Activation distributions during multimodal reasoning in MLLMs. Different dominant modalities emerge across MLLM components, leading to failure of general PTQ methods that diminish vision importance. (b) Impact of SmoothQuant's uniform smoothing factors $S$ computation on MLLM quantization performance (low SQNR, high PPL). (c) MASQuant addresses smoothing misalignment through the combination of MAS and CMC, thereby significantly enhancing PTQ performance in MLLMs. MBR Loss indicates Modality Balanced Reconstruction Loss.
  • Figure 2: Comparative analysis of SQNR degradation of Qwen2.5-Omni-3B under multimodal input condition. We selected 32 samples from OmniBench and computed the average SQNR for each layer.
  • Figure 3: The illustrated case demonstrates a text-vision dual-modal setting. (a) Schematic workflow of MAS and CMC with calibration data, (b) Illustration of how low-rank matrices L1 and L2 in CMC are utilized in MASQuant, exemplified with an MLP block.
  • Figure 4: Percentage of unified smoothing factors from different modalities using SmoothQuant on Omni and VL MLLMs.
  • Figure 5: (a) The effective rank of $\Delta\mathbf{W}$ is reduced across layers after SVD-based whitening. (b) SQNR improves as the rank ratio increases on both Qwen2.5-VL-3B and Qwen2.5-Omni-3B.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2: Optimal Low-rank Compensation
  • proof
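
Theorem 2 is listed here by title only, so its statement is not reproduced. As general background on low-rank compensation, the sketch below shows plain truncated SVD, which gives the best rank-r approximation of a matrix in Frobenius norm (the Eckart-Young theorem). It omits the SVD whitening step mentioned in the abstract and is not the authors' procedure. The synthetic $\Delta\mathbf{W}$, the chosen rank, and the function name are assumptions; the factor names L1 and L2 are borrowed from the Figure 3 caption.

```python
import numpy as np

def low_rank_factors(delta_w, rank):
    """Return L1 (d_in x r) and L2 (r x d_out) with L1 @ L2 the best rank-r fit to delta_w."""
    U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]   # scale the leading left singular vectors by singular values
    L2 = Vt[:rank, :]
    return L1, L2

rng = np.random.default_rng(0)
d_in, d_out, true_rank = 16, 12, 3

# Hypothetical weight difference: approximately rank 3 plus small noise.
delta_w = rng.normal(size=(d_in, true_rank)) @ rng.normal(size=(true_rank, d_out))
delta_w += 1e-3 * rng.normal(size=(d_in, d_out))

L1, L2 = low_rank_factors(delta_w, rank=3)
rel_err = np.linalg.norm(delta_w - L1 @ L2) / np.linalg.norm(delta_w)
print(f"relative Frobenius error at rank 3: {rel_err:.2e}")
```

Storing L1 and L2 costs r * (d_in + d_out) values instead of d_in * d_out, which is why a low effective rank of the difference matters for a compensation term of this kind.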