Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup

Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao

TL;DR

A novel multi-stage compression strategy for AntGMM that achieved a substantial reduction in latency, decreasing it from 700ms to 90ms, while maintaining online performance with only a slight performance decrease, and is estimated to reduce electricity consumption by approximately 75 million kWh annually.

Abstract

The deployment of Large Multimodal Models (LMMs) within AntGroup has significantly advanced multimodal tasks in payment, security, and advertising, notably enhancing advertisement audition tasks in Alipay. However, the deployment of such sizable models introduces challenges, particularly in increased latency and carbon emissions, which are antithetical to the ideals of Green AI. This paper introduces a novel multi-stage compression strategy for our proprietary LLM, AntGMM. Our methodology pivots on three main aspects: employing small training sample sizes, addressing multi-level redundancy through multi-stage pruning, and introducing an advanced distillation loss design. In our research, we constructed a dataset, the Multimodal Advertisement Audition Dataset (MAAD), from real-world scenarios within Alipay, and conducted experiments to validate the reliability of our proposed strategy. Furthermore, the effectiveness of our strategy is evident in its operational success in Alipay's real-world multimodal advertisement audition for three months from September 2023. Notably, our approach achieved a substantial reduction in latency, decreasing it from 700ms to 90ms, while maintaining online performance with only a slight performance decrease. Moreover, our compressed model is estimated to reduce electricity consumption by approximately 75 million kWh annually compared to the direct deployment of AntGMM, demonstrating our commitment to green AI initiatives. We will publicly release our code and the MAAD dataset after some reviews (https://github.com/MorinW/AntGMM_Pruning).

Paper Structure (28 sections, 5 equations, 12 figures, 4 tables, 1 algorithm)

Figures (12)

  • Figure 1: Model structure of the current version of AntGMM. We adopted a structure akin to BLIP-2 as the cornerstone within Ant Group.
  • Figure 2: Block pruning. Block pruning involves removing the last layer (or a few last layers) at each iteration, followed by distillation to ensure minimal loss in model performance.
  • Figure 3: Intermediate-module dimension pruning. The second stage focuses on dimensionality pruning of the hidden layers of FFNs and attention mechanisms within blocks.
  • Figure 4: Input dimension pruning. The final stage reduces the number of parameters associated with both the input and output dimensions. This requires synchronized reduction across all AntGMM blocks.
  • Figure 5: A data sample from the dataset. It involves an advertisement image accompanied by a Chinese text segment. The text includes a carefully constructed prompt, possible assistant details, and a classification description of the advertisement. Some sensitive information has been processed.
  • ...and 7 more figures
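
The three pruning stages named in the captions of Figures 2 to 4 can be illustrated on a toy stack of FFN blocks. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the block layout, the helper names (`make_block`, `prune_blocks`, `prune_hidden`, `prune_model_dim`), and the squared-weight importance score are all hypothetical, and the distillation step that the captions say follows pruning is omitted entirely.

```python
import random

def make_block(d_model, d_hidden, rng):
    """Toy FFN block (hypothetical layout): w1 is d_model x d_hidden, w2 is d_hidden x d_model."""
    w1 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)]
    w2 = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    return {"w1": w1, "w2": w2}

def top_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:k])

def prune_blocks(blocks, n_drop):
    """Stage 1 (block pruning): remove the last n_drop blocks."""
    return blocks[:len(blocks) - n_drop]

def prune_hidden(block, k):
    """Stage 2 (intermediate dimension): keep k hidden units, chosen per block.

    Importance is the squared weight mass touching each hidden unit,
    an assumed stand-in criterion rather than the paper's.
    """
    d_hidden = len(block["w2"])
    scores = [
        sum(row[j] ** 2 for row in block["w1"]) + sum(v ** 2 for v in block["w2"][j])
        for j in range(d_hidden)
    ]
    keep = top_indices(scores, k)
    return {
        "w1": [[row[j] for j in keep] for row in block["w1"]],
        "w2": [block["w2"][j] for j in keep],
    }

def prune_model_dim(blocks, k):
    """Stage 3 (input/output dimension): keep the same k dimensions in every block."""
    d_model = len(blocks[0]["w1"])
    scores = [0.0] * d_model
    for b in blocks:
        for i in range(d_model):
            scores[i] += sum(v ** 2 for v in b["w1"][i]) + sum(row[i] ** 2 for row in b["w2"])
    keep = top_indices(scores, k)  # one shared index set keeps the blocks compatible
    pruned = [
        {"w1": [b["w1"][i] for i in keep],
         "w2": [[row[i] for i in keep] for row in b["w2"]]}
        for b in blocks
    ]
    return pruned, keep

rng = random.Random(0)
blocks = [make_block(8, 16, rng) for _ in range(4)]
blocks = prune_blocks(blocks, 1)                  # 4 blocks -> 3 blocks
blocks = [prune_hidden(b, 8) for b in blocks]     # hidden width 16 -> 8
blocks, kept_dims = prune_model_dim(blocks, 6)    # model width 8 -> 6, shared
shape = (len(blocks), len(blocks[0]["w1"]), len(blocks[0]["w2"]))
```

The point of the sketch is the asymmetry between stages: hidden units can be selected independently in each block, whereas the input and output dimensions must be cut with a single shared index set, which matches the caption's remark that the final stage requires synchronized reduction across all blocks.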