
Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim

TL;DR

Mixture of Scales (BinaryMoS) is a novel binarization technique that surpasses conventional binarization techniques on various natural language processing tasks and even outperforms 2-bit quantization methods, while keeping a model size similar to that of static binarization techniques.

Abstract

Binarization, which converts weight parameters to binary values, has emerged as an effective strategy to reduce the size of large language models (LLMs). However, typical binarization techniques significantly diminish the linguistic effectiveness of LLMs. To address this issue, we introduce a novel binarization technique called Mixture of Scales (BinaryMoS). Unlike conventional methods, BinaryMoS employs multiple scaling experts for binary weights, dynamically merging these experts for each token to adaptively generate scaling factors. This token-adaptive approach boosts the representational power of binarized LLMs by enabling contextual adjustments to the values of binary weights. Moreover, because this adaptive process involves only the scaling factors rather than the entire weight matrix, BinaryMoS maintains compression efficiency similar to that of traditional static binarization methods. Our experimental results reveal that BinaryMoS surpasses conventional binarization techniques on various natural language processing tasks and even outperforms 2-bit quantization methods, all while maintaining a model size similar to that of static binarization techniques.
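
The token-adaptive scaling described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so several details here are assumptions: a softmax router (`w_router`) produces per-token gating scores, the expert scale vectors (`s_in`, `s_out`) are merged by a gate-weighted sum, and the merged scales are applied on the input and output dimensions around a ±1 weight matrix. All names and shapes are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def binary_mos_linear(x, w_sign, s_in, s_out, w_router):
    """Sketch of a linear layer with token-adaptive scaling (assumed form).

    x        : (tokens, n)  input activations
    w_sign   : (m, n)       binary weights with entries in {-1, +1}
    s_in     : (e, n)       one input-dimension scale vector per expert
    s_out    : (e, m)       one output-dimension scale vector per expert
    w_router : (n, e)       router weights (assumption: linear + softmax)
    """
    gate = softmax(x @ w_router)   # (tokens, e) gating scores per token
    scale_in = gate @ s_in         # (tokens, n) merged input scales
    scale_out = gate @ s_out       # (tokens, m) merged output scales
    y = ((x * scale_in) @ w_sign.T) * scale_out
    return y, gate

# Toy example with 4 scaling experts (all values are random placeholders).
rng = np.random.default_rng(0)
tokens, n, m, e = 3, 8, 6, 4
x = rng.standard_normal((tokens, n))
w_sign = np.where(rng.standard_normal((m, n)) >= 0, 1.0, -1.0)
s_in = rng.uniform(0.5, 1.5, (e, n))
s_out = rng.uniform(0.5, 1.5, (e, m))
w_router = rng.standard_normal((n, e))

y, gate = binary_mos_linear(x, w_sign, s_in, s_out, w_router)
```

With a single expert the gate is always 1, so the sketch reduces to static scaling with one fixed pair of scale vectors. Only the small scale and router tensors depend on the number of experts; the binary matrix `w_sign` is shared by every token.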

Paper Structure (23 sections, 8 equations, 4 figures, 7 tables)


Figures (4)

  • Figure 1: A brief overview of various LLM binarization methods. PB-LLM involves both a binary weight matrix and a high-precision, sparse weight matrix, and BiLLM stores four types of binary weight matrices. OneBit simplifies the layer structure by introducing scaling factors for input and output dimensions respectively. BinaryMoS introduces multiple scaling experts to enhance the capacity of binarized models.
  • Figure 2: Illustration of the proposed BinaryMoS scheme. BinaryMoS introduces a mixture-of-scales approach to generate token-adaptive scaling factors.
  • Figure 3: (a) Gating scores of the 4 scaling experts in the 18th layer of the LLaMA-1-7B model for each token in the input sequence. (b) Distribution of the values of the token-adaptive scaling factors. The boxplot shows the distribution of token-adaptive scaling factors across processed tokens. The box spans the interquartile range, covering the middle 50% of the scaling factors. The whiskers extend from the box to the furthest data points within 1.5 times the interquartile range, indicating the overall range of the data.
  • Figure 4: Comparison of generation quality on the LLaMA-1-13B models with BinaryMoS and OneBit.
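
The claim that multiple scaling experts add little to model size, relative to a single pair of input and output scaling factors as in the OneBit layout of Figure 1, can be checked with a rough storage count. This is a back-of-the-envelope sketch, not the paper's accounting: the 4096 x 4096 layer size, the 16-bit scale precision, and the per-layer router of `n * experts` weights are all assumptions chosen for illustration. The figure of 4 experts matches the Figure 3 caption.

```python
def static_bits(n, m, scale_bits=16):
    # 1 bit per binary weight, plus one input and one output scale vector.
    return n * m + scale_bits * (n + m)

def mos_bits(n, m, experts=4, scale_bits=16):
    # 1 bit per binary weight, plus `experts` scale vectors per dimension
    # and an assumed n x experts router stored at the same precision.
    return n * m + scale_bits * (experts * (n + m) + n * experts)

# Hypothetical layer shape, chosen only for illustration.
n = m = 4096
static = static_bits(n, m)
mos = mos_bits(n, m)
overhead = mos / static - 1   # relative size increase from the extra experts

print(f"static: {static} bits, multi-expert: {mos} bits, overhead: {overhead:.1%}")
```

Under these assumptions the overhead is a few percent, because the binary weight matrix grows with `n * m` while the expert scales and router grow only with `n + m`.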