Highly Efficient and Effective LLMs with Multi-Boolean Architectures

Ba-Hien Tran, Van Minh Nguyen

TL;DR

This paper tackles the high cost of quantizing and binarizing LLMs by introducing MBOK, a native Boolean framework that trains weights directly in the Boolean domain using multiple Boolean kernels. It combines a rank-1 Boolean reformulation (SVID) with successive kernel extraction and knowledge distillation to transfer and refine information from a full-precision teacher, while automatically allocating kernels under a fixed budget. Empirically, MBOK achieves near-FP16 performance at extremely low bitrates (e.g., 2–3 kernels) and outperforms state-of-the-art ultra low-bit quantization and binarization baselines across OPT and Llama family models, with substantial memory and latency benefits. The approach is poised to enable efficient deployment on conventional hardware and motivates the development of dedicated Boolean accelerators to maximize gains in training and inference efficiency.
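To make the rank-1 Boolean reformulation and successive kernel extraction more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes each kernel approximates a matrix as `sign(W) * outer(a, b)`, where `outer(a, b)` is the best rank-1 fit of `|W|` taken from its leading singular triplet, and that each further kernel is extracted greedily from the residual of the previous ones. The function names (`svid_rank1`, `extract_kernels`, `reconstruct`, `boolean_linear`) are hypothetical, and knowledge distillation and kernel allocation are not modeled.

```python
import numpy as np


def svid_rank1(w):
    """Approximate w by sign * outer(a, b), with outer(a, b) a rank-1 fit of |w|.

    Assumption: the scaling vectors come from the leading singular triplet of |w|.
    """
    sign = np.where(w >= 0, 1.0, -1.0)  # Boolean kernel, stored here as +/-1
    u, s, vt = np.linalg.svd(np.abs(w), full_matrices=False)
    # The leading singular vectors of a non-negative matrix can be taken
    # non-negative; abs() removes the arbitrary global sign returned by the SVD.
    a = np.sqrt(s[0]) * np.abs(u[:, 0])
    b = np.sqrt(s[0]) * np.abs(vt[0, :])
    return sign, a, b


def extract_kernels(w, num_kernels):
    """Greedily extract kernels: each one is fitted to the current residual."""
    residual = np.array(w, dtype=float)
    kernels = []
    for _ in range(num_kernels):
        sign, a, b = svid_rank1(residual)
        kernels.append((sign, a, b))
        residual = residual - sign * np.outer(a, b)
    return kernels


def reconstruct(kernels):
    """Sum the per-kernel approximations back into a dense matrix."""
    return sum(sign * np.outer(a, b) for sign, a, b in kernels)


def boolean_linear(x, kernels):
    """Compute x @ W_approx.T without forming W_approx.

    Uses the identity x @ (S * outer(a, b)).T == ((x * b) @ S.T) * a,
    so the only matrix product per kernel involves the +/-1 matrix S.
    """
    return sum(((x * b) @ sign.T) * a for sign, a, b in kernels)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 6))
    for k in (1, 2, 3):
        err = np.linalg.norm(w - reconstruct(extract_kernels(w, k)))
        print(f"kernels={k}  relative error={err / np.linalg.norm(w):.3f}")
```

In this sketch the Frobenius error cannot increase as kernels are added, because each step removes the leading singular component of the residual's magnitude. How the actual method initializes, allocates, and finetunes the kernels is described in the paper itself and is not reproduced here.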

Abstract

Weight binarization has emerged as a promising strategy to reduce the complexity of large language models (LLMs). Existing approaches fall into post-training binarization, which is simple but causes severe performance loss, and training-aware methods, which depend on full-precision latent weights, adding complexity and limiting efficiency. We propose a novel framework that represents LLMs with multi-kernel Boolean parameters and, for the first time, enables direct finetuning of LLMs in the Boolean domain, eliminating the need for latent weights. This enhances representational capacity and dramatically reduces complexity during both finetuning and inference. Extensive experiments across diverse LLMs show our method outperforms recent ultra low-bit quantization and binarization techniques.

Paper Structure

This paper contains 74 sections, 12 theorems, 56 equations, 11 figures, 12 tables, 9 algorithms.

Key Result

Proposition 4.1

[xu2024onebit] For $\mathbf{W} \in \mathbb{R}^{m\times n}$, write $\mathbf{W} = \widetilde{\mathbf{U}} \widetilde{\boldsymbol{\Sigma}} \widetilde{\mathbf{V}}^{\top}$ for its SVD. Let $\mathbf{a} = \sqrt{\tilde{\ ...

Figures (11)

  • Figure 1: Finetuning OPT models [zhang2022opt] using our 3 Boolean kernels, compared to GPTQ [frantar2023optq], which quantizes the models to 3 bits, and the FP16 baseline on the C4 dataset.
  • Figure 2: Illustration of SVID.
  • Figure 3: The computation of a linear layer approximated using multiple Boolean kernels.
  • Figure 4: Illustration of successive extractions of Boolean kernels from a given weight matrix.
  • Figure 5: Normalized L1 norm difference between the approximated weights, at initialization and after finetuning, and the full-precision weights ($\|\mathbf{W}_{\text{approx}} - \mathbf{W}_{\text{FP}}\|_{1} / \|\mathbf{W}_{\text{FP}}\|_{1}$), and the final results.
  • ...and 6 more figures

Theorems & Definitions (35)

  • Proposition 4.1
  • Remark 4.2
  • Proposition 4.3
  • Definition A.1: Three-valued logic
  • Definition A.3
  • Definition A.4
  • Definition A.5: Mixed-type logic
  • Definition A.6
  • Definition A.7
  • Definition A.8: Type conversion
  • ...and 25 more