Table of Contents
Fetching ...

Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors

Matt Gorbett, Hossein Shirazi, Indrakshi Ray

TL;DR

Tiled Bit Networks (TBNs) introduce learnable binary tiles to fill layer weights, enabling sub-bit (often 4x–8x) compression while preserving near full-precision performance across CNNs, Transformers, and MLP-based models. The method reshapes and aggregates weights during training to produce per-layer tiles and reuses a single tile per layer during inference, with optional per-tile scaling factors. Empirically, TBNs achieve substantial memory reductions and demonstrate viability on microcontrollers and GPUs, including strong results on CIFAR/ImageNet, PointNet, ViT, and time-series tasks. The work promises memory-efficient, on-device deployment of large models and outlines practical hyperparameter guidelines and two practical implementations for real-world use.

Abstract

Binary Neural Networks (BNNs) enable efficient deep learning by saving on storage and computational costs. However, as the size of neural networks continues to grow, meeting computational requirements remains a challenge. In this work, we propose a new form of quantization to tile neural network layers with sequences of bits to achieve sub-bit compression of binary-weighted neural networks. The method learns binary vectors (i.e. tiles) to populate each layer of a model via aggregation and reshaping operations. During inference, the method reuses a single tile per layer to represent the full tensor. We employ the approach to both fully-connected and convolutional layers, which make up the breadth of space in most neural architectures. Empirically, the approach achieves near fullprecision performance on a diverse range of architectures (CNNs, Transformers, MLPs) and tasks (classification, segmentation, and time series forecasting) with up to an 8x reduction in size compared to binary-weighted models. We provide two implementations for Tiled Bit Networks: 1) we deploy the model to a microcontroller to assess its feasibility in resource-constrained environments, and 2) a GPU-compatible inference kernel to facilitate the reuse of a single tile per layer in memory.

Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors

TL;DR

Tiled Bit Networks (TBNs) introduce learnable binary tiles to fill layer weights, enabling sub-bit (often 4x–8x) compression while preserving near full-precision performance across CNNs, Transformers, and MLP-based models. The method reshapes and aggregates weights during training to produce per-layer tiles and reuses a single tile per layer during inference, with optional per-tile scaling factors. Empirically, TBNs achieve substantial memory reductions and demonstrate viability on microcontrollers and GPUs, including strong results on CIFAR/ImageNet, PointNet, ViT, and time-series tasks. The work promises memory-efficient, on-device deployment of large models and outlines practical hyperparameter guidelines and two practical implementations for real-world use.

Abstract

Binary Neural Networks (BNNs) enable efficient deep learning by saving on storage and computational costs. However, as the size of neural networks continues to grow, meeting computational requirements remains a challenge. In this work, we propose a new form of quantization to tile neural network layers with sequences of bits to achieve sub-bit compression of binary-weighted neural networks. The method learns binary vectors (i.e. tiles) to populate each layer of a model via aggregation and reshaping operations. During inference, the method reuses a single tile per layer to represent the full tensor. We employ the approach to both fully-connected and convolutional layers, which make up the breadth of space in most neural architectures. Empirically, the approach achieves near fullprecision performance on a diverse range of architectures (CNNs, Transformers, MLPs) and tasks (classification, segmentation, and time series forecasting) with up to an 8x reduction in size compared to binary-weighted models. We provide two implementations for Tiled Bit Networks: 1) we deploy the model to a microcontroller to assess its feasibility in resource-constrained environments, and 2) a GPU-compatible inference kernel to facilitate the reuse of a single tile per layer in memory.
Paper Structure (18 sections, 9 equations, 8 figures, 7 tables, 1 algorithm)

This paper contains 18 sections, 9 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Tiling Illustration: A binary tile (left) of size $k=4$ is replicated four time to create a weight matrix of size 16 (right). Tiling is used during the training process of to learn vectors for populating the parameters of a model (as illustrated above). During inference, only a single tile needs to be referenced per layer (left) -- a specialized kernel can reuse the tile throughout layer computation for memory savings.
  • Figure 2: Composition of popular : The ResNet series is made up primarily of convolutional layers; MLPs (PointNet, MLPMixer) and Transformers (Swin-t, ViT, Mobile ViT) consist mostly of fully-connected parameters.
  • Figure 3: Tile Construction During Training: For each layer of a neural network, we train a standard weight tensor ($\mathbf{W}$) (left). During the training, we compress the parameter by a factor of $p$ by performing a reshaping (second column top) and then sum operation (second column middle). We use the straight-through estimator to binarize the vector $\mathbf{s}$, creating the tile $\mathbf{t}$ (bottom of the second column). We next create binary weights $\mathbf{B}$ from the resulting binary vector by tiling vector $\mathbf{t}$ two times and reshaping it to an $n \times m$ tensor (third column). Finally, we apply a scalar $\alpha$ over each of the two tiles, resulting in the final weight tensor $\mathbf{\hat{B}}$. During the inference, only a single tile is needed, along with a small number of $\alpha$ scalars.
  • Figure 4: We learn scalar $\alpha$ from tensor $\mathbf{A}$ by computing Equation \ref{['eq6']} or \ref{['eq8']} over its values ($\mathbf{W}$ can also be used in place of $\mathbf{A}$). The figure visualizes Equation \ref{['eq8']} which calculates $\alpha$ over each tile.
  • Figure 5: GPU memory allocated during model inference: We profile the memory of the ImageNet (left) and PointNet (right) during inference using a customized GPU kernel with full-precision weights. The x-axis represents memory recorded within intermediate model layers during execution of a PyTorch model. The tiled kernel achieves 2.8x memory reduction on the and 1.2x reduction on PointNet.
  • ...and 3 more figures