Trainable Bitwise Soft Quantization for Input Feature Compression

Karsten Schrödter, Jan Stenkamp, Nina Herrmann, Fabian Gieseke

TL;DR

This work proposes a task-specific, trainable feature quantization layer that compresses the input features of a neural network. It outperforms standard quantization methods while maintaining accuracy close to that of full-precision models.

Abstract

The growing demand for machine learning applications in the context of the Internet of Things calls for new approaches to optimize the use of limited compute and memory resources. Despite significant progress in reducing model sizes and improving efficiency, many applications still require remote servers to provide the necessary resources. However, such approaches rely on transmitting data from edge devices to remote servers, which may not always be feasible due to bandwidth, latency, or energy constraints. We propose a task-specific, trainable feature quantization layer that compresses the input features of a neural network. This can significantly reduce the amount of data that needs to be transferred from the device to a remote server. In particular, the layer allows each input feature to be quantized to a user-defined number of bits, enabling simple on-device compression at the time of data collection. The layer is designed to approximate step functions with sigmoids, which makes the quantization thresholds trainable. By concatenating the outputs of multiple sigmoids, introduced as bitwise soft quantization, it achieves trainable quantized values when integrated with a neural network. We compare our method to full-precision inference as well as to several quantization baselines. Experiments show that our approach outperforms standard quantization methods while maintaining accuracy levels close to those of full-precision models. In particular, depending on the dataset, compression factors of $5\times$ to $16\times$ can be achieved compared to $32$-bit input without significant performance loss.
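The construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature parameter `tau`, the parameterization `sigmoid((x - t) / tau)`, and the thermometer-style decoder are assumptions made for illustration. The sketch only shows how sigmoids can stand in for step functions, and how rounding recovers a hard quantizer whose code fits into $n$ bits when $M = 2^n - 1$ thresholds are used.

```python
import numpy as np

def soft_step(x, threshold, tau):
    """Sigmoid approximation of the step function 1[x > threshold].

    Written via tanh so that small temperatures do not overflow.
    """
    return 0.5 * (1.0 + np.tanh(0.5 * (x - threshold) / tau))

def soft_quantize(x, thresholds, tau):
    """Sum of soft steps: a smooth surrogate for the bin index of x."""
    x = np.asarray(x, dtype=float)[..., None]
    return soft_step(x, thresholds, tau).sum(axis=-1)

def bitwise_soft_quantize(x, thresholds, tau):
    """Concatenation of soft steps: one value in [0, 1] per threshold."""
    x = np.asarray(x, dtype=float)[..., None]
    return soft_step(x, thresholds, tau)

def encode(x, thresholds):
    """Hard encoder: number of (sorted) thresholds strictly below x."""
    return np.searchsorted(thresholds, x, side="left")

def decode(codes, num_thresholds):
    """Hypothetical decoder: expand each code into a 0/1 vector of steps."""
    codes = np.asarray(codes)[..., None]
    return (codes > np.arange(num_thresholds)).astype(float)

# n = 2 bits per feature, hence M = 2**2 - 1 = 3 thresholds (assumed values).
thresholds = np.array([-1.0, 0.0, 1.0])
x = np.array([-2.0, -0.5, 0.5, 2.0])

codes = encode(x, thresholds)                     # integers in [0, 3]
soft = soft_quantize(x, thresholds, tau=0.01)     # close to the codes
soft_bits = bitwise_soft_quantize(x, thresholds, tau=0.01)
print(codes, np.round(soft), np.round(soft_bits), sep="\n")
```

Because the soft versions are differentiable in `thresholds`, a gradient-based optimizer could adjust the thresholds jointly with the network weights; at low temperature, rounding the soft outputs agrees with the hard encoder and decoder. Under these assumptions, sending a 2-bit code instead of a 32-bit value corresponds to a $16\times$ reduction for that feature.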

Paper Structure

This paper contains 32 sections, 23 equations, 12 figures, and 11 tables.

Figures (12)

  • Figure 1: The trainable feature quantization layer is integrated into a neural network to learn task-specific compressions for each input feature: (a) During training, the quantization layer (green rectangle) and the neural network are trained jointly on a remote server. The layer gives rise to an encoder-decoder composition $\mathrm{Q}_i = \mathrm{D}_i \circ \mathrm{E}_i$ for the $i$-th input feature. (b) During inference, the encoder $\mathrm{E}_i$ is used to encode the $i$-th feature on the resource-constrained device using lightweight coding logic. The compressed features are then sent to the server, where the decoder $\mathrm{D}_i$ is used to decode the $i$-th feature. All decoded features are then used as input for the remaining neural network, which is executed on the remote server.
  • Figure 2: Schematic overview: Soft Quantization ($\mathrm{Q}^{\mathrm{s}}$) and Bitwise Soft Quantization ($\mathrm{Q}^{\mathrm{bw},\mathrm{s}}$) are formed by summing and concatenating, respectively, multiple soft step functions. Both are differentiable with respect to the thresholds, enabling optimization during training. During inference, they are converted into Hard Quantization ($\mathrm{Q}$) and Bitwise Quantization ($\mathrm{Q}^{\mathrm{bw}}$) via rounding.
  • Figure 3: Example of a quantization with bit width $n=2$ (i.e., $M=3$ thresholds) using minmax and quantile quantization on artificial skewed data. Left: Histogram of input data together with thresholds for minmax and quantile quantization. Center: Minmax and quantile quantization functions. Right: Histogram of quantized values for minmax and quantile quantization.
  • Figure 4: MSE per dataset and quantization level for all methods, using the best-performing hyperparameter setting and averaged over 10 data splits. The red line marks the average over the full-precision measurements.
  • Figure 5: Visualization of different temperature schedules. All schedules start with a temperature value $\tau_{\mathrm{init}} = 1$ and end with a temperature value $\tau_{\mathrm{end}}$ after $n_{\mathrm{epochs}}$ epochs.
  • ...and 7 more figures
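
The two baseline threshold choices named in Figure 3 can be sketched as follows. The paper's exact definitions are not part of this excerpt, so the code assumes the common conventions: minmax places the $M = 2^n - 1$ thresholds at equal spacing between the minimum and maximum of the data, and quantile places them at the empirical $k/(M+1)$ quantiles. The function names and the exponential toy data are hypothetical and serve only as an illustration of skewed input.

```python
import numpy as np

def minmax_thresholds(values, n_bits):
    """Assumed minmax rule: M equally spaced interior thresholds."""
    m = 2 ** n_bits - 1
    lo, hi = np.min(values), np.max(values)
    return np.linspace(lo, hi, m + 2)[1:-1]

def quantile_thresholds(values, n_bits):
    """Assumed quantile rule: thresholds at the k/(M+1) quantiles."""
    m = 2 ** n_bits - 1
    probs = np.arange(1, m + 1) / (m + 1)
    return np.quantile(values, probs)

def bin_counts(values, thresholds):
    """Number of samples that fall into each of the M + 1 bins."""
    codes = np.searchsorted(thresholds, values, side="left")
    return np.bincount(codes, minlength=len(thresholds) + 1)

# Artificial skewed data (hypothetical stand-in for the data in Figure 3).
rng = np.random.default_rng(0)
data = rng.exponential(size=10_000)

n_bits = 2  # bit width n = 2, i.e. M = 3 thresholds and 4 bins
counts_minmax = bin_counts(data, minmax_thresholds(data, n_bits))
counts_quantile = bin_counts(data, quantile_thresholds(data, n_bits))
print("minmax bin counts:  ", counts_minmax)
print("quantile bin counts:", counts_quantile)
```

On skewed data, the equally spaced thresholds leave most samples in the first bin, whereas the quantile thresholds spread the samples almost evenly across the bins. Neither rule takes the downstream task into account, which is the gap the trainable thresholds are meant to address.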