Pixel Embedding: Fully Quantized Convolutional Neural Network with Differentiable Lookup Table

Hiroyuki Tokunaga, Joel Nicholls, Daria Vazhenina, Atsunori Kanemura

TL;DR

This work tackles the bottleneck of quantizing the first convolutional layer by introducing pixel embedding, a differentiable input embedding that maps 8-bit pixels to low-bit, trainable vectors via an embedding lookup table. Training proceeds end-to-end with backpropagation and a straight-through estimator, while inference merges the learned embeddings into a fixed $Q$-bit table, enabling fully quantized networks. On ImageNet, pixel embedding reduces the top-5 error gap caused by first-layer quantization to about 1%; on CIFAR-100, it reduces the top-1 error gap caused by quantizing both the first and last layers to slightly over 1%. FPGA-based inference is over 1.7x faster than with a floating-point first layer. Convergence is also more stable than with naive quantization, and the accuracy-versus-speed tradeoff is favorable, suggesting practical viability for on-device deep learning tasks.
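The training and merging steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the uniform quantizer on [0, 1], the clipping after each update, and the names `quantize`, `forward`, `ste_update`, and `merge_table` are all hypothetical. The table holds one float row of length $d$ per 8-bit pixel value.

```python
def quantize(x, q_bits):
    """Map x (clipped to [0, 1]) to an integer level in [0, 2**q_bits - 1]."""
    levels = (1 << q_bits) - 1
    x = min(max(x, 0.0), 1.0)
    return int(x * levels + 0.5)  # round half up


def forward(table, pixel, q_bits):
    """Quantized embedding of one 8-bit pixel value, rescaled to [0, 1]."""
    levels = (1 << q_bits) - 1
    return [quantize(v, q_bits) / levels for v in table[pixel]]


def ste_update(table, pixel, grad_out, lr):
    """One SGD step on the float row selected by `pixel`.

    Straight-through estimator: the quantizer's derivative is treated as 1,
    so the gradient w.r.t. the quantized output is applied to the float row.
    """
    row = table[pixel]
    for k, g in enumerate(grad_out):
        # Assumption: clip so the row stays inside the quantizer's range.
        row[k] = min(max(row[k] - lr * g, 0.0), 1.0)


def merge_table(table, q_bits):
    """Freeze the trained float table into a fixed integer Q-bit table."""
    return [[quantize(v, q_bits) for v in row] for row in table]
```

After training, only the integer table returned by `merge_table` would be needed at inference time; the float table and the gradient path can be discarded.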

Abstract

By quantizing network weights and activations to low bitwidth, we can obtain hardware-friendly and energy-efficient networks. However, existing quantization techniques utilizing the straight-through estimator and piecewise constant functions face the issue of how to represent originally high-bit input data with low-bit values. To fully quantize deep neural networks, we propose pixel embedding, which replaces each float-valued input pixel with a vector of quantized values by using a lookup table. The lookup table, that is, the low-bit representation of pixels, is differentiable and trainable by backpropagation. Such replacement of inputs with vectors is similar to word embedding in the natural language processing field. Experiments on ImageNet and CIFAR-100 show that pixel embedding reduces the top-5 error gap caused by quantizing the floating-point first layer to only 1% for the ImageNet dataset, and the top-1 error gap caused by quantizing the first and last layers to slightly over 1% for the CIFAR-100 dataset. The usefulness of pixel embedding is further demonstrated by inference time measurements, which show a speedup of over 1.7 times compared with a floating-point first layer.
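As a rough illustration of the lookup described in the abstract, the sketch below replaces every channel value of an image with its row from an integer table. The abstract does not say how channels are handled, so sharing one table across channels and concatenating the per-channel vectors are assumptions, as is the function name `embed_image`.

```python
def embed_image(image, table):
    """Replace each 8-bit channel value with its embedding row.

    image: H x W x C nested lists of ints in [0, 255]
    table: 256 rows, each a list of d integers in [0, 2**Q - 1]
    returns: H x W x (C * d) nested lists of Q-bit integers
    """
    return [
        [
            # Assumption: one table shared by all channels, rows concatenated.
            [level for value in pixel for level in table[value]]
            for pixel in row
        ]
        for row in image
    ]


# Toy table with d = 2 and Q = 2 (levels 0..3); the values are arbitrary.
toy_table = [[p % 4, p // 64] for p in range(256)]
toy_image = [[[0, 255, 70]]]  # a single RGB pixel
embedded = embed_image(toy_image, toy_table)
```

With this layout, a first convolution would see $C \cdot d$ low-bit input channels instead of $C$ high-bit ones, so it can be quantized like the intermediate layers.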

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: $d = 4$, $Q = 2$
  • Figure 2: $d = 5$, $Q = 1$
  • Figure 4: The validation curves of the four methods on the ImageNet dataset. Full precision refers to the network with a full-precision first layer and quantized intermediate layers. These plots show that pixel embedding not only outperforms naive quantization but also converges more stably.
  • Figure 5: Scatter plot of models, evaluated on the two objectives of test accuracy versus inference time. All models use ResNet-18 for ImageNet classification. The vertical dashed line indicates the inference time of the fastest model (input and weight quantization). The horizontal dashed line indicates the test accuracy of the most accurate model (full precision). Pixel embedding is close to the intersection of these dashed lines, so it is strong in both objectives.