Receptive Field Expanded Look-Up Tables for Vision Inference: Advancing from Low-level to High-level Tasks

Xi Zhang, Xiaolin Wu

TL;DR

This work addresses the efficiency-accuracy gap in LUT-based CNN inference by expanding the receptive field without increasing memory. It introduces RFE-LUT, which combines differentiable lattice vector quantization (LVQ) with per-dimension quantization, irregular dilated convolutions (IDC), and a U-shaped cascaded LUT (U-LUT) to capture both local detail and global context. The approach yields state-of-the-art gains among LUT-based methods for high-level vision tasks like nucleus and salient object segmentation, and strong results on low-level restoration such as image super-resolution, all with orders of magnitude smaller storage and faster runtimes than CNN baselines. This framework enables real-time, resource-efficient vision inference on mobile and embedded devices while preserving competitive accuracy.

Abstract

Recently, several look-up table (LUT) methods have been developed to greatly accelerate CNN inference, following the classical strategy of trading space for speed. However, these methods share a common drawback: the receptive field of the convolution kernels is limited because the table size grows combinatorially with the number of input pixels. This research aims to expand the CNN receptive field at a fixed table size, thereby improving the performance of LUT-driven fast CNN inference while maintaining the same space complexity. The main contribution is a novel approach to learning an optimal lattice vector quantizer that adaptively allocates quantization resolution across data dimensions according to their significance to the inference task. In addition, the lattice vector quantizer offers an inherently more accurate approximation of CNN kernels than the scalar quantizers used in current practice. Furthermore, we introduce other receptive field expansion strategies, including irregular dilated convolutions and a U-shaped cascaded LUT structure, designed to capture multi-level contextual information without inflating table size. Together, these innovations allow our approach to effectively balance speed, accuracy, and memory efficiency, demonstrating significant improvements over existing LUT methods.
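
To make the table-size constraint concrete, the following minimal sketch counts the entries of a LUT indexed by a window of quantized input pixels. The 17 quantization levels per input, the one-byte outputs, and the helper names `lut_entries` and `lut_bytes` are illustrative assumptions, not settings taken from the paper.

```python
def lut_entries(num_inputs: int, levels: int) -> int:
    """Number of distinct indices when each of `num_inputs` pixels
    is quantized to one of `levels` values."""
    return levels ** num_inputs


def lut_bytes(num_inputs: int, levels: int,
              outputs_per_entry: int = 1, bytes_per_output: int = 1) -> int:
    """Storage for a table that keeps `outputs_per_entry` values per index."""
    return lut_entries(num_inputs, levels) * outputs_per_entry * bytes_per_output


# Assumed setting: 17 levels per input pixel, one 1-byte output per entry.
for num_inputs in (4, 5, 9):
    size = lut_bytes(num_inputs, levels=17)
    print(f"{num_inputs} input pixels -> {size:,} bytes")
```

Under these assumptions, growing the window from 4 inputs (a $2\times2$ patch) to 9 inputs (a $3\times3$ patch) multiplies the table size by $17^5$, which is why directly enlarging the indexed window is impractical and the receptive field has to be expanded by other means.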

Paper Structure

This paper contains 35 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of uniform scalar quantization (USQ) and hexagonal lattice vector quantization (LVQ) in two dimensions. The left plot shows the square lattice used in USQ, whose Voronoi cells are axis-aligned squares. The right plot shows the hexagonal $A_2$ lattice used in LVQ, where the Voronoi cells form regular hexagons, achieving more efficient space filling and lower quantization error.
  • Figure 2: Comparison of regular dilated convolution (RDC) and irregular dilated convolution (IDC). In RDC, the dilation rate remains consistent across the convolution layers, resulting in uniformly spaced receptive fields. Conversely, IDC introduces variable dilation rates, enabling a flexible receptive field that captures both local and global contextual information.
  • Figure 3: Overview of the proposed U-shaped cascaded LUT framework. (a) End-to-end pipeline: RGB channels are first processed by a channel-wise LUT, then passed through a cascade of LUT pools with skip connections to aggregate long- and short-range context, producing the final prediction. (b) Structure of a LUT pool: several LUTs operate in parallel; each uses a distinct regular or irregular dilated convolution in the first layer to set the receptive field, followed by $1\times1$ layers; their outputs are averaged. (c) “LUT-ization” of a small CNN: responses of the trained LUT-Network are enumerated and stored in a 4-D LUT; a local $2\times2$ window is rotated by $\{0^\circ,90^\circ,180^\circ,270^\circ\}$ to enlarge the receptive field.
  • Figure 4: Qualitative comparison on DSB2018 (nucleus segmentation). Each row shows the input image, predictions from prior LUT methods (SR-LUT, MuLUT, DFC-LUT), our results, and the ground truth (GT). Compared with baselines, RFE-LUT produces sharper nuclear boundaries, fewer false positives in background regions, and better separation of touching nuclei, especially in crowded areas and along thin structures. Zoomed-in patches highlight improved boundary adherence and interior completeness, consistent with our lower HD and higher DSC/MIOU scores in the quantitative tables.
  • Figure 5: Visual comparison on the DUTS test set. From left to right: input image, predictions by prior LUT methods (SR-LUT, MuLUT, DFC-LUT), our results, and the ground-truth mask. Our results better preserve object boundaries and thin structures, suppress background clutter, and yield more complete salient regions, illustrating the benefit of receptive-field expansion with LVQ-based indexing. All methods are shown at the same resolution and without post-processing.
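
The space-filling claim in the Figure 1 caption can be checked numerically. The sketch below is a fixed, non-learned toy comparison, not the paper's differentiable LVQ: it quantizes uniformly distributed 2-D points with a unit-step square lattice and with a hexagonal $A_2$ lattice scaled to the same cell area (hence the same number of code points per unit area), then estimates the mean squared error by Monte Carlo. The function names, sampling range, sample count, and seed are all assumptions made for illustration. For reference, theory gives $1/6 \approx 0.1667$ for the square lattice and $5/(18\sqrt{3}) \approx 0.1604$ for the hexagonal lattice at unit cell area.

```python
import math
import random


def quantize_square(x, y, step=1.0):
    """Uniform scalar quantization: round each coordinate independently."""
    return step * round(x / step), step * round(y / step)


# Basis of the hexagonal A2 lattice, scaled so each Voronoi cell has unit area
# (cell area = A^2 * sqrt(3) / 2), matching the unit-step square lattice.
A = math.sqrt(2.0 / math.sqrt(3.0))
B1 = (A, 0.0)
B2 = (A / 2.0, A * math.sqrt(3.0) / 2.0)


def quantize_hex(x, y):
    """Nearest point of the A2 lattice to (x, y)."""
    # Coordinates in the lattice basis: (x, y) = u * B1 + v * B2.
    v = y / B2[1]
    u = (x - v * B2[0]) / B1[0]
    u0, v0 = math.floor(u), math.floor(v)
    # The enclosing parallelogram splits into two equilateral triangles,
    # so the nearest lattice point is one of its four corners.
    best, best_d2 = None, float("inf")
    for du in (0, 1):
        for dv in (0, 1):
            px = (u0 + du) * B1[0] + (v0 + dv) * B2[0]
            py = (v0 + dv) * B2[1]
            d2 = (x - px) ** 2 + (y - py) ** 2
            if d2 < best_d2:
                best, best_d2 = (px, py), d2
    return best


def mean_squared_error(quantizer, n=50_000, seed=0):
    """Monte Carlo estimate of E||p - Q(p)||^2 for uniformly sampled points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, y = rng.uniform(-100.0, 100.0), rng.uniform(-100.0, 100.0)
        qx, qy = quantizer(x, y)
        total += (x - qx) ** 2 + (y - qy) ** 2
    return total / n


mse_square = mean_squared_error(quantize_square)
mse_hex = mean_squared_error(quantize_hex)
print(f"square lattice MSE:    {mse_square:.4f}")
print(f"hexagonal lattice MSE: {mse_hex:.4f}")
```

The gap is modest in two dimensions (about 4%), and the sketch covers only the geometric advantage of the lattice; the per-dimension allocation of quantization resolution described in the abstract is a separate, learned component that is not modeled here.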