Squeeze-and-Remember Block

Rinor Cakaj, Jens Mehnert, Bin Yang

TL;DR

The Squeeze-and-Remember (SR) block adds a dynamic memory-like mechanism to CNNs by squeezing the input with a $1 \times 1$ convolution, recalling learned high-level features through an FCN-guided weighting of memory blocks, and adding the resulting memory to the original feature map. This design enables context-aware feature augmentation in non-sequential image tasks, yielding measurable gains on ImageNet (top-1) and Cityscapes (mIoU) with modest parameter and compute overhead. Empirical results across CIFAR, ImageNet, and Cityscapes demonstrate consistent improvements, especially when combined with regularizers like dropout2d and SE/CBAM blocks, while analyses reveal class-dependent memory utilization. The work positions SR as a complementary alternative to recalibration-based attention, expanding CNNs’ capability to remember and reuse learned features for improved inference in diverse visual tasks.
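
The three stages named above (squeeze, remember, add) can be illustrated with a minimal forward-pass sketch in NumPy. It is not the authors' implementation. The class name `SRBlockSketch`, the squeezed channel count, the single hidden layer with ReLU inside the FCN, the random initialization, and the choice of memory blocks with the same shape as the input feature map are all assumptions made for illustration. The softmax over memory blocks follows the softmax activations mentioned in the figure captions.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class SRBlockSketch:
    """Forward pass only. All sizes and initial values are illustrative assumptions."""

    def __init__(self, channels, height, width,
                 squeezed=4, hidden=16, num_blocks=10, seed=0):
        rng = np.random.default_rng(seed)
        # Squeeze: a 1x1 convolution is a per-pixel linear map over channels.
        self.w_squeeze = rng.normal(0.0, 0.1, (squeezed, channels))
        # Remember: a small FCN (assumed: one hidden layer) scores the memory blocks.
        flat = squeezed * height * width
        self.w1 = rng.normal(0.0, 0.1, (flat, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, num_blocks))
        self.b2 = np.zeros(num_blocks)
        # Memory blocks M_1..M_P, assumed to share the shape of one feature map X.
        self.memory = rng.normal(0.0, 0.1, (num_blocks, channels, height, width))

    def forward(self, x):
        """x has shape (N, C, H, W); returns (x_hat, weights)."""
        # (1) Squeeze: X -> X_bar with fewer channels.
        x_bar = np.einsum("sc,nchw->nshw", self.w_squeeze, x)
        # (2) Remember: flatten X_bar, compute one weight per memory block.
        flat = x_bar.reshape(x.shape[0], -1)
        h = np.maximum(flat @ self.w1 + self.b1, 0.0)
        weights = softmax(h @ self.w2 + self.b2)  # shape (N, P)
        # Weighted sum of memory blocks gives M_bar.
        m_bar = np.einsum("np,pchw->nchw", weights, self.memory)
        # (3) Add: X_hat = X + M_bar.
        return x + m_bar, weights


block = SRBlockSketch(channels=8, height=4, width=4)
x = np.random.default_rng(1).normal(size=(2, 8, 4, 4))
x_hat, weights = block.forward(x)
```

In a trained network the squeeze weights, the FCN, and the memory blocks would be learned jointly with the rest of the CNN. Here they are random, so the sketch only demonstrates the data flow and tensor shapes.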

Abstract

Convolutional Neural Networks (CNNs) are important for many machine learning tasks. They are built with different types of layers: convolutional layers that detect features, dropout layers that help to avoid over-reliance on any single neuron, and residual layers that allow the reuse of features. However, CNNs lack a dynamic feature retention mechanism similar to the human brain's memory, limiting their ability to use learned information in new contexts. To bridge this gap, we introduce the "Squeeze-and-Remember" (SR) block, a novel architectural unit that gives CNNs dynamic memory-like functionalities. The SR block selectively memorizes important features during training, and then adaptively re-applies these features during inference. This improves the network's ability to make contextually informed predictions. Empirical results on the ImageNet and Cityscapes datasets demonstrate the SR block's efficacy: integration into ResNet50 improved top-1 validation accuracy on ImageNet by 0.52% over dropout2d alone, and its application in DeepLab v3 increased mean Intersection over Union on Cityscapes by 0.20%. These improvements are achieved with minimal computational overhead. These results show the SR block's potential to enhance the capabilities of CNNs in image processing tasks.

Paper Structure (32 sections, 1 equation, 7 figures, 5 tables)

This paper contains 32 sections, 1 equation, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The Squeeze-and-Remember (SR) Block Architecture, operating in three stages: (1) "Squeeze": Applies a $1 \times 1$ convolution to input feature map $X$, yielding a reduced feature map $\bar{X}$ that retains essential characteristics of $X$. (2) "Remember": In this stage, $\bar{X}$ is flattened and passed through a Fully Connected Network (FCN) for weight calculation, followed by the computation of a weighted sum from memory blocks $M_1$ to $M_P$. These memory blocks represent a feature-spanning set, designed to capture a comprehensive range of features. The computed weighted sum, $\bar{M}$, contains high-level, possibly undetected or enhanced features such as cliff textures or architectural details. (3) "Add": This final step adds $\bar{M}$ to $X$, producing the output feature map $\hat{X}$.
  • Figure 2: Class-Conditional FCN Activation Patterns: Softmax activation means and standard deviations across ten memory blocks for "cliff", "pug", and "church" illustrate different strategies of memory usage. The "cliff" class shows diverse but consistent activation in contrast to "pug" and "church".
  • Figure 3: Different Impact of SR Block on Feature Maps: The figure illustrates class-dependent feature map transformations for "Cliff", "Pug", and "Church" classes in the SR block. It contrasts channel-wise activations before and after SR processing, with Mean Absolute Differences.
  • Figure 4: Feature Encoding in SR-Enhanced ResNet50: The panels $M_1$ through $M_{10}$ show the average channel activity in the ten memory blocks of the SR block. The variation in activation patterns demonstrates the memory blocks' complex encoding capabilities, with each block selectively capturing different aspects of the feature spectrum.
  • Figure 5: This figure shows the means and standard deviations of softmax activations for "all", reflecting the activation across the validation set, and for specific classes, including "goldfish", "cliff", "plane", "pug", and "church". It illustrates the distinct memory usage within a ResNet50 using an SR block with two memory blocks and trained with dropout2d. Notably, "cliff" shows a wide range of activation behaviors that differ from the more uniform patterns of "pug" and "church", highlighting the ability of the FCN to adapt feature processing in a class-dependent manner.
  • ...and 2 more figures
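
The statistics named in the captions of Figures 2, 3, and 5 (per-class means and standard deviations of the softmax activations, and mean absolute differences between feature maps before and after the SR block) could be computed along the following lines. This is a hedged sketch: the function names `class_conditional_stats` and `channel_mean_abs_diff` are hypothetical, and the averaging axes are an assumption, since the captions do not specify them.

```python
import numpy as np


def class_conditional_stats(weights, labels):
    """Per-class mean and standard deviation of memory-block weights.

    weights: (N, P) softmax outputs of the FCN, one row per image.
    labels:  (N,) integer class labels.
    Returns a dict mapping class id -> (mean of shape (P,), std of shape (P,)).
    """
    stats = {}
    for cls in np.unique(labels):
        w = weights[labels == cls]
        stats[int(cls)] = (w.mean(axis=0), w.std(axis=0))
    return stats


def channel_mean_abs_diff(x_before, x_after):
    """Mean absolute difference per channel for feature maps of shape (N, C, H, W).

    Assumption: the difference is averaged over the batch and spatial axes.
    """
    return np.abs(x_after - x_before).mean(axis=(0, 2, 3))


# Toy data: three images, two memory blocks, two classes.
weights = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]])
labels = np.array([0, 0, 1])
stats = class_conditional_stats(weights, labels)

# Toy feature maps: only channel 1 changes, by 0.5 everywhere.
x_before = np.zeros((1, 2, 2, 2))
x_after = x_before.copy()
x_after[:, 1] += 0.5
mad = channel_mean_abs_diff(x_before, x_after)
```

A class whose images consistently select the same memory blocks would show a small standard deviation here, while a class with varied memory usage would show a larger one.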