HyperZ$\cdot$Z$\cdot$W Operator Connects Slow-Fast Networks for Full Context Interaction

Harvie Zhang

TL;DR

The paper confronts the limits of residual learning and the cost of attention by introducing a slow-fast paradigm: a coordinate-based implicit MLP (the slow network) generates large hyper-kernels, which a fast CNN applies through the HyperZZW operator to achieve full context interaction at every layer. The Terminator architecture assembles a nine-branch SFNE block from global and local hyper-kernels, several gating and interaction modules, and a bottleneck, and trains it with a slow neural loss that provides local feedback to the slow kernel generator. Key innovations include the HyperZZW operator, which relies on neither dot-product attention nor pooling, and a normalization-free standardization scheme that yields stable, zero-mean features and faster convergence. Across pixel-level 1D and 2D benchmarks, Terminator demonstrates state-of-the-art performance with far fewer parameters and without residual connections, highlighting its potential for efficient long-range modeling in vision tasks.

Abstract

The self-attention mechanism utilizes large implicit weight matrices, programmed through dot product-based activations with very few trainable parameters, to enable long sequence modeling. In this paper, we investigate the possibility of discarding residual learning by employing large implicit kernels to achieve full context interaction at each layer of the network. To accomplish this, we introduce coordinate-based implicit MLPs as a slow network to generate hyper-kernels for another fast convolutional network. To obtain context-varying weights for fast dynamic encoding, we propose a $\mathrm{Hyper}\mathcal{Z{\cdot}Z{\cdot}W}$ operator that connects hyper-kernels ($\mathcal{W}$) and hidden activations ($\mathcal{Z}$) through simple elementwise multiplication, followed by convolution of $\mathcal{Z}$ using the context-dependent $\mathcal{W}$. Based on this design, we present a novel Terminator architecture that integrates hyper-kernels of different sizes to produce multi-branch hidden representations for enhancing the feature extraction capability of each layer. Additionally, a bottleneck layer is employed to compress the concatenated channels, allowing only valuable information to propagate to the subsequent layers. Notably, our model incorporates several innovative components and exhibits excellent properties, such as introducing local feedback error for updating the slow network, stable zero-mean features, faster training convergence, and fewer model parameters. Extensive experimental results on pixel-level 1D and 2D image classification benchmarks demonstrate the superior performance of our architecture.
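
The two steps described in the abstract (a slow coordinate-based MLP emits a hyper-kernel, then the fast path multiplies it elementwise with the activations and convolves the activations with the result) can be sketched in a few lines of NumPy. This is only an illustrative 1D sketch under stated assumptions, not the paper's implementation: the function names `slow_kernel` and `hyper_zzw_1d`, the sine activation, the hidden width, the random untrained weights, and the per-channel (depthwise) layout are all hypothetical choices made for the example.

```python
import numpy as np


def slow_kernel(n, channels, hidden=16, seed=0):
    """Hypothetical slow network: a tiny coordinate MLP.

    Maps n positions in [-1, 1] to one kernel value per channel, giving a
    hyper-kernel W of shape (channels, n). Weights are random and untrained;
    the sine activation is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    coords = np.linspace(-1.0, 1.0, n).reshape(n, 1)
    w1 = rng.normal(size=(1, hidden))
    b1 = rng.normal(size=hidden)
    w2 = rng.normal(size=(hidden, channels)) / np.sqrt(hidden)
    h = np.sin(coords @ w1 + b1)        # (n, hidden)
    return (h @ w2).T                   # (channels, n)


def hyper_zzw_1d(z, w):
    """Sketch of the operator for activations z and a hyper-kernel w, both (C, N).

    Step 1: elementwise product makes the kernel depend on the input.
    Step 2: each channel of z is convolved with its own context-dependent kernel.
    Note: np.convolve flips the kernel (true convolution), whereas deep-learning
    frameworks usually implement cross-correlation under the same name.
    """
    w_ctx = z * w
    out = np.stack([
        np.convolve(z[c], w_ctx[c], mode="same") for c in range(z.shape[0])
    ])
    return w_ctx, out


if __name__ == "__main__":
    z = np.random.default_rng(1).normal(size=(4, 32))  # 4 channels, length 32
    w = slow_kernel(n=32, channels=4)                  # global kernel: same length as z
    w_ctx, out = hyper_zzw_1d(z, w)
    print(w_ctx.shape, out.shape)
```

Because the kernel itself is a function of the input, the output in this sketch is quadratic rather than linear in $\mathcal{Z}$: doubling the activations quadruples the result. That input dependence is what distinguishes the context-dependent kernel from an ordinary fixed convolution weight.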

Paper Structure

This paper contains 12 sections, 10 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Comparison between the residual network and our Terminator architecture. 1) Our Slow-Fast Neural Encoding (SFNE) block employs a multi-branch structure, eliminating the need for residual learning (Figure 3). 2) Our hidden layers do not utilize pooling layers for downsampling feature resolution. 3) We introduce a novel local feedback error for updating the slow network.
  • Figure 2: Visualization of the feature maps in each block. For ease of comparison, we enlarge the outputs of blocks 2 to 4 of ResNet-152.
  • Figure 3: The overall framework of our Slow-Fast Neural Encoding (SFNE) block, which utilizes channel mixers and multi-scale hyper-kernels (here $N$ denotes the input size) to construct a nine-branch structure. The ovals represent slow networks used to generate hyper-kernels, and the rectangles represent fast networks that interact directly with the input. The $\mathrm{Hyper}\mathcal{Z{\cdot}Z{\cdot}W}$ operator is formed by combining $\odot$, $\otimes$, and $\circledast$, enabling context-dependent fast weights. RGU and Si-GLU denote the recursive gated unit and the simplified gated linear unit, respectively.
  • Figure 4: Visualization of global hyper-kernels in each block. By performing elementwise multiplication between the sample activations and the global hyper-kernels $\mathbf{K}_g$, the model can effectively leverage context-dependent hyper-kernels $\hat{\mathbf{K}}_g$ to obtain pixel-level scores, especially when trained with the slow neural loss.
  • Figure 5: Diagrams illustrating the hyper-channel interaction and hyper interaction mechanisms proposed in our architecture.
  • ...and 1 more figure