HyperZ$\cdot$Z$\cdot$W Operator Connects Slow-Fast Networks for Full Context Interaction
Harvie Zhang
TL;DR
The paper confronts the limits of residual learning and expensive attention by introducing a slow-fast paradigm: a coordinate-based implicit MLP (the slow network) generates large hyper-kernels, which a fast convolutional network applies through the HyperZZW operator to achieve full context interaction at every layer. The Terminator architecture assembles a nine-branch SFNE block with global and local hyper-kernels, multiple gating and interaction modules, and a bottleneck, all trained with a slow neural loss that provides local feedback to the slow kernel generator. Key innovations include the HyperZZW operator, which does not rely on dot-product attention or pooling, and a normalization-free standardization scheme that yields stable, zero-mean features and faster convergence. Across pixel-level 1D and 2D benchmarks, Terminator reports state-of-the-art performance with far fewer parameters and without residual connections, highlighting its potential for efficient long-range modeling in vision tasks.
Abstract
The self-attention mechanism utilizes large implicit weight matrices, programmed through dot product-based activations with very few trainable parameters, to enable long sequence modeling. In this paper, we investigate the possibility of discarding residual learning by employing large implicit kernels to achieve full context interaction at each layer of the network. To accomplish this, we introduce coordinate-based implicit MLPs as a slow network to generate hyper-kernels for another fast convolutional network. To obtain context-varying weights for fast dynamic encoding, we propose a $\mathrm{Hyper}\mathcal{Z{\cdot}Z{\cdot}W}$ operator that connects hyper-kernels ($\mathcal{W}$) and hidden activations ($\mathcal{Z}$) through simple elementwise multiplication, followed by convolution of $\mathcal{Z}$ using the context-dependent $\mathcal{W}$. Based on this design, we present a novel Terminator architecture that integrates hyper-kernels of different sizes to produce multi-branch hidden representations, enhancing the feature extraction capability of each layer. Additionally, a bottleneck layer is employed to compress the concatenated channels, allowing only valuable information to propagate to the subsequent layers. Notably, our model incorporates several innovative components and exhibits excellent properties, such as a local feedback error for updating the slow network, stable zero-mean features, faster training convergence, and fewer model parameters. Extensive experimental results on pixel-level 1D and 2D image classification benchmarks demonstrate the superior performance of our architecture.
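The two steps of the operator described above (elementwise multiplication of $\mathcal{Z}$ and $\mathcal{W}$, then convolution of $\mathcal{Z}$ with the resulting context-dependent kernel) can be illustrated with a minimal single-channel 1D sketch. It is not the paper's implementation. The sine-activated random MLP, the circular convolution, the kernel length equal to the sequence length, and the function names `make_slow_mlp`, `hyper_kernel`, `circular_conv`, and `hyper_zzw` are all assumptions made for illustration.

```python
import math
import random


def make_slow_mlp(hidden=16, seed=0):
    """Hypothetical slow network: a random one-hidden-layer coordinate MLP.

    Maps a scalar coordinate to a scalar kernel value. The sine activation
    and the random (untrained) weights are assumptions of this sketch.
    """
    rng = random.Random(seed)
    w1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 1.0) / math.sqrt(hidden) for _ in range(hidden)]

    def mlp(x):
        return sum(w2[j] * math.sin(w1[j] * x + b1[j]) for j in range(hidden))

    return mlp


def hyper_kernel(mlp, length):
    """Evaluate the slow network on `length` evenly spaced coordinates in [-1, 1]."""
    if length == 1:
        return [mlp(0.0)]
    return [mlp(-1.0 + 2.0 * i / (length - 1)) for i in range(length)]


def circular_conv(z, k):
    """Circular convolution; every output position sees the whole sequence."""
    n = len(z)
    return [sum(z[(i - j) % n] * k[j] for j in range(n)) for i in range(n)]


def hyper_zzw(z, w):
    """Sketch of the two-step operator for one channel of a 1D signal."""
    if len(z) != len(w):
        raise ValueError("this sketch assumes the kernel is as long as the input")
    # Step 1: elementwise product makes the kernel depend on the current input.
    context_w = [zi * wi for zi, wi in zip(z, w)]
    # Step 2: convolve Z with the context-dependent kernel.
    return circular_conv(z, context_w)


if __name__ == "__main__":
    z = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, 0.25]   # hidden activations
    w = hyper_kernel(make_slow_mlp(), len(z))          # generated hyper-kernel
    print(hyper_zzw(z, w))
```

In this sketch the only parameters belong to the slow MLP, so the kernel can span the whole input without its parameter count growing with the input length. A practical version would operate on multi-channel tensors and would typically compute the global convolution with an FFT.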
