Attention Via Convolutional Nearest Neighbors
Mingi Kang, Jeová Farias Sales Rocha Neto
TL;DR
ConvNN unifies convolution and self-attention under a single $k$-NN neighborhood aggregation framework, recovering both as special cases and enabling interpolation along a spectrum between local spatial and global feature-based neighbor selection. The method introduces similarity-based neighbor selection, a weighted aggregation via $\rho$, and a Conv1D-based fusion, with extensions for positional encoding, sparse search, and a hybrid branching layer that blends ConvNN with standard convolution. Empirical results on CIFAR-10/100 show consistent gains from hybrid branching in VGG-11, and performance in ViT-Tiny that is competitive with, and often superior to, attention baselines, with notable regularization benefits. The work offers a principled perspective on CNNs and Transformers, suggesting new avenues for interpretable and efficient vision architectures that balance locality and global context.
Abstract
The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single $k$-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and aggregation: convolution selects neighbors by spatial proximity, while attention selects by feature similarity, which places them on a continuous spectrum. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. Crucially, ConvNN serves as a drop-in replacement for convolutional and attention layers, enabling systematic exploration of the intermediate spectrum between these two extremes. We validate the framework's coherence on CIFAR-10 and CIFAR-100 classification tasks across two complementary architectures: (1) hybrid branching in VGG improves accuracy on both CIFAR datasets by combining spatial-proximity and feature-similarity selection; and (2) ConvNN in ViT outperforms standard attention and other attention variants on both datasets. Extensive ablations on $k$ values and architectural variants reveal that interpolating along this spectrum provides regularization benefits by balancing local and global receptive fields. Our work provides a unifying framework that dissolves the apparent distinction between convolution and attention, with implications for designing more principled and interpretable vision architectures.
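
To make the neighbor-selection view concrete, the following is a minimal NumPy sketch of $k$-NN aggregation over a 1D sequence of tokens. It is an illustration under our own simplifying assumptions, not the ConvNN implementation: the function names `select_neighbors` and `aggregate`, the use of cosine distance, and the per-rank weight vector are all hypothetical choices made for this example. Only the selection rule changes between the two modes; the aggregation step is shared.

```python
import numpy as np

def select_neighbors(x, k, mode):
    """Return the indices of the k nearest neighbors of every token.

    x    : (n, d) array of n tokens with d features each.
    mode : "spatial" ranks neighbors by index distance (convolution-like);
           "feature" ranks them by cosine distance (attention-like).
    Both function name and modes are hypothetical, for illustration only.
    """
    n = x.shape[0]
    if mode == "spatial":
        pos = np.arange(n)
        dist = np.abs(pos[:, None] - pos[None, :]).astype(float)
    elif mode == "feature":
        xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
        dist = 1.0 - xn @ xn.T  # cosine distance between all token pairs
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Stable sort so ties are broken by position, keeping results reproducible.
    return np.argsort(dist, axis=1, kind="stable")[:, :k]  # shape (n, k)

def aggregate(x, idx, weights):
    """Weighted sum over each token's selected neighbors.

    weights : (k,) vector shared across positions, indexed by neighbor rank.
    """
    neighbors = x[idx]  # shape (n, k, d)
    return np.einsum("k,nkd->nd", weights, neighbors)

# Spatial selection with k = 3 picks each interior token's local window.
x = np.arange(10, dtype=float).reshape(5, 2)
idx_spatial = select_neighbors(x, k=3, mode="spatial")
out_spatial = aggregate(x, idx_spatial, np.full(3, 1.0 / 3.0))

# Feature selection can pick a distant token when its features match.
y = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
idx_feature = select_neighbors(y, k=2, mode="feature")
```

With uniform weights, the spatial mode reduces to a moving average over a window of three tokens for interior positions, whereas in the feature mode the first token selects itself and the last token, which lies outside any local window of that size. Note that the sketch ties weights to neighbor rank, a simplification relative to a true convolution kernel, which ties weights to relative spatial offset.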
