FFNet: MetaMixer-based Efficient Convolutional Mixer Design
Seokju Yun, Dongheon Lee, Youngmin Ro
TL;DR
This paper proposes MetaMixer, a general mixer architecture that keeps the query-key-value framework of the Transformer but leaves the sub-operations inside that framework unspecified. The authors instantiate it by "FFNifying" self-attention: dot-product interactions and softmax are replaced with large-kernel depthwise convolutions and GELU, which yields FFNified attention, a fast, purely convolutional token mixer. The Fast-Forward Network (FFNet) family combines FFNified attention with ConvNeXt-style channel mixing to deliver competitive performance across image recognition, segmentation, detection, super-resolution, 3D perception, and time-series forecasting, while achieving superior speed on diverse devices. The results suggest that MetaMixer can serve as a unifying, efficient backbone design that draws on both attention-inspired and convolution-inspired mechanisms, with practical relevance for resource-constrained applications and multi-domain tasks.
Abstract
The Transformer, composed of self-attention and a Feed-Forward Network (FFN), has revolutionized the landscape of network design across various vision tasks. While self-attention has been extensively explored as a key factor in performance, the FFN has received little attention. The FFN is a versatile operator that is seamlessly integrated into nearly all AI models to effectively harness rich representations. Recent works also show that the FFN functions like a key-value memory. Thus, akin to the query-key-value mechanism within self-attention, the FFN can be viewed as a memory network, where the input serves as the query and the two projection weights operate as keys and values, respectively. Based on these observations, we hypothesize that the query-key-value framework itself is what matters for competitive performance. To verify this, we propose converting self-attention into a more FFN-like, efficient token mixer built only from convolutions while retaining the query-key-value framework, a process we call FFNification. Specifically, FFNification replaces query-key-value interactions with large-kernel convolutions and adopts the GELU activation function instead of softmax. The derived token mixer, FFNified attention, serves as a key-value memory for detecting locally distributed spatial patterns, and operates in the opposite dimension to the ConvNeXt block within each corresponding sub-operation of the query-key-value framework. Building upon these two modules, we present a family of Fast-Forward Networks (FFNet). Despite being composed of only simple operators, FFNet outperforms sophisticated and highly specialized methods in each domain, with notable efficiency gains. These results validate our hypothesis, leading us to propose MetaMixer, a general mixer architecture that does not specify sub-operations within the query-key-value framework.
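The key-value-memory reading of the FFN, and the way FFNification moves the same pattern to the token axis, can be illustrated with a small pure-Python sketch. This is not the authors' implementation. It assumes a bias-free FFN, a 1D token sequence instead of a 2D feature map, zero padding, odd kernel lengths, and one key kernel and one value kernel per channel; the function names (`ffn_as_memory`, `depthwise_conv1d`, `ffnified_attention_1d`) are hypothetical and chosen only for illustration.

```python
import math


def gelu(x):
    # Exact GELU, written with the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def ffn_as_memory(query, keys, values):
    """Bias-free FFN read as a memory: sum_i GELU(query . key_i) * value_i.

    query  : one token's channel vector (the FFN input).
    keys   : rows of the first projection, one per hidden unit.
    values : rows of the second projection, one per hidden unit.
    """
    out = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        # Unnormalized match score between the query and one memory key.
        score = gelu(sum(q * k for q, k in zip(query, key)))
        for j, v in enumerate(value):
            out[j] += score * v
    return out


def depthwise_conv1d(x, kernel):
    """Zero-padded, same-length sliding-window filter over one channel.

    Assumes an odd kernel length (an assumption of this sketch).
    """
    pad = len(kernel) // 2
    n = len(x)
    out = []
    for t in range(n):
        acc = 0.0
        for i, w in enumerate(kernel):
            s = t + i - pad
            if 0 <= s < n:
                acc += w * x[s]
        out.append(acc)
    return out


def ffnified_attention_1d(x, key_kernels, value_kernels):
    """Assumed form of an FFNified token mixer on a 1D sequence.

    x is a list of channels, each a list of token values. Per channel, the
    input (query) is matched against a key kernel, GELU replaces softmax,
    and a value kernel aggregates the resulting scores along the token axis.
    """
    out = []
    for channel, k, v in zip(x, key_kernels, value_kernels):
        scores = [gelu(s) for s in depthwise_conv1d(channel, k)]
        out.append(depthwise_conv1d(scores, v))
    return out
```

In this sketch the FFN matches a token against keys along the channel dimension, while the FFNified mixer matches each channel against kernels along the token dimension, which is one way to read the abstract's remark that the two modules operate in opposite dimensions. A short usage example: `ffnified_attention_1d([[1.0, -1.0, 2.0]], [[0.0, 1.0, 0.0]], [[0.0, 1.0, 0.0]])` uses identity kernels, so the result is simply GELU applied to each token.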
