FFNet: MetaMixer-based Efficient Convolutional Mixer Design

Seokju Yun, Dongheon Lee, Youngmin Ro

TL;DR

This paper reframes the Transformer backbone around a generalized MetaMixer that encapsulates the query-key-value framework and demonstrates that explicit sub-operations can be left unspecified. By FFNifying self-attention—replacing dot-product interactions and softmax with depthwise convolutions and GELU, and employing large kernels—the authors derive FFNified attention, a fast, convolutional token mixer. The Fast-Forward Network (FFNet) family combines FFNified attention with ConvNeXt-style channel mixing to deliver competitive performance across image recognition, segmentation, detection, super-resolution, 3D perception, and time-series forecasting, while achieving superior speed on diverse devices. The results suggest MetaMixer as a unifying, efficient backbone that leverages both attention-inspired and convolution-inspired mechanisms, with practical impact on resource-constrained applications and multi-domain tasks.
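
The recipe in this summary (depthwise convolutions in place of dot-product interactions, GELU in place of softmax) can be illustrated with a minimal 1D sketch in plain Python. This is not the authors' implementation: the function names `depthwise_conv1d` and `ffnified_attention`, the zero "same" padding, the 1D setting, and the omission of any projections, normalization, or residual connections are all assumptions made for illustration.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def depthwise_conv1d(x, kernels):
    """Depthwise 1D cross-correlation with zero 'same' padding.

    x:       list of C channels, each a list of L floats
    kernels: list of C kernels, each a list of k floats (k odd)
    Each channel is filtered only by its own kernel, so no channel mixing occurs.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(channel))
        ])
    return out

def ffnified_attention(x, key_kernels, value_kernels):
    """Hypothetical sketch: query -> conv with keys -> GELU -> conv with values."""
    # Query-key interaction: static key kernels stand in for the Q K^T product.
    coeff = depthwise_conv1d(x, key_kernels)
    # GELU stands in for softmax when forming the coefficients.
    coeff = [[gelu(c) for c in row] for row in coeff]
    # Coefficient-value interaction: static value kernels aggregate the result.
    return depthwise_conv1d(coeff, value_kernels)

# Toy usage: 2 channels, 5 tokens, kernel size 3 (real models would use larger kernels).
x = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 0.0, 1.0, 0.0, 0.0]]
keys = [[0.25, 0.5, 0.25], [-1.0, 2.0, -1.0]]
values = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
mixed = ffnified_attention(x, keys, values)
```

Under these assumptions the keys and values are input-independent kernel weights, so the cost of this mixer grows linearly with the number of tokens rather than quadratically as in dot-product attention.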

Abstract

Transformer, composed of self-attention and Feed-Forward Network, has revolutionized the landscape of network design across various vision tasks. While self-attention is extensively explored as a key factor in performance, FFN has received little attention. FFN is a versatile operator seamlessly integrated into nearly all AI models to effectively harness rich representations. Recent works also show that FFN functions like key-value memories. Thus, akin to the query-key-value mechanism within self-attention, FFN can be viewed as a memory network, where the input serves as query and the two projection weights operate as keys and values, respectively. Based on these observations, we hypothesize that the importance lies in query-key-value framework itself for competitive performance. To verify this, we propose converting self-attention into a more FFN-like efficient token mixer with only convolutions while retaining query-key-value framework, namely FFNification. Specifically, FFNification replaces query-key-value interactions with large kernel convolutions and adopts GELU activation function instead of softmax. The derived token mixer, FFNified attention, serves as key-value memories for detecting locally distributed spatial patterns, and operates in the opposite dimension to the ConvNeXt block within each corresponding sub-operation of the query-key-value framework. Building upon the above two modules, we present a family of Fast-Forward Networks (FFNet). Despite being composed of only simple operators, FFNet outperforms sophisticated and highly specialized methods in each domain, with notable efficiency gains. These results validate our hypothesis, leading us to propose MetaMixer, a general mixer architecture that does not specify sub-operations within the query-key-value framework.
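
The key-value reading of the FFN described in the abstract can be made concrete with a small sketch: the input acts as the query, the rows of the first projection act as keys, and the rows of the second projection act as values. The function name `ffn_as_memory`, the omission of bias terms, and the toy weights are assumptions for illustration, not details taken from the paper.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn_as_memory(query, keys, values):
    """View a two-layer FFN (biases omitted) as a key-value memory.

    query:  input token, a list of d floats
    keys:   m rows of d floats (first projection weights)
    values: m rows of d_out floats (second projection weights)
    Returns the output token and the per-key coefficients.
    """
    # Each coefficient measures how strongly the query matches one key.
    coeffs = [gelu(sum(q * k for q, k in zip(query, key))) for key in keys]
    # The output is the coefficient-weighted sum of the value rows.
    d_out = len(values[0])
    out = [sum(c * v[j] for c, v in zip(coeffs, values)) for j in range(d_out)]
    return out, coeffs

# Toy usage: the query matches the first key only, so only the first value contributes.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 3.0]]
out, coeffs = ffn_as_memory(query, keys, values)
```

Written this way, the FFN differs from self-attention mainly in that its keys and values are learned weights rather than functions of the input, and in using GELU rather than softmax to form the coefficients.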

Paper Structure (26 sections, 2 equations, 14 figures, 11 tables, 2 algorithms)

This paper contains 26 sections, 2 equations, 14 figures, 11 tables, 2 algorithms.

Figures (14)

  • Figure 1: Overview of MetaMixer. (a) MetaMixer is derived by not specifying sub-operations within the query-key-value framework. We assert that the competence of Transformers primarily originates from MetaMixer, which we deem the true backbone of the Transformer. (b) To demonstrate this and propose an FFN-like efficient token mixer, we replace the inefficient sub-operations of self-attention with those from FFN within the MetaMixer structure.
  • Figure 2: Key-Value Mechanism of FFNs. (a) Coefficient Sparsity: Astonishingly, the final stage shows significantly higher sparsity. To categorize numerous classes, only a small subset ($<10\%$) of neurons is activated. (b) Coefficient Map corresponding to the most activated key in the last layer: Keys specialized for each class selectively correlate with the target regions, suggesting their potential role in capturing distinctive visual features. The numbers in parentheses indicate the average values of the coefficients.
  • Figure 3: Overview of FFNification and Fast-Forward Network block. (a-b) Comparison between self-attention and FFNified attention; (c) Our mixer design easily adapts by selecting the convolution type and kernel size based on the modality.
  • Figure 4: Examples of coefficient maps corresponding to the most activated key from the mixers in the final block of FFNet-3.
  • Figure 5: Visualization of the FFN's coefficient map in the last layer of PoolFormer-M36 (Yu et al., 2021). (a) We visualize coefficient maps corresponding to the most activated keys for each class, where class-specific keys consistently correlate with decisive locations. The numbers in parentheses indicate the average values of the coefficients. (b) We identify keys that detect concepts shared across classes, such as animal species and machine parts, revealing that the keys capture patterns in the input in a hierarchical manner. Specifically, $K_{226}$ is mostly inactive except in classes that include wheels.
  • ...and 9 more figures