EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Jiangning Zhang, Xiangtai Li, Yabiao Wang, Chengjie Wang, Yibo Yang, Yong Liu, Dacheng Tao

TL;DR

EATFormer reframes Vision Transformers through Evolutionary Algorithms, deriving a coherent mapping between EA operators and Transformer components. It introduces a pyramid EATFormer backbone composed solely of the EAT block—encompassing Multi-Scale Region Aggregation, Global and Local Interaction, and FFN—augmented by a Modulated Deformable MSA and a Task-Related Head for flexible task fusion. Across ImageNet, COCO, and ADE20K, EATFormer variants demonstrate superior efficiency and competitive accuracy, outperforming several state-of-the-art peers while reducing parameters and FLOPs. Extensive ablations and explanatory analyses validate the contributions of MSRA, GLI, MD-MSA, and TRH, highlighting the practical impact of incorporating EA-inspired global/local modeling and deformable attention into vision backbones.

Abstract

Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical evolutionary algorithm (EA) and derives that both share a consistent mathematical formulation. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that contains only the proposed EA-based transformer (EAT) block, which consists of three residual parts, i.e., multi-scale region aggregation (MSRA), global and local interaction (GLI), and feed-forward network (FFN) modules, to model multi-scale, interactive, and individual information, respectively. Moreover, we design a task-related head (TRH) docked with the transformer backbone to complete the final information fusion more flexibly, and we improve a modulated deformable MSA (MD-MSA) to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification and downstream tasks, together with explanatory experiments, demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. For example, our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; Mask R-CNN equipped with EATFormer-Tiny/Small/Base obtains 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP, respectively, with fewer FLOPs; and our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
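
The three-residual layout of the EAT block described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in the repository linked above). It assumes a 1-D sequence of token vectors instead of a 2-D feature map, and it swaps in simple parameter-free stand-ins for the learned modules: neighbour averaging at several dilations for MSRA, a channel split into dot-product attention and local averaging for GLI, and an element-wise `tanh` for the FFN. All function names and the `global_channels` parameter are hypothetical and chosen only for illustration.

```python
import math


def residual(f, x):
    """Apply y = f(x) + x to a list of token vectors."""
    fx = f(x)
    return [[u + v for u, v in zip(fr, xr)] for fr, xr in zip(fx, x)]


def msra_standin(x, dilations=(1, 2, 3)):
    """Stand-in for multi-scale aggregation: average each token with its
    neighbours at several dilations, then average over the dilations."""
    n, c = len(x), len(x[0])
    out = []
    for i in range(n):
        acc = [0.0] * c
        for d in dilations:
            left, right = x[max(i - d, 0)], x[min(i + d, n - 1)]
            for k in range(c):
                acc[k] += (left[k] + x[i][k] + right[k]) / 3.0
        out.append([v / len(dilations) for v in acc])
    return out


def gli_standin(x, global_channels):
    """Stand-in for global/local interaction: the first `global_channels`
    channels (must be >= 1) go through softmax dot-product attention over
    all tokens; the remaining channels are averaged with adjacent tokens."""
    n = len(x)
    g = [row[:global_channels] for row in x]
    loc = [row[global_channels:] for row in x]
    scale = 1.0 / math.sqrt(global_channels)
    g_out = []
    for i in range(n):
        scores = [scale * sum(a * b for a, b in zip(g[i], g[j])) for j in range(n)]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        g_out.append([sum(w[j] * g[j][k] for j in range(n)) / z
                      for k in range(global_channels)])
    l_out = []
    for i in range(n):
        nb = [loc[max(i - 1, 0)], loc[i], loc[min(i + 1, n - 1)]]
        l_out.append([sum(v[k] for v in nb) / 3.0 for k in range(len(loc[i]))])
    # Concatenate the two branches back to the original channel count.
    return [a + b for a, b in zip(g_out, l_out)]


def ffn_standin(x):
    """Stand-in for the per-token feed-forward network."""
    return [[math.tanh(v) for v in row] for row in x]


def eat_block(x, global_channels):
    """Three residual parts in sequence: multi-scale, interactive, individual."""
    x = residual(msra_standin, x)
    x = residual(lambda t: gli_standin(t, global_channels), x)
    x = residual(ffn_standin, x)
    return x


tokens = [[0.1 * (i + k) for k in range(4)] for i in range(5)]  # 5 tokens, 4 channels
out = eat_block(tokens, global_channels=2)
print(len(out), len(out[0]))  # 5 4
```

Each stage preserves the token count and channel width, which is what lets the three residual parts be stacked and repeated inside a pyramid backbone.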

Paper Structure (45 sections, 6 equations, 13 figures, 14 tables)

Figures (13)

  • Figure 1: Structural analogy between (a) the Transformer module and (b) the evolutionary algorithm, which share analogous concepts of 1) individual definition (Token Embedding vs. Individual), 2) global information interaction (MSA vs. Crossover), 3) individual feature enhancement (FFN vs. Mutation), 4) feature inheritance (Skip Connection vs. Population Succession), etc. (c) Intuitive illustration of some EA variants.
  • Figure 2: Paradigm of the proposed basic EAT Block, which contains three $y = f(x) + x$ residuals to model: (a) multi-scale information aggregation, (b) feature interactions among tokens, and (c) individual enhancement.
  • Figure 3: Comparison with SOTAs in terms of Top-1 vs. GPU throughput. All models are trained only on the ImageNet-1K dataset at 224$\times$224 resolution, and the radius represents the relative number of parameters.
  • Figure 5: Structure of the proposed MD-MSA.
  • Figure 6: Structure of the proposed Task-Related Head.
  • ...and 8 more figures