
AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards

David Smerkous, Zian Wang, Behzad Najafian

TL;DR

This work introduces AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. It also develops numerically stable mixed-precision Flash-style cluster attention kernels and mitigates sparse-stage representation collapse via deep supervision.

Abstract

Self-supervised pretraining has transformed computer vision by enabling data-efficient fine-tuning, yet high-resolution training typically requires server-scale infrastructure, limiting in-domain foundation model development for many research laboratories. Masked Autoencoders (MAE) reduce computation by encoding only visible tokens, but combining MAE with hierarchical downsampling architectures remains structurally challenging due to dense grid priors and mask-aware design compromises. We introduce AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. By discarding masked tokens and performing dynamic merging exclusively over visible tokens, AFFMAE removes dense-grid assumptions while preserving hierarchical scalability. We develop numerically stable mixed-precision Flash-style cluster attention kernels and mitigate sparse-stage representation collapse via deep supervision. On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and achieving faster training on a single RTX 5090. Code is available at https://github.com/najafian-lab/affmae.
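
The idea of discarding masked tokens and then merging only the visible ones can be illustrated with a minimal pure-Python sketch. This is not the AFFMAE implementation: the function names (`sample_visible`, `merge_tokens`), the 0.75 mask ratio, the 0.5 keep ratio, the score used to pick merge centers, and the nearest-center averaging rule are all assumptions chosen for illustration.

```python
import math
import random


def sample_visible(num_tokens, mask_ratio, seed=0):
    """Return sorted indices of the tokens that stay visible after random masking."""
    rng = random.Random(seed)
    num_visible = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_visible))


def merge_tokens(positions, features, scores, keep_ratio):
    """Merge n tokens into ceil(keep_ratio * n) off-grid tokens.

    Assumption: the highest-scoring tokens act as merge centers, and every
    other token is averaged into its spatially nearest center.
    """
    n = len(positions)
    num_centers = max(1, math.ceil(n * keep_ratio))
    centers = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_centers]
    groups = {c: [c] for c in centers}  # each center always belongs to its own group
    for i in range(n):
        if i in groups:
            continue
        nearest = min(centers, key=lambda c: math.dist(positions[i], positions[c]))
        groups[nearest].append(i)

    merged_pos, merged_feat = [], []
    for c in centers:
        members = groups[c]
        k = len(members)
        # Merged position is the mean of member positions, so it can leave the grid.
        merged_pos.append(tuple(sum(positions[i][d] for i in members) / k for d in range(2)))
        dim = len(features[c])
        merged_feat.append([sum(features[i][d] for i in members) / k for d in range(dim)])
    return merged_pos, merged_feat


# Toy example: an 8x8 token grid with 2-dimensional features.
grid = 8
positions = [(float(x), float(y)) for y in range(grid) for x in range(grid)]
visible = sample_visible(len(positions), mask_ratio=0.75)

# Masked tokens are dropped entirely; no mask token is fed to the encoder.
vis_pos = [positions[i] for i in visible]
vis_feat = [[float(i), 1.0] for i in visible]
scores = [f[0] for f in vis_feat]  # hypothetical importance score

merged_pos, merged_feat = merge_tokens(vis_pos, vis_feat, scores, keep_ratio=0.5)
print(len(positions), len(visible), len(merged_pos))
```

Because the merge operates on a list of (position, feature) pairs rather than on a dense 2D array, it never needs placeholder entries for masked locations, which is the property the abstract refers to as removing dense-grid assumptions.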

Paper Structure

This paper contains 32 sections, 7 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: mIoU score on the Foot Process Width Segmentation dataset vs. Model FLOPs. Both ViT and AFF (Ours) were pre-trained MAE-style on over 170k unlabeled electron microscopy images. AFFMAE demonstrates up to a 6$\times$ reduction in FLOPs at high resolutions while maintaining performance comparable to the ViT baseline.
  • Figure 2: Three main downsampling approaches with a masked encoder. From left to right: grid-based merging, in which a patch of tokens (typically 4x4) is merged into one output token on a grid, using a learned mask token during encoding that is discarded during fine-tuning; downsampling via token pruning; and cluster point-based KNN token merging of visible tokens. The latter two are the novel approaches we explore for MAE; both operate on visible tokens only and allow variable downsampling rates.
  • Figure 3: Visualization of adaptive token downsampling learned via AFFMAE ($d_s=0.4$). Starting from a dense uniform grid (a), AFF dynamically merges tokens in homogeneous regions while preserving high token density along complex structures. By the final stage (d), total token count is reduced by $\approx 94\%$.
  • Figure 4: PCA projections of the learned representations. To provide a dense spatial grid rather than a scattered set of tokens, these features are visualized after the decoder's cross-attention. Without deep supervision (center), features collapse into homogeneous states at the deepest stage. Deep supervision (right) successfully preserves rich feature diversity at the deepest stage.
  • Figure 5: Normalized Effective Rank. The evolution of representation rank across layers, computed from the encoder tokens.
  • ...and 6 more figures
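
Two quantities from the captions above can be made concrete with a short sketch. It rests on assumptions that the captions do not confirm: that $d_s$ in Figure 3 is the fraction of tokens kept at each merge and that three merges separate panels (a) and (d), and that the normalized effective rank in Figure 5 is the exponential of the entropy of the normalized singular values divided by the number of singular values. The function names are hypothetical, and the singular values are taken as given (for example from any SVD routine applied to the encoder token matrix).

```python
import math


def remaining_fraction(keep_ratio, num_merges):
    """Fraction of tokens left after repeated merging at a fixed keep ratio (assumed model)."""
    return keep_ratio ** num_merges


def normalized_effective_rank(singular_values, eps=1e-12):
    """Effective rank (exp of spectral entropy) divided by the maximum possible rank.

    Assumption: normalizing by len(singular_values); the paper may normalize differently.
    """
    total = sum(singular_values)
    if total <= eps:
        return 0.0
    # Treat the normalized singular values as a probability distribution.
    probs = [s / total for s in singular_values if s > eps]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy) / len(singular_values)


# Under the stated assumptions, d_s = 0.4 over three merges leaves 0.4**3 of the tokens.
reduction = 1.0 - remaining_fraction(0.4, 3)
print(f"token reduction: {reduction:.1%}")

# A flat spectrum uses every direction equally; a one-hot spectrum is fully collapsed.
print(normalized_effective_rank([1.0, 1.0, 1.0, 1.0]))
print(normalized_effective_rank([1.0, 0.0, 0.0, 0.0]))
```

Under these assumptions the reduction comes out to about 93.6%, in line with the $\approx 94\%$ stated for Figure 3. A normalized effective rank near 1 indicates diverse features, while a value near the reciprocal of the number of singular values indicates the kind of collapse described for Figure 4.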