CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework

Jiaxuan Li, Qing Xu, Xiangjian He, Ziyu Liu, Chang Xing, Zhen Chen, Daokun Zhang, Rong Qu, Chang Wen Chen

TL;DR

CoMA tackles two inefficiencies of MAE-style pre-training: it enforces uniform pixel-wise supervision through complementary masking, and it replaces the fixed-resolution ViT with DyViT, a hierarchical transformer built on Dynamic Multi-Window Self-Attention (DM-MSA). The dual-branch masking keeps training efficient while ensuring dense supervision, and DM-MSA enables multi-scale feature learning with fewer parameters and FLOPs. Empirically, CoMA pre-training yields competitive or superior downstream performance (e.g., 84.1% top-1 on ImageNet-1K with ViT-B at 800 epochs), faster convergence, and strong gains in semantic segmentation (51.5 mIoU on ADE20K) and in object detection and instance segmentation on COCO, while reducing pre-training time by roughly 10% compared to MAE. Overall, the framework improves data utilization, learning efficiency, and multi-scale perception, making pre-training more practical for large-scale vision transformers.

Abstract

Masked Autoencoders (MAE) achieve self-supervised learning of image representations by randomly removing a portion of visual tokens and reconstructing the original image as a pretext task, thereby significantly enhancing pretraining efficiency and yielding excellent adaptability across downstream tasks. However, MAE and other MAE-style paradigms that adopt random masking generally require more pre-training epochs to maintain adaptability. Meanwhile, ViT in MAE suffers from inefficient parameter use due to fixed spatial resolution across layers. To overcome these limitations, we propose the Complementary Masked Autoencoders (CoMA), which employ a complementary masking strategy to ensure uniform sampling across all pixels, thereby improving effective learning of all features and enhancing the model's adaptability. Furthermore, we introduce DyViT, a hierarchical vision transformer that employs a Dynamic Multi-Window Self-Attention (DM-MSA), significantly reducing the parameters and FLOPs while improving fine-grained feature learning. Pre-trained on ImageNet-1K with CoMA, DyViT matches the downstream performance of MAE using only 12% of the pre-training epochs, demonstrating more effective learning. It also attains a 10% reduction in pre-training time per epoch, further underscoring its superior pre-training efficiency.
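
The complementary masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 196-patch grid, and the choice to make the second mask the exact logical complement of the first (so the two branches mask fractions r and 1 - r of the patches) are assumptions made for illustration, and the paper's actual construction may differ. The sketch only shows why a complementary pair covers every patch exactly once per iteration, whereas a single random mask leaves per-patch masking counts uneven.

```python
import random


def complementary_masks(num_patches, mask_ratio, rng):
    """Return two boolean masks (True = masked) that are exact complements.

    Assumption: the second branch masks precisely the patches that the
    first branch leaves visible.
    """
    num_masked = round(num_patches * mask_ratio)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    mask_a = [i in masked_ids for i in range(num_patches)]
    mask_b = [not m for m in mask_a]
    return mask_a, mask_b


def masking_counts(num_patches, mask_ratio, iterations, complementary, seed=0):
    """Count how often each patch is masked over a number of iterations.

    complementary=True  : both masks of the pair contribute to the count.
    complementary=False : only one random mask per iteration (MAE-style).
    """
    rng = random.Random(seed)
    counts = [0] * num_patches
    for _ in range(iterations):
        mask_a, mask_b = complementary_masks(num_patches, mask_ratio, rng)
        if not complementary:
            # Baseline: discard the complement and keep a single random mask.
            mask_b = [False] * num_patches
        for i in range(num_patches):
            counts[i] += int(mask_a[i]) + int(mask_b[i])
    return counts


if __name__ == "__main__":
    # 196 patches corresponds to a 14 x 14 grid (an illustrative choice).
    paired = masking_counts(196, 0.6, iterations=100, complementary=True)
    single = masking_counts(196, 0.6, iterations=100, complementary=False)
    print("complementary min/max:", min(paired), max(paired))
    print("random        min/max:", min(single), max(single))
```

With the complementary pair, every patch is masked exactly once per iteration, so the minimum and maximum counts coincide; with a single random mask, the counts spread around the expected value.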

Paper Structure

This paper contains 12 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Visualization of the relationship between the number of pre-training iterations and ImageNet-1K classification accuracy for the base versions of MAE [he2022masked], CAE [chen2024context], ColorMAE [hinojosa2024colormae], and CoMA. The radius of each data point represents the total pre-training time, with larger dots indicating longer durations. Specific pre-training hours are annotated beside each point. All models were trained using a 60% masking ratio.
  • Figure 2: Visualization of masking frequency over 1,600 iterations for Random masking (MAE) and Complementary Masking (CoMA). Dark red regions indicate patches with higher masking frequencies, while lighter regions correspond to less frequently masked patches. Both strategies employ a masking ratio of 60%.
  • Figure 3: CoMA: Complementary Masked Autoencoder. Model $\text{M}_t$ serves as the adaptive model and participates in gradient backpropagation, while model $\text{M}_{t-1}$ acts as the evaluation model and remains completely frozen. The parameters of $\text{M}_{t-1}$ are updated solely based on those of $\text{M}_t$ at time step $t$.
  • Figure 4: The structure of the proposed DyViT model and its core attention mechanism. (a) Overview of the DyViT architecture. (b) Illustration of the Dynamic Multi-window Self-Attention module (DM-MSA).
  • Figure 5: Ablation study across masking ratios with $32 \times 32$ patches and 300 pre-training epochs (left); reconstruction loss on the validation set using an 8-layer decoder (middle); impact of decoder depth on classification transferability (right).