Evolved Hierarchical Masking for Self-Supervised Learning

Zhanzhou Feng, Shiliang Zhang

TL;DR

This work tackles the limitation of fixed masking patterns in Masked Image Modeling (MIM) by introducing evolved hierarchical masking, which builds a dynamic, multi-level cue hierarchy from the model's own attention and evolves the masking depth as training progresses. The Adaptive Hierarchy Establishment module constructs a binary tree over image patches for each input, and the Evolved Mask Generation module selects the masking depth and patch sets according to the training epoch, enabling a smooth shift from low-level textures to high-level semantics without additional annotations or pre-trained models. Empirically, the approach improves results across seven downstream tasks (for example, it surpasses MAE by 1.1% on ImageNet-1K classification and 1.4% on ADE20K segmentation with the same number of training epochs) and delivers competitive performance with fewer pretraining epochs, bridging self-supervised learning with semantic-heavy tasks and offering a cost-efficient path toward large-scale pretraining. The method also provides insights into how masking at different cue levels affects learning in shallow versus deep layers, and it scales across diverse MIM architectures and backbones.
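The epoch-dependent choice of masking depth can be illustrated with a small sketch. This is not the authors' schedule: the linear ramp, the function name `mask_level_for_epoch`, and the convention that level 0 means individual patches while higher levels mean larger merged groups are all assumptions made for illustration.

```python
def mask_level_for_epoch(epoch: int, total_epochs: int, max_level: int) -> int:
    """Hypothetical linear schedule for the masking level.

    Assumed convention (not from the paper): level 0 masks individual
    patches (low-level cues); max_level masks the largest groups
    (high-level cues).
    """
    if total_epochs <= 1:
        return 0
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    # Early epochs map to fine levels, late epochs to coarse levels.
    return int(progress * max_level)


# Print the assumed level at a few points of a 100-epoch run.
for e in (0, 25, 50, 75, 99):
    print(e, mask_level_for_epoch(e, total_epochs=100, max_level=4))
```

Any monotone schedule would serve the same purpose here; the only property taken from the text is that masking moves from low-level to high-level cues as training proceeds.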

Abstract

Existing Masked Image Modeling methods apply fixed mask patterns to guide self-supervised training. Because those mask patterns rely on different criteria to depict image content, sticking to a single fixed pattern limits the capability to model visual cues. This paper introduces an evolved hierarchical masking method to pursue general visual cue modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchical structure, which is then used to generate masks. The accuracy of the hierarchy is on par with the capability of the model being trained, leading to mask patterns that evolve across training stages. Initially, the generated masks focus on low-level visual cues to grasp basic textures; they then gradually evolve to depict higher-level cues, reinforcing the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks, including partial-duplicate image retrieval, which relies on low-level details, as well as image classification and semantic segmentation, which require semantic parsing capability. Experimental results demonstrate that the method substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% in ImageNet-1K classification and 1.4% in ADE20K segmentation with the same number of training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantically demanding tasks and enhances the perception of intricate details in tasks requiring low-level feature recognition.

Paper Structure

This paper contains 14 sections, 17 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: (a), (b), and (c) are three basic mask patterns adopted in existing MIM methods. (d) illustrates the proposed evolved hierarchical masking, where the generated mask patterns evolve with the capability of the vision model being trained.
  • Figure 2: Effects of different mask patterns on downstream tasks in (a) and on learned parameters in (b). In (a), the random pattern performs best in image classification and the block pattern performs best in semantic segmentation. (b) shows the mean attention distance across images at different layers of the pre-trained model; the y-axis is in pixels. These results indicate that different mask patterns are suited to different tasks.
  • Figure 3: Visualization of the [CLS] attention heatmap from classification [deng2009imagenet] in (a), semantic segmentation results [zhou2017scene] in (b), and landmark retrieval results on Oxford Building [philbin2007object, philbin2008lost] in (c). These visualizations show that the proposed method exhibits superior visual cue acquisition capacity at different semantic levels. Best viewed zoomed in.
  • Figure 4: Illustration of the established visual cue hierarchy structure $\mathcal{T}$. MIM is performed by reconstructing masked patches from the visible regions. Different mask patterns can be obtained by changing the mask depth on the hierarchy structure, e.g., masking nodes on the 3$^{rd}$ level or the 1$^{st}$ level.
  • Figure 5: The pipeline of the proposed Evolved Hierarchical Masking, using MAE [he2022masked] as an example. The input image is fed into the encoder, which extracts the attention map $A$ reflecting the similarity between each pair of patches. The Adaptive Hierarchy Establishment module organizes leaf nodes and latent variables into the hierarchy $\mathcal{T}$. Based on $\mathcal{T}$, the Evolved Mask Generation module generates masks $\mathcal{M}$ at a specific masking depth.
  • ...and 6 more figures
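
The pipeline in the Figure 5 caption (attention map $A$, hierarchy $\mathcal{T}$, masks $\mathcal{M}$) can be approximated with a short sketch. It is a stand-in under stated assumptions, not the paper's Adaptive Hierarchy Establishment or Evolved Mask Generation modules: greedy average-linkage merging replaces the latent-variable hierarchy, a "number of groups" cut replaces the masking depth, and the names `build_hierarchy`, `groups_at`, and `generate_mask` are hypothetical.

```python
import numpy as np


def build_hierarchy(attn):
    """Greedily merge the two most similar clusters into a binary tree.

    Assumption: average linkage on the symmetrized attention map stands in
    for the paper's hierarchy construction.
    """
    n = attn.shape[0]
    sim = (attn + attn.T) / 2.0           # symmetric patch-to-patch similarity
    members = {i: [i] for i in range(n)}  # node id -> leaf patches under it
    active, merges, next_id = list(range(n)), [], n
    while len(active) > 1:
        best, pair = -np.inf, None
        for i, a in enumerate(active):
            for b in active[i + 1:]:
                s = sim[np.ix_(members[a], members[b])].mean()
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        members[next_id] = members[a] + members[b]
        merges.append((a, b, next_id))    # (left child, right child, parent)
        active = [c for c in active if c not in (a, b)] + [next_id]
        next_id += 1
    return merges, members


def groups_at(merges, members, n, num_groups):
    """Cut the tree so that exactly num_groups subtrees remain."""
    active = list(range(n))
    for a, b, parent in merges[: n - num_groups]:
        active = [c for c in active if c not in (a, b)] + [parent]
    return [members[c] for c in active]


def generate_mask(groups, n, mask_ratio, rng):
    """Mask whole groups in random order until the target ratio is reached.

    Assumption: the last group is truncated so the ratio is met exactly.
    """
    mask = np.zeros(n, dtype=bool)
    target = int(round(mask_ratio * n))
    for g in rng.permutation(len(groups)):
        remaining = target - int(mask.sum())
        if remaining <= 0:
            break
        mask[groups[g][:remaining]] = True
    return mask


# Demo with a random row-stochastic matrix standing in for encoder attention.
rng = np.random.default_rng(0)
n = 16
logits = rng.normal(size=(n, n))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
merges, members = build_hierarchy(attn)
coarse = generate_mask(groups_at(merges, members, n, 4), n, 0.75, rng)
fine = generate_mask(groups_at(merges, members, n, n), n, 0.75, rng)
print(coarse.astype(int), fine.astype(int))
```

Cutting at `num_groups = n` masks individual patches, much like a random pattern, while a small `num_groups` masks large, mutually similar regions. That contrast is what the sketch is meant to show; a real implementation would read the attention from the encoder being trained rather than from random logits.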