MaskAnyNet: Rethinking Masked Image Regions as Valuable Information in Supervised Learning

Jingshan Hong, Haigen Hu, Huihuang Zhang, Qianwei Zhou, Zhao Li

TL;DR

MaskAnyNet addresses the information loss caused by masking in supervised learning by introducing a dual-branch architecture that reuses masked regions as auxiliary knowledge. The mask-region reuse branch reconstructs and reintegrates masked content and is coupled with a feature fusion and alignment module, so that both CNNs and Transformers can learn global semantics and local detail. Across CIFAR, ImageNet, and downstream detection/segmentation tasks, MaskAnyNet yields consistent Top-1 accuracy gains, and ablation studies confirm the contributions of masking, reuse, and fusion. The approach enhances semantic diversity and pixel utilization, offering a practical path to stronger generalization with masked inputs.

Abstract

In supervised learning, traditional image masking faces two key issues: (i) discarded pixels are underutilized, leading to a loss of valuable contextual information; (ii) masking may remove small or critical features, especially in fine-grained tasks. In contrast, masked image modeling (MIM) has demonstrated that masked regions can be reconstructed from partial input, revealing that even incomplete data can exhibit strong contextual consistency with the original image. This highlights the potential of masked regions as sources of semantic diversity. Motivated by this, we revisit the image masking approach and propose treating masked content as auxiliary knowledge rather than ignoring it. Based on this, we propose MaskAnyNet, which combines masking with a relearning mechanism to exploit both visible and masked information. It can be easily applied to any model by adding a branch that jointly learns from the recomposed masked region. This approach leverages the semantic diversity of the masked regions to enrich features and preserve fine-grained details. Experiments on CNN and Transformer backbones show consistent gains across multiple benchmarks. Further analysis confirms that the proposed method improves semantic diversity through the reuse of masked content.
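
To make the "visible plus masked" idea concrete, the following is a minimal NumPy sketch of how a mask partitions an image into two complementary inputs, so that no pixel is discarded. It is an illustration only, not the authors' implementation: the function name `split_by_mask`, the zero-fill convention, and the array shapes are assumptions, and the recomposition, fusion, and alignment steps described in the paper are not modeled here.

```python
import numpy as np

def split_by_mask(image, mask):
    """Split an image into a visible part and a complementary masked part.

    image: float array of shape (H, W, C)
    mask:  boolean array of shape (H, W); True marks a masked (hidden) pixel

    Hypothetical helper: masked-out positions are filled with zeros.
    """
    mask3 = mask[..., None]                # add a channel axis for broadcasting
    visible = np.where(mask3, 0.0, image)  # what a main branch would see
    masked = np.where(mask3, image, 0.0)   # what a reuse branch would see
    return visible, masked

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = rng.random((8, 8)) < 0.5            # roughly half of the pixels masked

visible, masked = split_by_mask(image, mask)

# The two parts are complementary: together they recover every pixel.
assert np.allclose(visible + masked, image)
```

In conventional masking only `visible` would reach the network; the abstract's proposal is that the content in `masked` is also fed to the model through an additional branch.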

Paper Structure

This paper contains 27 sections, 4 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Comparison of conventional mask discarding and the mask-reuse strategy for complementary visual information, based on ResNet-34. Top: the heatmap covers only part of the target and fails to clearly capture its edge details (e.g., partial coverage of vehicles or distraction by background elements such as ocean scenes). Bottom: the heatmaps focus precisely on the target area by repurposing the masked regions as complementary sources of visual information.
  • Figure 2: Overall architecture of the proposed MaskAnyNet, which consists of three main components: mask generation, mask-region information reuse, and feature fusion and alignment.
  • Figure 3: Comparison of different masking methods. Patch Masking preserves local details, Grid Masking preserves global semantics through a fixed pattern, and Random Masking enhances diversity through irregular patterns.
  • Figure 4: Performance of MaskResNet-34 with three different masking methods and different mask ratios on the ImageNet-1K dataset.
  • Figure 5: Grad-CAM visualization of ResNet-34 and MaskResNet-34. MaskResNet-34 focuses more accurately on the target regions (e.g., vehicle bodies, animal contours, human figures) while better suppressing background interference.
  • ...and 1 more figures
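
The three masking styles named in the Figure 3 caption can be sketched as boolean mask generators. This is a rough illustration under stated assumptions, not the paper's code: the function names, the parameters (`ratio`, `patch`, `period`, `width`), and the exact patterns are hypothetical, and `patch_mask` assumes the image height and width are divisible by `patch`.

```python
import numpy as np

def random_mask(h, w, ratio, rng):
    """Irregular pattern: mask a fixed fraction of positions chosen at random."""
    n = h * w
    k = int(round(ratio * n))
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=k, replace=False)] = True
    return flat.reshape(h, w)

def patch_mask(h, w, ratio, patch, rng):
    """Mask whole patch x patch blocks chosen at random.

    Assumes h and w are divisible by `patch`.
    """
    coarse = random_mask(h // patch, w // patch, ratio, rng)
    # Upsample the coarse block-level mask to pixel resolution.
    return coarse.repeat(patch, axis=0).repeat(patch, axis=1)

def grid_mask(h, w, period, width):
    """Fixed pattern: a width x width square masked in every period x period cell."""
    rows = (np.arange(h) % period) < width
    cols = (np.arange(w) % period) < width
    return rows[:, None] & cols[None, :]

rng = np.random.default_rng(0)
masks = {
    "random": random_mask(8, 8, 0.5, rng),
    "patch": patch_mask(8, 8, 0.5, 2, rng),
    "grid": grid_mask(8, 8, 4, 2),
}
for name, m in masks.items():
    # Print each mask's shape and the fraction of positions it hides.
    print(name, m.shape, float(m.mean()))
```

In each case `True` marks a hidden position, so the logical complement `~m` selects the visible region; under the reuse strategy both regions would be passed to the model rather than only the visible one.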