From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding

Wenzhao Xiang, Yue Wu, Hongyang Yu, Feng Gao, Fan Yang, Xilin Chen

TL;DR

C2FMAE is a coarse-to-fine masked autoencoder that achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of its hierarchical design in learning more robust and generalizable representations.

Abstract

Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically-agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
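
The progressive masking curriculum described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the schedule, so the piecewise-linear weights, the per-step sampling of a strategy, and the names `mixing_weights` and `choose_strategy` are all hypothetical. The sketch only shows one way the training focus could move from semantic-guided to instance-guided and finally to random masking.

```python
import random


def mixing_weights(progress: float) -> tuple[float, float, float]:
    """Return (alpha_S, alpha_I, alpha_R) for training progress in [0, 1].

    ASSUMPTION: a piecewise-linear schedule chosen for illustration only;
    the schedule actually used by the paper is not stated in the abstract.
    """
    p = min(max(progress, 0.0), 1.0)       # clamp progress to [0, 1]
    alpha_s = max(0.0, 1.0 - 2.0 * p)      # semantic-guided: 1 -> 0 over the first half
    alpha_i = 1.0 - abs(2.0 * p - 1.0)     # instance-guided: peaks at the midpoint
    alpha_r = max(0.0, 2.0 * p - 1.0)      # random: 0 -> 1 over the second half
    return alpha_s, alpha_i, alpha_r


def choose_strategy(progress: float, rng: random.Random) -> str:
    """Sample one masking strategy for the current step (hypothetical helper)."""
    weights = mixing_weights(progress)
    return rng.choices(("semantic", "instance", "random"), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in range(0, 101, 25):
        progress = step / 100
        print(progress, mixing_weights(progress), choose_strategy(progress, rng))
```

In this sketch the three weights always sum to 1, so early steps use only semantic-guided masking, the midpoint uses only instance-guided masking, and the final steps use only random masking, with a mix of adjacent strategies in between.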

Paper Structure (16 sections, 12 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Attention maps from different methods, highlighting their representational focus. DINO excels at capturing high-level semantics, while MAE and MultiMAE's attention is directed toward low-level features. In contrast, our C2FMAE effectively captures features across all levels, successfully building a more robust hierarchical representation.
  • Figure 2: C2FMAE pre-training framework. Multi-granular data (RGB images, instance masks, and semantic masks) is first masked by Progressive Masking, then concatenated and fed to a Transformer encoder. The encoded tokens then flow into a cascaded decoder with three task-specific blocks. Each block is a standard Transformer decoder block composed of self-attention, cross-attention, and feed-forward network layers. Linear layers serve as the final predictors. As training progresses, the masking strategy transitions from semantic-guided masking to instance-guided masking, and finally to random masking, to build hierarchical visual representations.
  • Figure 3: Curves of $\alpha_I$ and $\alpha_S$ over the course of training.
  • Figure 4: Predictions of C2FMAE on masked multi-granular data. All the tested images are from the ImageNet-1K validation set and masked with the random masking strategy.
  • Figure 5: Single-modal prediction of C2FMAE and MultiMAE on the ImageNet-1K validation set.
  • ...and 1 more figures