In Pursuit of Pixel Supervision for Visual Pre-training

Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu

TL;DR

This work advocates pixel-space self-supervision for visual pre-training and introduces Pixio, an enhanced MAE with a deeper decoder, larger masking blocks, and more class tokens, trained on 2B web images with minimal curation. Across monocular depth, 3D reconstruction, semantic segmentation, and robot learning, Pixio matches or outperforms state-of-the-art latent-space methods like DINOv3 at similar scales, demonstrating robust cross-domain transfer. The authors provide extensive ablations, distillation experiments, and implementation details, and they acknowledge limitations of pixel-only masking while outlining future directions toward web-scale video data and temporal objectives. Overall, the results position pixel-based supervision as a strong, scalable complement to latent-space approaches for visual foundation models.

Abstract

At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images using a self-curation strategy that requires minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.

Paper Structure

This paper contains 27 sections, 7 figures, 15 tables.

Figures (7)

  • Figure 1: Pixel supervision compels the model to compress and re-organize visual knowledge across all levels. To accurately predict pixels, the model must understand geometry, texture, semantics, materials, lighting, etc. By masking and pixel reconstruction, MAE learns these desirable visual properties and even exhibits early reasoning capabilities (Wiedemer et al., 2025). From left to right in each group: masked input, reconstructed image (visible patches are kept), ground truth image (unseen during training).
  • Figure 2: Pixio introduces four simple yet critical updates to MAE, with the following motivations. Deeper decoder: MAE's shallow decoder lacks capacity for pixel regression, forcing the encoder to sacrifice representation quality for reconstruction. Larger mask block: single-patch masking causes reconstruction shortcuts and provides insufficient context. More [CLS] tokens: a single class token cannot capture diverse global properties. Web-scale training data: IN-1K lacks the visual diversity needed for learning transferable representations.
  • Figure 3: Probing frozen features in different blocks of the original MAE encoder, which is trained on ImageNet-1K. The relative block depth is computed as the ratio of the block index to the total number of blocks, for easy comparison across architectures (ViT-H: 32 blocks, ViT-L: 24 blocks). We use a linear head for both monocular depth estimation (regression) and semantic segmentation (classification).
  • Figure 4: Ablation study of using decoders of different depth (#attention blocks) or width (feature dimension) to train MAE on IN-21K. The encoder is ViT-H (1280-d $\times$ 32-blocks). Here, we use a DPT head for depth estimation and a linear head for semantic segmentation.
  • Figure 5: Ablation study on masking granularity (measured in #patches). MAE uses single-patch (1$\times$1) masking granularity.
  • ...and 2 more figures
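
The "larger mask block" idea in the captions for Figures 2 and 5 can be illustrated with a small sketch. The captions do not spell out the sampling procedure, so everything below is an assumption made for illustration: the function name `block_mask`, the non-overlapping coarse grid, and the default block size and mask ratio are hypothetical and not taken from the paper. The sketch picks whole `block` x `block` groups of patches to mask, so that with `block=1` it reduces to MAE-style single-patch masking.

```python
import random


def block_mask(grid_h, grid_w, block=4, mask_ratio=0.75, seed=None):
    """Return a grid_h x grid_w nested list; True marks a masked patch.

    Hypothetical sketch: patches are grouped into non-overlapping
    block x block cells, and a fraction `mask_ratio` of the cells is
    masked as a whole. The block size and ratio are illustrative only.
    """
    if grid_h % block or grid_w % block:
        raise ValueError("grid size must be divisible by the block size")
    rng = random.Random(seed)

    # Coarse grid: one cell per block of patches.
    coarse_h, coarse_w = grid_h // block, grid_w // block
    cells = [(r, c) for r in range(coarse_h) for c in range(coarse_w)]

    # Sample the cells to mask without replacement.
    num_masked = round(mask_ratio * len(cells))
    masked_cells = set(rng.sample(cells, num_masked))

    # Expand the coarse decision back to patch resolution.
    return [
        [(r // block, c // block) in masked_cells for c in range(grid_w)]
        for r in range(grid_h)
    ]


if __name__ == "__main__":
    mask = block_mask(16, 16, block=4, mask_ratio=0.75, seed=0)
    for row in mask:
        print("".join("#" if m else "." for m in row))
```

Because each masked region spans several adjacent patches, a masked patch can no longer be filled in by copying from an immediate visible neighbour, which is the reconstruction shortcut the Figure 2 caption attributes to single-patch masking.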