Gaussian Masked Autoencoders

Jathushan Rajasegaran; Xinlei Chen; Rulilong Li; Christoph Feichtenhofer; Jitendra Malik; Shiry Ginosar

TL;DR

Gaussian Masked Autoencoders (GMAE) extend Masked Autoencoders by learning a mid-level 3D Gaussian representation that is rendered into 2D images via differentiable splatting. The Gaussians are parameterized by $g = \{p, s, \phi, r, o\} \in \mathbb{R}^{14}$ with covariance $\Sigma = R S S^{T} R^{T}$ and $S = \mathrm{diag}(s)$, and are learned through a pixel-space reconstruction objective that promotes joint semantic and spatial understanding. GMAE achieves competitive semantic performance on ImageNet and COCO while enabling zero-shot spatial tasks such as figure-ground segmentation, image layering, and edge detection, evidenced by both quantitative metrics and qualitative visualizations. By integrating differentiable 3D reasoning into self-supervised learning, GMAE offers a scalable framework for high-fidelity visual data modeling and suggests directions for future exploration of mid-level representations.
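As a rough illustration of the parameterization above, the following sketch builds $\Sigma = R S S^{T} R^{T}$ from a scale vector and an orientation. It rests on several assumptions that this summary does not state: that $\phi$ is a quaternion in (w, x, y, z) order from which the rotation matrix $R$ is derived, that the 14 values split as 3 (center p) + 3 (scale s) + 4 (orientation phi) + 3 (color r) + 1 (opacity o), and that they are stored in that order. The function names are hypothetical and are not taken from the authors' code.

```python
import math


def quat_to_rotmat(quat):
    """Convert a quaternion (w, x, y, z) into a 3x3 rotation matrix.

    The (w, x, y, z) ordering is an assumption made for this sketch.
    """
    w, x, y, z = quat
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(scale, quat):
    """Return Sigma = R S S^T R^T with S = diag(scale)."""
    rot = quat_to_rotmat(quat)
    # M = R S: column j of R is multiplied by scale[j].
    m = [[rot[i][j] * scale[j] for j in range(3)] for i in range(3)]
    # Sigma = M M^T, which is symmetric and positive semi-definite.
    return [
        [sum(m[i][k] * m[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def split_gaussian(g):
    """Split a 14-vector into named parts (assumed layout: 3 + 3 + 4 + 3 + 1)."""
    if len(g) != 14:
        raise ValueError("expected 14 values per Gaussian")
    return {"p": g[0:3], "s": g[3:6], "phi": g[6:10], "r": g[10:13], "o": g[13]}


# Usage: a Gaussian rotated 90 degrees about the z axis.
half_angle = math.pi / 4
sigma = covariance([1.0, 2.0, 3.0], [math.cos(half_angle), 0.0, 0.0, math.sin(half_angle)])
```

Factoring the covariance through a rotation and a diagonal scale keeps it symmetric and positive semi-definite for any predicted values, which is presumably why this form is convenient when a network outputs the parameters directly.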

Abstract

This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learn good semantic abstractions, they are not trained for explicit spatial awareness. Our approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in the pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities of spatial understanding (e.g., figure-ground segmentation, image layering, and edge detection) while preserving the high-level semantics of self-supervised representation quality from MAE. To our knowledge, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstructions. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data. More details at https://brjathu.github.io/gmae

Paper Structure (13 sections, 1 equation, 12 figures, 6 tables)

Introduction
Related work
Method
Preliminaries
Our Approach
Experiments
Design Choices
Supervised Tasks
Unsupervised Tasks
Qualitative Results
Discussion
Training details
More Samples

Figures (12)

Figure 1: Gaussian Masked Autoencoders (GMAE) maintains high performance in supervised representation learning tasks such as classification, detection, and segmentation, but more importantly enables zero-shot capabilities. GMAE introduces a learned mid-level intermediate representation of 3D Gaussians that we train by rendering the Gaussians into pixel space and applying pixel-based image reconstruction losses, rather than through direct supervision. Through this reconstruction loss, the Gaussian collection learns to distribute non-uniformly in space and scale, dynamically following the input image's information density and high-frequency details. Having the degree of freedom in depth allows the model to learn the layering of objects and scenes, which enables figure-ground separation, layering, and edge detection without any training.
Figure 2: Masked Autoencoding via Gaussian Splatting: The ViT Encoder processes masked input image patches to produce latent embeddings (orange). The ViT Decoder then predicts explicit Gaussian parameters (color, opacity, center, scale, and orientation) from query tokens (blue). These Gaussians are then rendered via differentiable volume splatting (Kerbl et al., 2023) to reconstruct the original image. We pre-train our models fully end-to-end with self-supervision.
Figure 3: Number of Gaussians: ImageNet classification performance with 64, 128, 256, and 512 Gaussians at the decoder during pre-training. We evaluate these models on linear probing and full fine-tuning. As we increase the number of Gaussians, linear probing performance increases monotonically. Full fine-tuning behaves similarly at first, then saturates after 256 Gaussians.
Figure 4: Effect of scale on reconstruction: Visualizations showing that, with small-scale Gaussians, the model cannot complete the whole image with a fixed number of Gaussians.
Figure 5: Reconstruction Quality: Examples of test-time reconstructions when the input is fully visible (mask ratio=0). The first row shows the RGB images and the second row shows our reconstructions. Having a decoupled decoder allows us to perform inference with any masking ratio, even though this model is trained with a 0.75 ratio. The dynamically learned non-uniform spatial and scale distribution of Gaussians enables GMAE to reconstruct high-frequency regions like lines and edges. Our rFID score is 89.45 versus 98.12 for MAE (lower is better), and our PSNR is 18.74 versus 18.63 for MAE (higher is better).
...and 7 more figures
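
For readers unfamiliar with the PSNR values quoted in Figure 5, here is a minimal sketch of the standard definition, PSNR = 10 log10(peak^2 / MSE). It assumes pixel values flattened into plain sequences and scaled to [0, 1] (so peak = 1.0); the function name and these conventions are illustrative assumptions, and the paper's exact evaluation protocol is not given in this summary.

```python
import math


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    `peak` is the maximum possible pixel value (1.0 is assumed here).
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Usage: a uniform error of 0.1 on every pixel gives an MSE of 0.01.
value = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Because the scale is logarithmic, the gap between 18.74 and 18.63 corresponds to only a small difference in mean squared error.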
