Self-Guided Masked Autoencoders for Domain-Agnostic Self-Supervised Learning

Johnathan Xie, Yoonho Lee, Annie S. Chen, Chelsea Finn

TL;DR

Self-Guided Masked Autoencoders (SMA) are a fully domain-agnostic masked modeling method that derives its masks from the model's own attention maps, removing the need for domain-specific tokenizers or masking priors. SMA trains a single attention-based model to reconstruct raw inputs that were masked according to its cross- or self-attention, and it learns strong representations in protein biology, chemistry, and particle physics. It achieves state-of-the-art performance on these three scientific benchmarks and is also compared against domain-specific masking on language and image data, suggesting that useful structure can be discovered purely from unlabeled data. SMA therefore offers a broadly applicable route to self-supervised representation learning without hand-crafted priors.

Abstract

Self-supervised learning excels in learning representations from large amounts of unlabeled data, demonstrating success across multiple data modalities. Yet, extending self-supervised learning to new modalities is non-trivial because the specifics of existing methods are tailored to each domain, such as domain-specific augmentations that reflect the invariances in the target task. While masked modeling is promising as a domain-agnostic framework for self-supervised learning because it does not rely on input augmentations, its mask sampling procedure remains domain-specific. We present Self-guided Masked Autoencoders (SMA), a fully domain-agnostic masked modeling method. SMA trains an attention-based model using a masked modeling objective while learning which masks to sample, without any domain-specific assumptions. We evaluate SMA on three self-supervised learning benchmarks in protein biology, chemical property prediction, and particle physics. We find that SMA is capable of learning representations without domain-specific knowledge and achieves state-of-the-art performance on these three benchmarks.

Paper Structure

This paper contains 22 sections, 5 equations, 4 figures, 11 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Self-guided Masked Autoencoders (SMA). (a) We first extract the cross-attention map from latent queries onto the full input array. (b) Next, we sum the rows of the attention matrix corresponding to a randomly selected subset of the queries and produce an input mask by selecting the largest values of the attention sum. (c) Our pre-training objective is to reconstruct the original inputs from the masked input sequence.
  • Figure 2: Visualization of learned masks for ImageNet-100. We show two sampled mask views generated from the same model weights to demonstrate our method can sample diverse masks even with static weights. We observe that masks are well clustered with respect to location and have a minor emphasis on color values.
  • Figure 3: End-to-end diagram of SMA.
  • Figure 4: Relationship between the number of inputs and the top attention overlap ratio for images. We find that the overlap ratio increases as the number of inputs increases.
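
The mask-sampling steps in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. The function name `sample_attention_mask` and the parameters `num_selected_queries` and `mask_ratio` are hypothetical, and the attention map is a toy list of lists rather than a tensor taken from a trained model.

```python
import random


def sample_attention_mask(attn, num_selected_queries, mask_ratio, rng):
    """Build a boolean input mask from a cross-attention map.

    attn[q][i] is the attention weight of latent query q on input i.
    All argument names here are illustrative assumptions.
    """
    num_queries = len(attn)
    num_inputs = len(attn[0])

    # Step (b), part 1: pick a random subset of the latent queries.
    chosen = rng.sample(range(num_queries), num_selected_queries)

    # Step (b), part 2: sum the attention rows of the chosen queries.
    summed = [sum(attn[q][i] for q in chosen) for i in range(num_inputs)]

    # Step (b), part 3: mask the inputs with the largest summed attention.
    num_masked = int(round(mask_ratio * num_inputs))
    ranked = sorted(range(num_inputs), key=lambda i: summed[i], reverse=True)
    masked = set(ranked[:num_masked])

    # True marks an input that is hidden and must be reconstructed (step c).
    return [i in masked for i in range(num_inputs)]


# Toy attention map: 2 latent queries attending over 4 inputs.
toy_attn = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.15, 0.70, 0.10],
]
# Selecting both queries makes the result deterministic.
print(sample_attention_mask(toy_attn, 2, 0.5, random.Random(0)))
# -> [False, True, True, False]
```

Because the query subset is resampled on every call, the same attention map can produce different masks. This matches the Figure 2 observation that diverse masks can be sampled from static weights.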