Masked Modeling for Self-supervised Representation Learning on Vision and Beyond

Siyuan Li, Luyuan Zhang, Zedong Wang, Di Wu, Lirong Wu, Zicheng Liu, Jun Xia, Cheng Tan, Yang Liu, Baigui Sun, Stan Z. Li

TL;DR

Masked Modeling for Self-supervised Representation Learning on Vision and Beyond presents a unified four-module framework (Mask, Target, Encoder, Head) for masked image modeling and extends it to diverse modalities. It contrasts Masked Modeling with traditional contrastive SSL, surveys masking strategies, target types, and network designs, and highlights theoretical perspectives and practical extensions including autoregressive generation and vision foundation models. The survey aggregates a wide range of MIM methods across vision, audio, graphs, and biology, and discusses downstream applications in video, detection, medical imaging, OCR, and 3D vision, while outlining limitations and future directions such as multimodality, efficient training with large models, and generalist architectures. Overall, the work provides a comprehensive taxonomy, methodological guidance, and practical insights to accelerate masked modeling research across domains.

Abstract

As the deep learning revolution marches on, self-supervised learning has garnered increasing attention in recent years thanks to its remarkable representation learning ability and the low dependence on labeled data. Among these varied self-supervised techniques, masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training. This paradigm enables deep models to learn robust representations and has demonstrated exceptional performance in the context of computer vision, natural language processing, and other modalities. In this survey, we present a comprehensive review of the masked modeling framework and its methodology. We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more. Then, we systematically investigate its wide-ranging applications across domains. Furthermore, we also explore the commonalities and differences between masked modeling methods in different fields. Toward the end of this paper, we conclude by discussing the limitations of current techniques and point out several potential avenues for advancing masked modeling research. A paper list project with this survey is available at https://github.com/Lupin1998/Awesome-MIM.

Paper Structure (51 sections, 25 equations, 15 figures, 7 tables)

This paper contains 51 sections, 25 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Research in self-supervised learning (SSL) can be broadly categorized into Generative and Discriminative paradigms. We reviewed major SSL research since 2008 and found that SSL has followed distinct developmental trajectories and stages across time periods and modalities. Since 2018, SSL in NLP has been dominated by generative masked language modeling, which remains mainstream. In computer vision, discriminative contrastive learning (CL) dominated from 2018 to 2021 before masked image modeling gained prominence after 2022.
  • Figure 2: Illustration of two popular self-supervised learning (SSL) frameworks. For simplicity, the input data can be serialized and transformed into a sequence of embedded tokens. (a) Contrastive learning (CL) learns discriminative representations from two augmented views of the input sequence by aligning their projected tokens. (b) Masked modeling learns contextual information generatively, by reconstructing the masked tokens.
  • Figure 3: SSL is commonly divided into generative and discriminative paradigms [liu2021self]. Generative models can be divided into AR, AE, flow-based, GAN-based, and diffusion-based models, where AE models can be further divided into denoising AE and masked AE. This survey focuses on AR and AE models for SSL and relevant tasks.
  • Figure 4: Mathematical notations.
  • Figure 5: The overview of the basic MIM framework, containing four building blocks with their internal components and functionalities. All MIM research can be summarized as innovations upon these four blocks, i.e., Masking, Encoder, Target, and Head. The general frameworks of masked modeling for other modalities are similar to this framework.
  • ...and 10 more figures
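
As a rough illustration of the four building blocks named in the Figure 5 caption (Masking, Encoder, Target, Head), the toy sketch below wires them together on a 1-D sequence of numeric tokens. It is not the survey's implementation: the function names, the mean-based "encoder", the raw-value target, and the 0.75 default mask ratio are all assumptions chosen to keep the example self-contained.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Masking block: choose which token positions are hidden (True = masked)."""
    num_masked = int(round(num_tokens * mask_ratio))
    if not 0 < num_masked < num_tokens:
        raise ValueError("mask_ratio must leave at least one masked and one visible token")
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def target(tokens):
    """Target block: here simply the raw token values (an assumption;
    other choices, such as tokenizer outputs or features, would go here)."""
    return list(tokens)


def encoder(tokens, mask):
    """Encoder block: toy stand-in that summarizes the visible tokens by their mean."""
    visible = [t for t, m in zip(tokens, mask) if not m]
    return sum(visible) / len(visible)


def head(context, mask):
    """Head block: predicts a value for every masked position from the context."""
    return {i: context for i, m in enumerate(mask) if m}


def mim_loss(tokens, mask_ratio=0.75, seed=0):
    """One forward pass: mask, encode visible tokens, predict, score masked positions."""
    rng = random.Random(seed)
    mask = random_mask(len(tokens), mask_ratio, rng)
    targets = target(tokens)
    preds = head(encoder(tokens, mask), mask)
    # Mean squared error computed on the masked positions only.
    return sum((preds[i] - targets[i]) ** 2 for i in preds) / len(preds)
```

Calling `mim_loss([0.1, 0.4, 0.3, 0.9, 0.5, 0.2, 0.8, 0.6])` runs one such pass; in a real method each function would be replaced by a learned or more structured component, and the loss would drive gradient updates.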