Pre-training with Random Orthogonal Projection Image Modeling

Maryam Haghighat, Peyman Moghadam, Shaheer Mohamed, Piotr Koniusz

TL;DR

ROPIM introduces a self-supervised pre-training method for Vision Transformers that substitutes traditional patch masking with random orthogonal projection (count sketching) of patch embeddings, yielding a continuous masking pattern with a provable bound on reconstruction noise. The model processes the projected embeddings and uses the projection complement to guide the recovery of removed information, without requiring a large decoder or tokenizer. Empirically, ROPIM achieves state-of-the-art or competitive Top-1 accuracy on ImageNet-1k and strong transfer results on iNaturalist, CIFAR, and ADE20K, while reducing pre-training time compared to competing methods. This approach offers a principled, efficient alternative to MIM that preserves rich masking-unmasking dynamics and is readily applicable to standard ViT architectures.

Abstract

Masked Image Modeling (MIM) is a powerful self-supervised strategy for visual pre-training without the use of labels. MIM applies random crops to input images, processes them with an encoder, and then recovers the masked inputs with a decoder, which encourages the network to capture and learn structural information about objects and scenes. The intermediate feature representations obtained from MIM are suitable for fine-tuning on downstream tasks. In this paper, we propose an Image Modeling framework based on random orthogonal projection instead of binary masking as in MIM. Our proposed Random Orthogonal Projection Image Modeling (ROPIM) reduces spatial-wise token information under a guaranteed bound on the noise variance and can be considered as masking the entire spatial image area with locally varying masking degrees. Since ROPIM uses a random subspace for the projection that realizes the masking step, the readily available complement of the subspace can be used during unmasking to promote recovery of the removed information. We show that using random orthogonal projection leads to superior performance compared to crop-based masking, and we demonstrate state-of-the-art results on several popular benchmarks.

Paper Structure (20 sections, 1 theorem, 4 equations, 10 figures, 10 tables, 1 algorithm)

Key Result

Proposition 1

Let $K$ and $K'$ be the sizes of the input and the projected output. Let vector $\mathbf{h}\!\in\!\mathcal{I}_{K'}^K$ contain $K$ uniformly drawn integer numbers from $\{1,\cdots,K'\}$ and vector $\mathbf{s}\!\in\!\{-1,1\}^{K}$ contain $K$ uniformly drawn values from $\{-1,1\}$. The projection matri
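
The statement above is cut off before the projection matrix is defined. As a rough illustration only, the NumPy sketch below assumes the standard count-sketch construction, in which column $i$ of a $K'\times K$ matrix has a single non-zero entry $s_i$ placed in row $h_i$. That construction, the function name `count_sketch_matrix`, and the 0-based indexing are assumptions made for this example and are not recovered from the truncated text.

```python
import numpy as np

def count_sketch_matrix(K, K_prime, rng):
    """Hypothetical helper: build a K' x K count-sketch matrix.

    Assumes the standard construction P[h_i, i] = s_i (an assumption,
    since the source statement is truncated). Indices are 0-based here,
    whereas the text draws h from {1, ..., K'}.
    """
    h = rng.integers(0, K_prime, size=K)           # uniform row index per input coordinate
    s = rng.choice(np.array([-1.0, 1.0]), size=K)  # uniform random sign per input coordinate
    P = np.zeros((K_prime, K))
    P[h, np.arange(K)] = s                         # exactly one non-zero entry per column
    return P, h, s

rng = np.random.default_rng(0)
P, h, s = count_sketch_matrix(K=16, K_prime=4, rng=rng)

phi = rng.standard_normal(16)  # stand-in for one K-dimensional input vector
sketch = P @ phi               # projected output of size K'
```

Under this assumption, each output coordinate is a signed sum of the inputs hashed to it, so the product `P @ phi` can also be computed without materializing the matrix.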

Figures (10)

  • Figure 1: Training efficiency of ROPIM vs. other methods. ROPIM achieves a higher accuracy (see also LGP-ROPIM) with a lower training time. The blue and yellow regions indicate fast methods and high-accuracy methods, respectively. ROPIM is both accurate and fast (the green region).
  • Figure 2: Our proposed Random Orthogonal Projection Image Modeling (ROPIM) vs. Masked Image Modeling (MIM). MIM masks patches of an input image, passes them to the backbone, and then performs unmasking. ROPIM orthogonally projects patch embeddings onto a random subspace, passes them to the backbone, and then applies the complement of the orthogonal projection. Thus, the loss focuses on the recovery of the lost information.
  • Figure 3: For MIM, the unmasked parts of the recovered image, combined with the masked parts, approximate the input image. Likewise, our randomly projected tokens and the complement of the projection (the equivalent of unmasking) along spatial modes approximately recover the input when added together.
  • Figure 4: Left to right: original image, masking, unmasking, ROP, complement of ROP. Notice the "continuous" masking nature of ROP and complement of ROP.
  • Figure 5: Understanding the projection of $\boldsymbol{\phi}$ onto the unitary projection matrix $\mathbf{P}$ (subspace), given as $\mathbf{P}\boldsymbol{\phi}$, and its retraction, given as $\boldsymbol{\phi}'=\mathbf{P}^\dagger\mathbf{P}\boldsymbol{\phi}$. The projection matrix $\bar{\mathbf{P}}$ (subspace) complementary to $\mathbf{P}$ is also indicated. The vector $\boldsymbol{\phi}$ projected onto $\bar{\mathbf{P}}$ and then retracted from it is given as $\boldsymbol{\phi}''=\bar{\mathbf{P}}^\dagger\bar{\mathbf{P}}\boldsymbol{\phi}$. Notice that $\boldsymbol{\phi}'+\boldsymbol{\phi}''=\boldsymbol{\phi}$. The projection becomes lossy when $\mathbf{P}^\dagger\mathbf{P}+\bar{\mathbf{P}}^\dagger\bar{\mathbf{P}}\neq\boldsymbol{\mathds{I}}$, i.e., when the full identity matrix is not recovered.
  • ...and 5 more figures
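
The decomposition described in the Figure 5 caption can be checked numerically. The sketch below is a minimal illustration under stated assumptions: it builds a count-sketch-style matrix (one signed entry per column), normalizes its non-empty rows to unit length, and uses $\mathbf{I}-\mathbf{P}^\dagger\mathbf{P}$ as a stand-in for the complement projector $\bar{\mathbf{P}}^\dagger\bar{\mathbf{P}}$. The row normalization, the stand-in complement, and all variable names are assumptions of this example, and only the lossless case is covered.

```python
import numpy as np

rng = np.random.default_rng(0)
K, K_prime = 12, 5

# Assumed count-sketch-style matrix: column i has one entry s_i in row h_i.
h = rng.integers(0, K_prime, size=K)
s = rng.choice(np.array([-1.0, 1.0]), size=K)
P = np.zeros((K_prime, K))
P[h, np.arange(K)] = s

# Rows have disjoint supports, so scaling each non-empty row to unit norm
# makes the non-empty rows orthonormal (assumed normalization).
row_norms = np.linalg.norm(P, axis=1, keepdims=True)
P = P / np.where(row_norms > 0, row_norms, 1.0)

# Orthogonal projector onto the row space of P, and its complement.
proj = np.linalg.pinv(P) @ P   # plays the role of P^dagger P
comp = np.eye(K) - proj        # stand-in for the complement projector

phi = rng.standard_normal(K)
phi_kept = proj @ phi          # phi'  : retained by the projection
phi_removed = comp @ phi       # phi'' : removed by the projection

# The two parts add back to the input and are mutually orthogonal.
assert np.allclose(phi_kept + phi_removed, phi)
assert np.isclose(phi_kept @ phi_removed, 0.0)
```

Because the complement here is defined as the identity minus the projector, the two parts always sum to the input. The lossy case mentioned in the caption would require a separately constructed complement matrix, which this example does not attempt.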

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • proof
  • proof