MixMask: Revisiting Masking Strategy for Siamese ConvNets

Kirill Vishniakov, Eric Xing, Zhiqiang Shen

TL;DR

MixMask replaces erased areas with content from a different image, countering the information depletion seen in erase-based masking. It surpasses Masked Siamese ConvNets (MSCN), making it a more advantageous masking solution for Siamese ConvNets.

Abstract

Recent progress in self-supervised learning has successfully combined Masked Image Modeling (MIM) with Siamese Networks, harnessing the strengths of both methodologies. Nonetheless, certain challenges persist when integrating conventional erase-based masking into Siamese ConvNets. Two primary concerns are: (1) ConvNets process data continuously and cannot exclude non-informative masked regions, which reduces training efficiency compared to the ViT architecture; (2) erase-based masking is misaligned with the contrastive objective, which distinguishes this setting from MIM. To address these challenges, this work introduces a novel filling-based masking approach, termed MixMask. The proposed method replaces erased areas with content from a different image, effectively countering the information depletion seen in traditional masking methods. Additionally, we introduce an adaptive loss function that captures the semantics of the newly patched views, so that the approach integrates cleanly into the architectural framework. We empirically validate the effectiveness of our approach through comprehensive experiments across various datasets and application scenarios. The findings show improved performance in linear probing, semi-supervised and supervised finetuning, object detection, and segmentation. Notably, our method surpasses Masked Siamese ConvNets (MSCN), establishing MixMask as a more advantageous masking solution for Siamese ConvNets. Our code and models are publicly available at https://github.com/kirill-vish/MixMask.
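
The contrast between erase-based and filling-based masking described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `erase_mask` and `fill_mask`, the binary mask convention (1 = keep, 0 = masked), the half-image mask, and the tensor shapes are all hypothetical. The partner image is taken from the batch in reverse order, following the "reverse permutation" mentioned in the Figure 2 caption.

```python
import numpy as np


def erase_mask(images, mask):
    """Erase-based masking: pixels where mask == 0 are set to zero."""
    return images * mask


def fill_mask(images, mask):
    """Filling-based masking: pixels where mask == 0 come from a partner image.

    Assumption: the partner of image i is image (n - 1 - i), i.e. the batch
    read in reverse order.
    """
    partners = images[::-1]
    return images * mask + partners * (1.0 - mask)


rng = np.random.default_rng(0)
batch = rng.random((4, 3, 8, 8))   # hypothetical batch, layout (N, C, H, W)

mask = np.ones((1, 1, 8, 8))       # shared across the batch and channels
mask[..., :, 4:] = 0.0             # mask out the right half of every image

erased = erase_mask(batch, mask)   # right half carries no information
mixed = fill_mask(batch, mask)     # right half carries the partner's content
keep_ratio = float(mask.mean())    # fraction of each image that is kept
```

In the erased batch the masked half is empty, whereas in the mixed batch it holds pixels from another image, so a ConvNet that must process every pixel still sees informative content everywhere. The kept fraction `keep_ratio` is the kind of quantity a loss could use to weight how similar a mixed view is to each source image; how the paper's adaptive loss actually does this is not specified in the text above.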

Paper Structure (18 sections, 3 equations, 7 figures, 8 tables)


Figures (7)

  • Figure 1: Illustration of the different mask patterns with a mask grid size of 8. (a) and (b) are input images. (c) is the discrete/random mask pattern, and (d) and (e) are mixed images using this mask. (f) is the blocked mask pattern, and (g) and (h) are mixed images with a blocked mask. Discrete masking, (c) to (e), breaks the completeness of an object, which matters for the contrastive loss because that loss operates at the global object level. Blocked masking, (f) to (h), preserves important global features, leading to superior performance.
  • Figure 2: Illustration of the proposed filling-based masking strategy. The gray dashed box shows the erase/Gaussian-noise masking strategy [jing2022masked]. A formal definition of a switch image in the case of reverse permutation is given in the paper's equation labeled eq:switch.
  • Figure 3: Illustration of Masked Siamese ConvNets (left) and our proposed framework (right). The MixMask branch incorporates asymmetry into the loss function design by generating images with different degrees of similarity to the images in the original branch. In the MixMask branch, the image of the truck appears twice, each time with a different level of similarity to the image in the original branch, because the masked regions are filled with contents of another image.
  • Figure 4: MixMask outperforms MSCN on ImageNet-1K by 1%.
  • Figure 5: Results with different training budgets and base frameworks on CIFAR-100. MixMask consistently performs better than the baseline for every configuration.
  • ...and 2 more figures
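
The two mask patterns contrasted in the Figure 1 caption can be sketched as follows. This is an illustrative sketch only: the function names, the mask ratio of 0.5, the square block shape, and the 224-pixel image size are assumptions made for the example, and the paper's actual mask sampling procedure may differ. Only the grid size of 8 comes from the caption.

```python
import numpy as np


def discrete_mask(grid, ratio, rng):
    """Mask a fraction `ratio` of grid cells chosen independently at random.

    Convention (assumed): 1 = keep, 0 = masked.
    """
    n_cells = grid * grid
    n_masked = int(round(ratio * n_cells))
    flat = np.ones(n_cells)
    flat[rng.choice(n_cells, size=n_masked, replace=False)] = 0.0
    return flat.reshape(grid, grid)


def blocked_mask(grid, ratio, rng):
    """Mask one contiguous square covering roughly `ratio` of the grid."""
    side = max(1, int(round(grid * np.sqrt(ratio))))
    top = rng.integers(0, grid - side + 1)
    left = rng.integers(0, grid - side + 1)
    mask = np.ones((grid, grid))
    mask[top:top + side, left:left + side] = 0.0
    return mask


def upsample(mask, image_size):
    """Nearest-neighbour upsampling of a grid mask to pixel resolution."""
    scale = image_size // mask.shape[0]
    return np.kron(mask, np.ones((scale, scale)))


rng = np.random.default_rng(0)
d = discrete_mask(8, 0.5, rng)     # scattered masked cells
b = blocked_mask(8, 0.5, rng)      # one contiguous masked square
pixel_mask = upsample(b, 224)      # each grid cell becomes a 28x28 patch
```

With the discrete pattern the masked cells are scattered across the grid, so an object is likely to be cut into fragments. With the blocked pattern the masked cells form a single region, which leaves the rest of the object intact, matching the caption's argument for why blocked masking suits a contrastive loss.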