Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

Jie Zhu, Jiyang Qi, Mingyu Ding, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, Jingdong Wang

TL;DR

The paper investigates what self-supervised pretraining learns by focusing on part-aware representations. It introduces a part-to-whole view of contrastive learning (CL) and a part-to-part view of masked image modeling (MIM), and validates both with experiments on object-level and part-level tasks using encoders pretrained with DeiT, MoCo v3, DINO, CAE, MAE, BEiT, and iBOT. The results show that supervised learning excels at object-level recognition, while self-supervised methods, particularly iBOT, CAE, and combined CL+MIM approaches, excel at part-level recognition; MAE tends to encode lower-level cues. These findings suggest that SSL pretraining captures fine-grained, part-aware representations, and that blending CL and MIM is a promising design pattern for broad semantic coverage.

Abstract

In this paper, we are interested in understanding self-supervised pretraining by studying the capability of self-supervised representation pretraining methods to learn part-aware representations. The study is mainly motivated by the observation that random views, used in contrastive learning, and random masked (visible) patches, used in masked image modeling, are often about object parts. We explain that contrastive learning is a part-to-whole task: the projection layer hallucinates the whole object representation from the object part representation learned by the encoder. We also explain that masked image modeling is a part-to-part task: the masked patches of the object are hallucinated from the visible patches. This explanation suggests that the self-supervised pretrained encoder is required to understand the object part. We empirically compare off-the-shelf encoders pretrained with several representative methods on object-level recognition and part-level recognition. The results show that the fully-supervised model outperforms self-supervised models for object-level recognition, and that most self-supervised contrastive learning and masked image modeling methods outperform the fully-supervised method for part-level recognition. We also observe that combining contrastive learning and masked image modeling further improves performance.
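
As a rough illustration of the part-to-part setting described above, the following sketch randomly partitions the patch indices of an image into visible and masked sets. It is not the authors' code: the function name `random_patch_split`, the 14x14 patch grid, and the 0.5 mask ratio are assumptions chosen only for the example.

```python
import random


def random_patch_split(grid_h, grid_w, mask_ratio, seed=None):
    """Randomly partition patch indices into (visible, masked) lists.

    Hypothetical helper for illustration; grid size and mask ratio
    are assumptions, not values taken from the paper.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    num_patches = grid_h * grid_w
    num_masked = int(round(num_patches * mask_ratio))
    rng = random.Random(seed)          # local RNG so the split is reproducible
    order = list(range(num_patches))
    rng.shuffle(order)                 # random permutation of patch indices
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Assumed setup: a 14x14 grid of patches with half of them masked.
visible, masked = random_patch_split(14, 14, mask_ratio=0.5, seed=0)
print(len(visible), len(masked))
```

In a masked image modeling setup, only the `visible` patches would be fed to the encoder, and the pretext task would predict content for the `masked` ones. Because the split is random, both sets typically cover only parts of the object.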

Paper Structure (20 sections, 3 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: (a) original image, (b-c) two random crops, and (d-e) masked and visible patches.
  • Figure 2: Top-24 patch retrieval results with three frozen encoders (DeiT, MoCo v3, and CAE), taking the patch in the red box as the query. The results retrieved by CAE and MoCo v3 are about the object part (wing and dog mouth) and are more precise than those of DeiT (which are about the whole object), implying that the self-supervised pretraining methods CAE and MoCo v3 are stronger at learning part-aware representations than the fully-supervised method DeiT. Details can be found in the method section.
  • Figure 3: The pipeline of a typical contrastive learning approach. Two augmented views, red box and blue box, are generated from the original image. The augmented view in red is fed into the encoder and the projector, and then the predictor (which does not appear in earlier works such as MoCo [Chen et al., 2021] and SimCLR [Chen et al., 2020]), and the view in blue is fed into the encoder and the projector. The two outputs are expected to be aligned. The gradient is stopped for the bottom stream.
  • Figure 4: Illustration of patch search results using encoded representations and projections (pretrained with MoCo v3). Left: patch search results with encoded representations. Right: patch search results with projections. In each result, the small patch enclosed by the red box is taken as the query. For encoded representations, the returned patches are about the same part; for projections, the returned patches are about the same object, which supports the part-to-whole hypothesis.
  • Figure 5: The pipeline of an MIM approach, context autoencoder (CAE). An augmented view (in blue) of the image is partitioned into visible and masked patches. The CAE approach feeds visible patches into the encoder to extract their representations $\mathbf{X}_v$. It then completes the pretext task by predicting the representations $\mathbf{X}_m$ of the masked patches from the visible patches in the encoded representation space, using a latent contextual regressor and an alignment constraint, and by mapping the predicted representations $\mathbf{X}_m$ of the masked patches to the targets. The pretrained encoder in (a) is applied to downstream tasks by simply replacing the pretext task part (b, c) with the downstream task completion part.
  • ...and 5 more figures
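
The patch retrieval used in Figures 2 and 4 can be sketched as a nearest-neighbor search over patch features. This is a minimal sketch, not the authors' implementation: it assumes cosine similarity as the ranking score (the excerpt does not state the metric), and the names `cosine_similarity` and `retrieve_top_k` and the toy 2-D features are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, gallery, k):
    """Return indices of the k gallery features most similar to the query.

    Ties are broken by the lower gallery index.
    """
    scored = [(cosine_similarity(query, feat), idx)
              for idx, feat in enumerate(gallery)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy gallery of patch features (assumed values, for illustration only).
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top = retrieve_top_k([1.0, 0.0], gallery, k=2)
print(top)
```

In the setting of the figures, `query` would be the frozen-encoder feature of the patch in the red box, `gallery` would hold the features of candidate patches from other images, and `k` would be 24.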