Enhancing Representations through Heterogeneous Self-Supervised Learning

Zhong-Yu Li, Bo-Wen Yin, Yongxiang Liu, Li Liu, Ming-Ming Cheng

TL;DR

Heterogeneous Self-Supervised Learning (HSSL) augments a base neural architecture by attaching an auxiliary head of a different architecture during pre-training. The base model learns to mimic the heterogeneous head’s representations, enabling the transfer of missing architectural characteristics without changing the base’s structure. A key insight is that greater base–head architectural discrepancy yields larger performance gains, which motivates a fast, label-free search to identify the best auxiliary head and simple methods to further enlarge the discrepancy. HSSL is compatible with a wide range of SSL methods and yields strong improvements across image classification, semantic segmentation, instance segmentation, and object detection, while adding only modest training overhead. This approach offers a flexible, general pathway to fuse cross-architecture knowledge in self-supervised learning with practical benefits for diverse vision tasks.

Abstract

Incorporating heterogeneous representations from different architectures has facilitated various vision tasks; for example, some hybrid networks combine transformers and convolutions. However, the complementarity between such heterogeneous architectures has not been well exploited in self-supervised learning. Thus, we propose Heterogeneous Self-Supervised Learning (HSSL), which forces a base model to learn from an auxiliary head whose architecture is heterogeneous to that of the base model. In this process, HSSL endows the base model with new characteristics through representation learning, without structural changes. To understand HSSL comprehensively, we conduct experiments on various heterogeneous pairs, each consisting of a base model and an auxiliary head. We find that the representation quality of the base model improves as the architectural discrepancy between the two grows. This observation motivates a search strategy that quickly determines the most suitable auxiliary head for a given base model, as well as several simple but effective methods to enlarge the model discrepancy. HSSL is compatible with various self-supervised methods and achieves superior performance on various downstream tasks, including image classification, semantic segmentation, instance segmentation, and object detection. The code is available at https://github.com/NK-JittorCV/Self-Supervised/.
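
The training signal described in the abstract can be illustrated with a minimal sketch. It assumes a DINO-style cross-entropy objective with temperature-scaled softmax, which is only one of the self-supervised methods HSSL is said to be compatible with. The function names, the temperature values, and the equal weighting of the two terms are illustrative assumptions, not the authors' implementation. The one element taken from the source is that the auxiliary head's output acts as the target for both the base model and the auxiliary head.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_norm for v in scaled]


def cross_entropy(target_probs, student_logits, temperature):
    """Cross-entropy between a fixed target distribution and a student output."""
    log_probs = log_softmax(student_logits, temperature)
    return -sum(t * lp for t, lp in zip(target_probs, log_probs))


def hssl_loss(teacher_head_logits, student_base_logits, student_head_logits,
              teacher_temp=0.04, student_temp=0.1):
    """Hypothetical HSSL-style objective (assumed form, not the paper's equation).

    The teacher-branch auxiliary head provides the target. In a real training
    loop this target would be treated as a constant (no gradient flows into it).
    """
    target = [math.exp(lp) for lp in log_softmax(teacher_head_logits, teacher_temp)]
    # Term 1: the base model mimics the heterogeneous auxiliary head.
    loss_base = cross_entropy(target, student_base_logits, student_temp)
    # Term 2: the student-branch auxiliary head is supervised by the same target.
    loss_head = cross_entropy(target, student_head_logits, student_temp)
    return loss_base + loss_head
```

In this sketch, a base model whose outputs already agree with the auxiliary head incurs a small loss, while one that disagrees incurs a large one, which is the pressure that transfers the head's characteristics to the base model.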

Paper Structure (15 sections, 11 equations, 7 figures, 20 tables)


Figures (7)

  • Figure 1: Illustration of the heterogeneous self-supervised learning (HSSL). (a) General self-supervised learning methods make a base model supervise itself. (b) The HSSL supervises the base model under the guidance of an auxiliary head whose architecture is heterogeneous to the base model, making the base model learn new characteristics.
  • Figure 2: Our HSSL framework. The architectures of the base model and the auxiliary head are heterogeneous. The representations extracted by the auxiliary head supervise the two networks simultaneously. The base model and the auxiliary head can be arbitrary architectures, such as ViT (Dosovitskiy et al., 2020), Swin (Liu et al., 2021), ConvNeXt (Liu et al., 2022), ResNet (He et al., 2016), ResMLP (Touvron et al., 2021), and PoolFormer (Yu et al., 2022).
  • Figure 3: In (a)-(c), we visualize the relationship between the improvement of the base model (ViT-S/16) and three factors: (a) the representation discrepancy between the base model and the auxiliary head, (b) the number of parameters of a 1-layer auxiliary head, and (c) the capacity of the architecture used to build the auxiliary head. As a proxy for the capacity of each architecture, we use its supervised classification accuracy on ImageNet-1K, as reported in the official paper of that architecture. In (d), we show a consistent trend between the discrepancies obtained by searching and those obtained by examining each auxiliary head individually. In all panels, the size of each dot is positively related to the improvement brought by the corresponding auxiliary head.
  • Figure 4: Training dynamics of the discrepancy or similarity between the base model and the auxiliary head during pre-training. Left: The discrepancy $\mathcal{D}$ (defined in Eq. (\ref{eq:kl})) between the base model and the auxiliary head when using ConvNeXt (Liu et al., 2022) or ResMLP (Touvron et al., 2021) as the auxiliary head and ViT (Dosovitskiy et al., 2020) as the base model. Middle: The feature-level CKA similarity between the base model and the auxiliary head. Right: The feature-level Procrustes similarity between the base model and the auxiliary head.
  • Figure 5: Illustration of the quick search strategy. Given $N$ distinct architectures, we construct $N$ different auxiliary heads, where $h_{1/2}^i$ represents the auxiliary head built using the $i$-th architecture. The subscripts 1 and 2 indicate the teacher and student branches, respectively. In the figure, the red dotted lines and solid lines correspond to the first and second loss terms of Eq. (\ref{eq:L_total_search_each}), respectively. Projection heads are omitted for clarity.
  • ...and 2 more figures
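
The quantities named in the captions of Figures 3 to 5 can be made concrete with a small sketch. The exact definition of $\mathcal{D}$ is not reproduced in this excerpt; the code below assumes it is a KL divergence between the output distributions of the auxiliary head and the base model, averaged over a batch, and the direction KL(head || base) is a guess. The selection rule (prefer the candidate head with the largest discrepancy) follows the observation stated in the abstract, but it is a simplification of the search procedure in Figure 5. Linear CKA is the standard formulation on column-centered features. All function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_discrepancy(head_probs, base_probs):
    """Assumed discrepancy: KL(head || base) averaged over a batch of samples."""
    pairs = list(zip(head_probs, base_probs))
    return sum(kl_divergence(h, b) for h, b in pairs) / len(pairs)


def pick_head(base_probs, candidates):
    """Return the candidate head name with the largest discrepancy, plus all scores."""
    scores = {name: mean_discrepancy(probs, base_probs)
              for name, probs in candidates.items()}
    return max(scores, key=scores.get), scores


def _center(features):
    """Subtract the per-column mean from an n x d feature matrix."""
    n = len(features)
    means = [sum(col) / n for col in zip(*features)]
    return [[v - mu for v, mu in zip(row, means)] for row in features]


def _cross_fro_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows."""
    x, y = _center(x), _center(y)
    return _cross_fro_sq(x, y) / math.sqrt(_cross_fro_sq(x, x) * _cross_fro_sq(y, y))
```

A high discrepancy and a low CKA value both indicate that the auxiliary head represents the data differently from the base model, which is the regime in which the paper reports the largest gains.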