POA: Pre-training Once for Models of All Sizes

Yingying Zhang; Xin Guo; Jiangwei Lao; Lei Yu; Lixiang Ru; Jian Wang; Guo Ye; Huimei He; Jingdong Chen; Ming Yang

TL;DR

POA addresses the need to deploy vision models under varying resource constraints. In a single self-supervised pre-training session it trains a teacher, an intact student, and an elastic student, which is a sub-network sampled from the intact student that shares its parameters across widths and depths. After pre-training, on the order of a hundred sub-networks of different sizes can be extracted (143 elastic ViTs in Figure 1). Both students are distilled from the teacher across views, and the elastic student is additionally distilled from the intact student on the same view, all under a multi-crop training regime. Key contributions include the design of Elastic MSA, Elastic MLP, and Elastic LN, a dual-distillation loss with multiple projection heads, and extensive empirical validation showing state-of-the-art k-NN and linear probing (LP) results and strong downstream transfer across ViT, Swin, and ResNet backbones. The approach improves deployment flexibility and efficiency by producing ready-to-use models of various sizes without extra pre-training, and it offers a path toward scalable multimodal models.

Abstract

Large-scale self-supervised pre-training has paved the way for one foundation model to handle many different vision tasks. Most pre-training methodologies train a single model of a fixed size at a time. However, computation and storage constraints in real-world scenarios call for a series of models of different sizes, which takes substantial effort to develop. In this study, we therefore propose a novel tri-branch self-supervised training framework, termed POA (Pre-training Once for All), to tackle this issue. Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm. At each pre-training step, we randomly sample a sub-network from the original student to form the elastic student and train all branches in a self-distilling fashion. Once pre-trained, POA allows the extraction of pre-trained models of diverse sizes for downstream tasks. Remarkably, the elastic student facilitates the simultaneous pre-training of multiple models of different sizes, and it also acts as an additional ensemble of models of various sizes that enhances representation learning. Extensive experiments, including k-nearest neighbors, linear probing evaluation, and assessments on multiple downstream tasks, demonstrate the effectiveness and advantages of POA. It achieves state-of-the-art performance using ViT, Swin Transformer, and ResNet backbones, producing around a hundred models of different sizes through a single pre-training session. The code is available at: https://github.com/Qichuzyy/POA.
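
The per-step sampling described in the abstract can be illustrated with a minimal sketch. The width and depth grids below are assumptions, chosen only so that the number of combinations matches the 143 elastic ViTs reported in Figure 1; uniform sampling and the names `WIDTHS`, `DEPTHS`, `sample_elastic_config`, and `pretraining_schedule` are likewise hypothetical and not taken from the released code.

```python
import random

# Assumed search space (not confirmed by the text above): 11 widths x 13 depths
# gives 143 configurations, the count of elastic ViTs shown in Figure 1.
WIDTHS = list(range(384, 1024 + 1, 64))  # candidate embedding widths
DEPTHS = list(range(12, 24 + 1))         # candidate numbers of blocks


def sample_elastic_config(rng):
    """Pick one (width, depth) pair uniformly at random for a training step."""
    return rng.choice(WIDTHS), rng.choice(DEPTHS)


def pretraining_schedule(num_steps, seed=0):
    """Return the elastic-student configuration used at each step."""
    rng = random.Random(seed)
    return [sample_elastic_config(rng) for _ in range(num_steps)]


if __name__ == "__main__":
    print("configurations:", len(WIDTHS) * len(DEPTHS))
    for step, (width, depth) in enumerate(pretraining_schedule(3)):
        print(f"step {step}: elastic student width={width}, depth={depth}")
```

Because a different configuration is drawn at every step, many sub-networks receive gradient updates over the course of one session, which is what allows models of several sizes to be extracted afterwards.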

Paper Structure (62 sections, 14 equations, 11 figures, 19 tables, 1 algorithm)

Introduction
Related Work
Self-supervised Learning
Dynamic Architecture
POA Self-supervised Learning Framework
Design of Elastic Student
Elastic MSA
Elastic MLP
Elastic LN
Distillation between Views
Overall Loss of POA
Experiments
Implementation Details
Backbones
Pre-Training Setup
...and 47 more sections

Figures (11)

Figure 1: The k-NN evaluation accuracy of 143 elastic ViTs derived from the ViT-L/16 teacher model pre-trained with POA.
Figure 2: Overview of the POA SSL: Given an image $x$, two augmented views $x_a$ and $x_b$ are generated. These views are input into three branches: a teacher, an intact student, and an elastic student, the latter being derived from the intact student. POA optimizes distillation losses in a twofold manner: the intact and the elastic students are distilled from the teacher using the cross-view data respectively, and additionally, the elastic student is distilled from the intact student using the same-view data.
Figure 3: Illustration of the elastic MSA in an elastic ViT block. To be concise, we simply exclude the projection layers that correspond to $K$ and $V$ in each head.
Figure 4: Illustration of different variants of POA.
Figure 6: Robustness to Occlusion and Shuffling.
...and 6 more figures
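
The three distillation terms described in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions: the temperatures are hypothetical placeholder values, the three terms are summed without weights, only one view direction is shown (teacher on $x_a$, students on $x_b$), and the function names are invented for this example rather than taken from the authors' code.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    lse = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - lse for v in scaled]


def distill(target_logits, student_logits, t_target=0.04, t_student=0.1):
    """Cross-entropy between the softened target and student distributions.

    The temperature defaults are hypothetical, not values from the paper.
    """
    target = [math.exp(v) for v in log_softmax(target_logits, t_target)]
    student_log = log_softmax(student_logits, t_student)
    return -sum(p * lq for p, lq in zip(target, student_log))


def poa_step_loss(teacher_a, intact_b, elastic_b):
    """Three terms for one view direction; unweighted sum is an assumption."""
    terms = {
        # cross-view: intact student (view b) learns from teacher (view a)
        "intact_from_teacher": distill(teacher_a, intact_b),
        # cross-view: elastic student (view b) learns from teacher (view a)
        "elastic_from_teacher": distill(teacher_a, elastic_b),
        # same-view: elastic student (view b) learns from intact student (view b)
        "elastic_from_intact": distill(intact_b, elastic_b),
    }
    terms["total"] = sum(terms.values())
    return terms


if __name__ == "__main__":
    losses = poa_step_loss([0.2, 0.1, -0.3], [0.1, 0.2, -0.2], [0.0, 0.3, -0.1])
    for name, value in losses.items():
        print(f"{name}: {value:.4f}")
```

In an actual training loop the target logits would be treated as constants (no gradient flows into the teacher through the loss), and the same terms would typically be computed with the two views swapped.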
