Lightweight Model Pre-training via Language Guided Knowledge Distillation

Mingsheng Li; Lin Zhang; Mingzhen Zhu; Zilong Huang; Gang Yu; Jiayuan Fan; Tao Chen

TL;DR

This work introduces Language-Guided Distillation (LGD) to pre-train lightweight computer vision models without labels by leveraging task-oriented language guidance. LGD builds a Textual Semantics Bank (TSB) from a pre-trained text encoder and a momentum-updated Visual Semantics Bank (VSB) via Language-Guided Knowledge Aggregation (LGKA), then applies two cross-space alignment losses to drive the student to imitate the teacher in both visual and textual spaces. The resulting framework, with a simple joint objective $L_{LGD} = \alpha L_{VIS} + (1-\alpha) L_{TEX}$, demonstrates state-of-the-art transfer to zero-shot, semi-/fully-supervised classification, object detection, instance segmentation, and semantic segmentation across multiple datasets and backbone pairs. The approach also supports task-specific text control and generalizes to various image encoders, highlighting its practical potential for efficient, mobile-friendly model pre-training.

Abstract

This paper studies the problem of pre-training small models, which is essential for many mobile devices. Current state-of-the-art methods for this problem transfer the representational knowledge of a large network (the Teacher) into a smaller model (the Student) using self-supervised distillation, improving the performance of the small model on downstream tasks. However, existing approaches fall short in extracting the knowledge that is crucial for discerning categories in downstream tasks during distillation. In this paper, for the first time, we introduce language guidance to the distillation process and propose a new method named Language-Guided Distillation (LGD), which uses the category names of the target downstream task to help refine the knowledge transferred between the teacher and the student. To this end, we use a pre-trained text encoder to extract semantic embeddings from language and construct a textual semantic space called the Textual Semantics Bank (TSB). Furthermore, we design a Language-Guided Knowledge Aggregation (LGKA) module to construct the visual semantic space, named the Visual Semantics Bank (VSB). The task-related knowledge is transferred by driving a student encoder to mimic the similarity score distribution inferred by a teacher over the TSB and VSB. Compared with other small models obtained by either ImageNet pre-training or self-supervised distillation, experimental results show that the lightweight model distilled with the proposed LGD method achieves state-of-the-art performance on various downstream tasks, including classification, detection, and segmentation. We have made the code available at https://github.com/mZhenz/LGD.
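To make the "mimic the similarity score distribution" idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes cosine similarity, a softmax with a hypothetical temperature, and teacher and student features that already share one dimensionality; the function names (`alignment_loss`, `lgd_loss`) and default values are illustrative only. It combines the two alignment terms as in the joint objective $L_{LGD} = \alpha L_{VIS} + (1-\alpha) L_{TEX}$.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(teacher_feat, student_feat, bank, temperature=0.1):
    """Cross-entropy between teacher and student similarity distributions
    over one bank of anchors (the temperature value is an assumption)."""
    p_teacher = softmax([cosine(teacher_feat, a) for a in bank], temperature)
    p_student = softmax([cosine(student_feat, a) for a in bank], temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

def lgd_loss(teacher_feat, student_feat, vsb, tsb, alpha=0.5, temperature=0.1):
    """Joint objective: alpha * L_VIS + (1 - alpha) * L_TEX."""
    l_vis = alignment_loss(teacher_feat, student_feat, vsb, temperature)
    l_tex = alignment_loss(teacher_feat, student_feat, tsb, temperature)
    return alpha * l_vis + (1.0 - alpha) * l_tex
```

Under these assumptions the loss for a fixed teacher is smallest when the student reproduces the teacher's similarity distribution, and `alpha` trades off the visual-space term against the textual-space term.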

Paper Structure (26 sections, 9 equations, 5 figures, 8 tables)

Introduction
Related Work
Small Model Pre-training
Vision-Language Learning
Knowledge Distillation
The Proposed Method
Overview
Language-Guided Knowledge Aggregation
Language-aware Knowledge Transfer
Generalize to More Image Encoders
Text Control on Downstream Tasks
Experiment
Setup
Main Results
Results on Zero-shot Classification.
...and 11 more sections

Figures (5)

Figure 1: (a) Comparison of previous self-supervised distillation (SSD) and the proposed Language-Guided Distillation (LGD) on ImageNet-1k; see the classification experiments for details. (b) Comparison of different input texts for LGD on ImageNet-1k, followed by fine-tuning on downstream tasks, e.g., classification on Caltech-256 and segmentation on Cityscapes and ADE20K. Using task-related category names as the input language for knowledge distillation brings consistent improvements; see the downstream-task experiments for details.
Figure 2: Overview of the proposed Language-Guided Distillation (LGD) framework. First, all textual features are extracted once by feeding [text prompt] + [class] into a pre-trained text encoder and are stored in the Textual Semantics Bank (TSB). Visual features are extracted by feeding unlabeled images into a pre-trained image encoder (the teacher) and into the student. Then, a Language-Guided Knowledge Aggregation (LGKA) module classifies the visual features of each batch by the textual anchors and maintains a Visual Semantics Bank (VSB) that stores the centroid features of the different categories via momentum updating. Finally, a cross-entropy loss constrains the student's similarity distribution over the anchors to match the teacher's, in both the textual and the visual space. Best viewed in color.
Figure 3: The Grad-CAM visualization shows the different attention maps of distilled RN18 in columns (a)-(d). The first column shows the attention map of CLIP pre-trained RN50, which is the teacher model. Columns (a)-(d) show the visualization results of SEED, LGD with only visual space alignment, LGD with only textual space alignment, and LGD with both losses, respectively.
Figure 4: Visualization of feature distributions. (a) CLIP RN50 (teacher); (b) LGD RN18; (c) SEED RN18. Different colors represent data samples of different classes. Zoom in for details. Best viewed in color.
Figure 5: Visualization of feature distributions. (a) LGD; (b) SEED. Different colors represent different types of features, including teacher feature, student feature and text feature. The numbers represent the classes of data samples. Zoom in for details. Best viewed in color.
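The LGKA step described in the Figure 2 caption (classify each batch of visual features by the textual anchors, then momentum-update per-category centroids) could be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: nearest-anchor assignment by cosine similarity, a per-batch mean per category, an exponential moving average with a hypothetical momentum of 0.99, and categories that receive no features in a batch left unchanged. How the VSB is initialized is not specified here, so the caller supplies it. The names `assign_to_anchor` and `update_vsb` are made up for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_to_anchor(feature, tsb):
    """Index of the textual anchor most similar to a visual feature
    (argmax assignment is an assumption of this sketch)."""
    sims = [cosine(feature, anchor) for anchor in tsb]
    return max(range(len(tsb)), key=lambda k: sims[k])

def update_vsb(vsb, batch_features, tsb, momentum=0.99):
    """Return a new VSB after one batch; the input VSB is not modified."""
    # Group the batch by the category each feature is assigned to.
    groups = {}
    for feat in batch_features:
        groups.setdefault(assign_to_anchor(feat, tsb), []).append(feat)

    new_vsb = [list(centroid) for centroid in vsb]
    for k, feats in groups.items():
        dim = len(feats[0])
        mean = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        # Momentum (exponential moving average) update of the centroid.
        new_vsb[k] = [momentum * c + (1.0 - momentum) * m
                      for c, m in zip(vsb[k], mean)]
    return new_vsb
```

A larger momentum keeps the centroids more stable across batches, while a smaller one lets them track the current batch more closely; the right value here is a tuning choice, not something this page states.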
