Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments

Hyunwoo Kim, Junha Lee, Mincheol Choi, Jeonghwan Lee, Jaeshin Cho

TL;DR

Progressive Weight Loading (PWL) tackles the challenge of fast, responsive inference in resource-constrained environments by initially deploying a lightweight student model and progressively upgrading its layers with those from a pretrained teacher. A novel invertible feature converter and a multi-loss distillation training strategy enable seamless layer substitution while preserving or improving accuracy, demonstrated across VGG, ResNet, and ViT architectures. The approach achieves fast initial inference akin to the student, while gradually attaining the teacher's performance as memory permits, making it suitable for dynamic deployments in edge and mobile scenarios. The work also shows that PWL can be integrated with other compression methods and potentially extended to NLP and audio domains, offering a flexible path to scalable, responsive AI systems.

Abstract

Deep learning models have become increasingly large and complex, resulting in higher memory consumption and computational demands. Consequently, model loading times and initial inference latency have increased, posing significant challenges in mobile and latency-sensitive environments where frequent model loading and unloading are required, which directly impacts user experience. While Knowledge Distillation (KD) offers a solution by compressing large teacher models into smaller student ones, it often comes at the cost of reduced performance. To address this trade-off, we propose Progressive Weight Loading (PWL), a novel technique that enables fast initial inference by first deploying a lightweight student model, then incrementally replacing its layers with those of a pre-trained teacher model. To support seamless layer substitution, we introduce a training method that not only aligns intermediate feature representations between student and teacher layers, but also improves the overall output performance of the student model. Our experiments on VGG, ResNet, and ViT architectures demonstrate that models trained with PWL maintain competitive distillation performance and gradually improve accuracy as teacher layers are loaded, matching the final accuracy of the full teacher model without compromising initial inference speed. This makes PWL particularly suited for dynamic, resource-constrained deployments where both responsiveness and performance are critical.
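The layer-replacement idea in the abstract can be illustrated with a small, framework-free sketch. This is not the authors' implementation: the class `ProgressiveModel`, the method `load_next_teacher_block`, and the `to_student` converters are hypothetical names, the layers are toy functions on lists of floats, and the sketch assumes teacher blocks are swapped in from the input side with a converter mapping teacher features into the student's feature space at the boundary.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


class ProgressiveModel:
    """Hybrid model: the first `loaded` blocks are teacher blocks, the rest student blocks."""

    def __init__(self, student: Sequence[Layer], teacher: Sequence[Layer],
                 to_student: Sequence[Layer]):
        # Assumption: student and teacher have the same number of blocks, and there is
        # one converter per internal boundary (teacher feature space -> student space).
        assert len(student) == len(teacher)
        assert len(to_student) == len(student) - 1
        self.student = list(student)
        self.teacher = list(teacher)
        self.to_student = list(to_student)
        self.loaded = 0  # number of teacher blocks currently available

    def load_next_teacher_block(self) -> bool:
        """Mark one more teacher block as loaded; return False once all are loaded."""
        if self.loaded >= len(self.teacher):
            return False
        self.loaded += 1
        return True

    def forward(self, x: Vector) -> Vector:
        k = self.loaded
        for layer in self.teacher[:k]:      # blocks already replaced by the teacher
            x = layer(x)
        if 0 < k < len(self.student):       # cross the teacher/student boundary
            x = self.to_student[k - 1](x)
        for layer in self.student[k:]:      # remaining student blocks
            x = layer(x)
        return x


# Toy blocks (hypothetical): student features have width 1, teacher features width 2.
student_blocks = [lambda v: [v[0] + 1], lambda v: [2 * v[0]]]
teacher_blocks = [lambda v: [v[0] + 1, v[0] + 1], lambda v: [v[0] + v[1]]]
converters = [lambda v: [(v[0] + v[1]) / 2]]

model = ProgressiveModel(student_blocks, teacher_blocks, converters)
outputs = [model.forward([3.0])]            # student only
while model.load_next_teacher_block():
    outputs.append(model.forward([3.0]))    # after each additional teacher block
```

In this toy setup the student and teacher compute the same function, so every stage returns the same value; the sketch only shows the plumbing of serving requests immediately and swapping blocks in as they become available, not the accuracy gains reported in the paper.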

Paper Structure

This paper contains 29 sections, 12 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Illustration of the Progressive Weight Loading (PWL) timeline. Initial inference is performed using the lightweight student model, enabling a fast response. As layers are progressively replaced with those of the teacher model, the overall performance improves gradually toward that of the full teacher.
  • Figure 2: Conceptual overview of Progressive Weight Loading (PWL). The student model is initially loaded, and its layers are progressively replaced by those of the teacher model, starting from the input layer. This approach enables a dynamic trade-off between model size and performance, making it well-suited to resource-constrained environments such as mobile and edge computing.
  • Figure 3: Architecture of the feature converter. Since features from the teacher and student models often differ in dimensionality or channel size, they must be transformed to a common space for effective comparison. Our proposed solution employs a lightweight autoencoder-style converter, consisting of simple linear layers for both the encoder and decoder.
  • Figure 4: Training strategy for PWL. The student model is optimized using a combination of five losses to support progressive layer replacement. Hard loss and soft loss supervise the student using ground-truth labels and teacher logits, respectively. Feature loss and reconstruction loss encourage alignment of intermediate representations between the student and teacher. Random cross loss mitigates performance degradation when individual student layers are replaced with corresponding teacher layers.
  • Figure 5: Loading time vs. CIFAR-10 accuracy for student, PWL stages, and teacher models across VGG, ResNet, and ViT architectures. PWL delivers inference speed on par with the student model while steadily increasing accuracy as layers are replaced with teacher weights.
  • ...and 3 more figures
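
The captions of Figures 3 and 4 describe a linear autoencoder-style converter and a five-term training objective. The sketch below shows one way such terms could be combined. The exact formulas are not given in this summary, so everything here is an assumption: cross-entropy for the hard loss, temperature-scaled KL divergence for the soft loss, mean squared error for the feature and reconstruction losses, an encoder that maps teacher features into the student space, cross-entropy on the logits of a randomly mixed student/teacher model for the random cross loss, and equal loss weights. The function name `pwl_loss` and all parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                   # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    return -math.log(softmax(logits)[label])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Bias-free linear layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def pwl_loss(student_logits, teacher_logits, hybrid_logits, label,
             student_feat, teacher_feat, enc_w, dec_w,
             temperature: float = 4.0,
             weights: Tuple[float, ...] = (1.0, 1.0, 1.0, 1.0, 1.0)):
    # Hard loss: student prediction vs. ground-truth label.
    hard = cross_entropy(student_logits, label)
    # Soft loss: student vs. teacher logits (assumed temperature-scaled KL).
    soft = temperature ** 2 * kl_divergence(
        softmax(teacher_logits, temperature), softmax(student_logits, temperature))
    # Feature loss: encoded teacher feature should match the student feature.
    encoded = linear(enc_w, teacher_feat)
    feature = mse(encoded, student_feat)
    # Reconstruction loss: decoder should recover the teacher feature.
    reconstruction = mse(linear(dec_w, encoded), teacher_feat)
    # Random cross loss: supervise a model with randomly swapped layers;
    # `hybrid_logits` is assumed to be that mixed model's output.
    random_cross = cross_entropy(hybrid_logits, label)
    terms = (hard, soft, feature, reconstruction, random_cross)
    return sum(w * t for w, t in zip(weights, terms)), terms
```

For example, with identical student and teacher logits and a converter that encodes and decodes the teacher feature exactly, the soft, feature, and reconstruction terms vanish and only the two cross-entropy terms remain. In practice the per-term weights and the temperature would need tuning.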