ProARD: Progressive Adversarial Robustness Distillation: Provide Wide Range of Robust Students
Seyedhamidreza Mousavi, Seyedali Mousavi, Masoud Daneshtalab
TL;DR
ProARD tackles the challenge of deploying robust lightweight models across diverse edge devices by training a single dynamic network that supports a large family of robust student architectures. It uses progressive sampling to train a dynamic teacher that shares weights with its many students, then applies an accuracy-robustness predictor and an NSGA-II-based multi-objective search to select the best students under FLOPs constraints. The approach reduces training cost substantially (up to 60x) and yields robust, accurate students at the same FLOPs on CIFAR-10/100 with ResNet and MobileNet backbones. The framework enables scalable, eco-friendly deployment of robust models without retraining for each target platform.
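To make the selection step concrete, the following is a minimal sketch of non-dominated (Pareto) filtering over predicted accuracy and robustness under a FLOPs budget. It is not the full NSGA-II procedure and not the authors' implementation: the `Candidate` type, the `pareto_front` helper, and all numbers are hypothetical placeholders chosen only for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    # Hypothetical record for one student architecture.
    name: str
    flops: float        # cost in MFLOPs (illustrative values only)
    accuracy: float     # predicted clean accuracy
    robustness: float   # predicted robust accuracy


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.robustness >= b.robustness
    better = a.accuracy > b.accuracy or a.robustness > b.robustness
    return no_worse and better


def pareto_front(candidates: List[Candidate], max_flops: float) -> List[Candidate]:
    """Keep feasible candidates that no other feasible candidate dominates."""
    feasible = [c for c in candidates if c.flops <= max_flops]
    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]


# Placeholder predictions, not results from the paper.
candidates = [
    Candidate("A", 100, 0.80, 0.45),
    Candidate("B", 120, 0.82, 0.44),
    Candidate("C", 110, 0.79, 0.43),
    Candidate("D", 300, 0.90, 0.55),
]
front = pareto_front(candidates, max_flops=150)
front_names = [c.name for c in front]
```

In this toy example, candidate D is excluded by the FLOPs budget and C is dominated, so A and B remain as the trade-off options. A full NSGA-II search would additionally evolve the candidate population and use crowding distance to keep the front diverse.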
Abstract
Adversarial Robustness Distillation (ARD) has emerged as an effective method to enhance the robustness of lightweight deep neural networks against adversarial attacks. Current ARD approaches use a large robust teacher network to train a single robust lightweight student. However, because edge devices and their resource constraints vary widely, these approaches must train a new student network from scratch for each set of constraints, leading to substantial computational costs and increased CO2 emissions. This paper proposes Progressive Adversarial Robustness Distillation (ProARD), which enables efficient one-time training of a dynamic network that supports a diverse range of accurate and robust student networks without retraining. We first construct a dynamic deep neural network from dynamic layers that vary in width, depth, and expansion at each design stage, so that it supports a wide range of architectures. We then treat the largest student network as the dynamic teacher network. ProARD trains this dynamic network with a weight-sharing mechanism that jointly optimizes the dynamic teacher network and its internal student networks. However, because computing exact gradients for all students within the dynamic network is prohibitively expensive, a sampling mechanism is required to select a subset of students. We show that random student sampling in each iteration fails to produce accurate and robust students.
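As an illustration of the design space described above, the sketch below enumerates per-stage choices for depth, width, and expansion, takes the largest configuration as the teacher, and draws students uniformly at random. This is the random-sampling baseline that the abstract reports as insufficient, not the progressive scheme itself, whose schedule is not specified in this text. The choice values, `NUM_STAGES`, and all function names are assumptions made for the example.

```python
import random
from typing import Dict, List

# Hypothetical per-stage choices; the actual values are not given in this text.
SEARCH_SPACE: Dict[str, List[float]] = {
    "depth": [1, 2, 3],
    "width_mult": [0.5, 0.75, 1.0],
    "expand_ratio": [0.25, 0.5, 1.0],
}
NUM_STAGES = 4  # assumed number of design stages

Config = List[Dict[str, float]]


def largest_config() -> Config:
    """Largest choice in every stage: the student used as the dynamic teacher."""
    return [{k: max(v) for k, v in SEARCH_SPACE.items()} for _ in range(NUM_STAGES)]


def sample_config(rng: random.Random) -> Config:
    """Uniform random student: one independent choice per dimension per stage."""
    return [{k: rng.choice(v) for k, v in SEARCH_SPACE.items()} for _ in range(NUM_STAGES)]


def num_students() -> int:
    """Size of the search space: product of choices, raised to the stage count."""
    per_stage = 1
    for choices in SEARCH_SPACE.values():
        per_stage *= len(choices)
    return per_stage ** NUM_STAGES


rng = random.Random(0)  # seeded so the example is reproducible
teacher = largest_config()
students = [sample_config(rng) for _ in range(3)]
total = num_students()
```

Even this small assumed space contains 27 configurations per stage and 27^4 = 531,441 students overall, which suggests why computing exact gradients for every student in each iteration is impractical and a subset must be sampled.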
