Aligning Logits Generatively for Principled Black-Box Knowledge Distillation
Jing Ma, Xiang Xiang, Ke Wang, Yuchuan Wu, Yongbin Li
TL;DR
This work tackles cloud-to-edge knowledge distillation when the teacher is a private black-box model and local data cannot be shared. It introduces Mapping-Emulation KD (MEKD), a two-step framework. The first step, deprivatization, trains a generator to act as an inverse mapping of the teacher. The second step, distillation, trains the student by aligning the high-dimensional images that the generator produces from the teacher and student outputs. This yields a new optimization direction based on the empirical Wasserstein distance $\hat{W}$ rather than direct logit matching. Theoretical contributions include a definition of function equivalence via the Wasserstein distance, the inverse-mapping property $f_G=f_T^{-1}$, and proofs of empirical approximation, optimization direction, and a generalization bound, all anchored by a cell decomposition of the latent space. Empirically, MEKD improves on existing B2KD approaches across MNIST, CIFAR, Tiny ImageNet, and ImageNet-1K under varying data budgets and query constraints, while preserving privacy by avoiding leakage of sensitive local data. The approach offers a practical, privacy-preserving, and data-efficient pathway for cloud-to-edge KD in real-world, bandwidth-limited deployments.
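To make the notion of an empirical Wasserstein distance concrete, here is a minimal sketch in Python for the simplest case only: two equal-size, uniformly weighted samples in one dimension, where the order-1 distance reduces to the mean absolute difference between sorted values. The function name `empirical_w1` is hypothetical, and this one-dimensional special case is an illustration; it is not the $\hat{W}$ computation used by MEKD, which operates on high-dimensional image points.

```python
def empirical_w1(xs, ys):
    """Order-1 Wasserstein distance between two equal-size 1D samples.

    Assumption for illustration: both samples carry uniform weights, so the
    optimal transport plan simply pairs the i-th smallest value of xs with
    the i-th smallest value of ys.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Shifting every point by the same offset moves the whole sample by that offset.
shifted = empirical_w1([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
# The distance ignores ordering, so a permuted copy is at distance zero.
permuted = empirical_w1([3.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

The sorting step is what restricts this sketch to one dimension; in higher dimensions the pairing has to be found by solving an assignment or transport problem.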
Abstract
Black-Box Knowledge Distillation (B2KD) is the problem of cloud-to-edge model compression when the data and models hosted on the server are invisible to the edge. B2KD faces challenges such as limited Internet exchange and the disparity between edge and cloud data distributions. In this paper, we formalize a two-step workflow consisting of deprivatization and distillation, and theoretically derive a new optimization direction, from logits to cell boundary, that differs from direct logits alignment. Guided by this result, we propose a new method, Mapping-Emulation KD (MEKD), that distills a cumbersome black-box model into a lightweight one. Our method treats soft and hard responses in the same way, and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator, and 2) distillation: aligning the low-dimensional logits of the teacher and student models by reducing the distance between high-dimensional image points. For different teacher-student pairs, our method yields strong distillation performance on various benchmarks and outperforms previous state-of-the-art approaches.
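The two-step workflow can be sketched on a toy problem. The Python example below is a minimal illustration under strong assumptions and is not the authors' implementation: the teacher is a hand-written invertible linear map, the generator is set by hand to its exact inverse instead of being trained, "images" are two-dimensional points, and both loss forms (squared Euclidean distance) are chosen for simplicity. All names (`teacher`, `generator`, `inverse_mapping_error`, `distillation_loss`) are hypothetical.

```python
import random


def teacher(x):
    # Stand-in for the black-box cloud model: only its responses are observed.
    return [x[0] + x[1], x[0] - x[1]]


def generator(z):
    # Hypothetical generator, hand-set to the exact inverse of `teacher`.
    # In practice it would be learned during the deprivatization step.
    return [(z[0] + z[1]) / 2, (z[0] - z[1]) / 2]


def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def inverse_mapping_error(inputs):
    # Step 1 check: how well the generator undoes the teacher on local data.
    return sum(sq_dist(generator(teacher(x)), x) for x in inputs) / len(inputs)


def distillation_loss(student, inputs):
    # Step 2: compare teacher and student responses after mapping both
    # through the generator, i.e. in image space rather than logit space.
    total = 0.0
    for x in inputs:
        total += sq_dist(generator(teacher(x)), generator(student(x)))
    return total / len(inputs)


rng = random.Random(0)
samples = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(8)]

matching_student = teacher                  # reproduces the teacher exactly
constant_student = lambda x: [0.0, 0.0]     # ignores its input

print(inverse_mapping_error(samples))
print(distillation_loss(matching_student, samples))
print(distillation_loss(constant_student, samples))
```

A student that reproduces the teacher's responses incurs zero loss, while the constant student does not, because the generator maps different responses to different image points.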
