Aligning Logits Generatively for Principled Black-Box Knowledge Distillation

Jing Ma; Xiang Xiang; Ke Wang; Yuchuan Wu; Yongbin Li

TL;DR

This work tackles cloud-to-edge knowledge distillation when the teacher is a private black-box model and local data cannot be shared. It introduces Mapping-Emulation KD (MEKD), a two-step framework: deprivatization trains a generator to act as an inverse mapping of the teacher, and distillation then trains the student by aligning the high-dimensional images that the generator produces. This yields a new optimization direction based on the empirical Wasserstein distance $\hat{W}$ rather than on direct logit matching. The theoretical contributions are definitions of function equivalence via the Wasserstein distance, the inverse-mapping property $f_G=f_T^{-1}$, and proofs of empirical approximation, optimization direction, and a generalization bound, all anchored by a cell decomposition of the latent space. Empirically, MEKD is reported to improve on existing black-box KD (B2KD) approaches on MNIST, CIFAR, Tiny ImageNet, and ImageNet-1K under varying data budgets and query constraints, while avoiding leakage of sensitive local data. The approach offers a privacy-preserving, data-efficient route to cloud-to-edge KD in bandwidth-limited deployments.
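As a rough illustration of what an empirical Wasserstein distance measures, the sketch below computes it for two equal-size one-dimensional samples, where it reduces to the mean gap between sorted values. This is only a toy: the function name and the sample values are made up for illustration, and the paper's $\hat{W}$ is defined over high-dimensional generated images, not scalars.

```python
def empirical_wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean absolute gap after sorting.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical scalar "features" on the teacher and student sides.
teacher_side = [0.0, 1.0, 2.0]
student_side = [2.5, 0.5, 1.5]  # same shape, shifted by 0.5, shuffled

shifted = empirical_wasserstein_1d(teacher_side, student_side)
identical = empirical_wasserstein_1d(teacher_side, teacher_side)
print(shifted, identical)
```

The distance is zero when the two samples coincide and grows as one sample is shifted away from the other, which is the property a distillation objective of this kind relies on.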

Abstract

Black-Box Knowledge Distillation (B2KD) is the problem of cloud-to-edge model compression when the data and the models hosted on the server are invisible to the edge. B2KD faces challenges such as limited Internet exchange and the disparity between edge and cloud data distributions. In this paper, we formalize a two-step workflow consisting of deprivatization and distillation, and theoretically derive a new optimization direction, from logits to cell boundary, that differs from direct logits alignment. Guided by this result, we propose a new method, Mapping-Emulation KD (MEKD), that distills a cumbersome black-box model into a lightweight one. Our method treats soft and hard responses alike, and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator, and 2) distillation: aligning low-dimensional logits of the teacher and student models by reducing the distance between their high-dimensional image points. For different teacher-student pairs, our method yields strong distillation performance on various benchmarks and outperforms the previous state-of-the-art approaches.
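The distillation step can be pictured with a small sketch that combines a logit-level term with an image-level term computed through a frozen generator. Everything named here is an assumption for illustration: `toy_generator` is a fixed linear map standing in for a trained generator, the image distance is a mean absolute difference, `image_weight` is a made-up balancing factor, and hard responses are simply turned into one-hot vectors so both response types go through the same code path. It is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes):
    """Turn a hard response (class index) into a probability vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def toy_generator(response):
    """Hypothetical frozen generator: maps C=2 responses to n=4 'pixels'."""
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
    return [sum(w * r for w, r in zip(row, response)) for row in weights]


def image_distance(a, b):
    """Mean absolute difference between two generated images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_response, student_logits, image_weight=1.0):
    """Logit-level term plus image-level term through the frozen generator."""
    student_probs = softmax(student_logits)
    logit_term = kl_divergence(teacher_response, student_probs)
    image_term = image_distance(toy_generator(teacher_response),
                                toy_generator(student_probs))
    return logit_term + image_weight * image_term


student_logits = [2.0, 0.0]
soft_loss = distillation_loss([0.9, 0.1], student_logits)       # soft response
hard_loss = distillation_loss(one_hot(0, 2), student_logits)    # hard response
matched_loss = distillation_loss(softmax(student_logits), student_logits)
print(soft_loss, hard_loss, matched_loss)
```

Both terms vanish when the student reproduces the teacher's response exactly, and the image-level term supplies a second signal that depends on the generator rather than on the logits alone.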

Paper Structure (10 sections, 6 theorems, 45 equations, 9 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Theory for Mapping-Emulation KD
Algorithm of Mapping-Emulation KD
Experiments
Experiment Setup
Performance Evaluation
Ablation Study
Extended Experiments
Conclusion

Key Result

Theorem 1

(Empirical Approximation) For any $0<\epsilon<1/2$ and any integer $m>4$, let $g:\mathbb{R}^C\rightarrow\mathbb{R}^n$ be the mapping function of generator $G$ with $n\leq\frac{20\log m}{\epsilon^2}$. For two sets $V_S=\{y_S:y_S\in \mathbb{P}_S\}$ and $V_T=\{y_T:y_T\in \mathbb{P}_T\}$, both of which then $W(\mathbb{P}_S, \mathbb{P}_T)=0$.
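To get a feel for the size of the quantity $\frac{20\log m}{\epsilon^2}$ that appears in the theorem statement, the short sketch below evaluates it for a few values. The function name is hypothetical, and the logarithm is assumed to be natural, which the statement above does not specify.

```python
import math


def dimension_threshold(m, epsilon):
    """Evaluate 20 * log(m) / epsilon**2 (natural log assumed)."""
    if not 0 < epsilon < 0.5:
        raise ValueError("epsilon must satisfy 0 < epsilon < 1/2")
    if m <= 4:
        raise ValueError("m must be an integer greater than 4")
    return 20 * math.log(m) / epsilon ** 2


for m, epsilon in [(1000, 0.1), (1000, 0.25), (50000, 0.1)]:
    print(m, epsilon, round(dimension_threshold(m, epsilon), 1))
```

The threshold grows only logarithmically in the number of points $m$ but quadratically as $\epsilon$ shrinks, so tightening $\epsilon$ dominates its size.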

Figures (9)

Figure 1: Schematic process of cloud-to-edge model compression. A cumbersome black-box model is deployed on a cloud server, trained with millions of samples and tags. The cloud server only provides APIs to receive query data and return inference responses of either soft or hard type. The edge device needs to distill a lightweight model using unlabeled local data.
Figure 2: The overall framework of MEKD. Lower left: two architectures of GAN-based KD. Upper right: the process of deprivatization. A GAN is used to synthesize images that draw high responses from the teacher model, within the distribution of data on edge devices. Lower right: the process of distillation with the frozen generator. The synthetic privacy-free images are query samples sent to the teacher model through the APIs of cloud servers. The student model is distilled by reducing the logit-level and image-level discrepancy.
Figure 3: Mapping relationships of $f_S,f_T,f_G$. If $f_S$ and $f_T$ can map $\mu$ to the same distribution $\upsilon$, then $f_S=f_T$, and if $f_G$ can map the prior distribution $p$ to $\mu$, then $f_G=f_T^{-1}$.
Figure 4: Cell $U_{\alpha}$ in the latent space is mapped via $f_G$ to an exact image $x^{(i)}$ of the same color. Moving point $x_S'$ toward $x_T'$ causes the logits $y_S$ to align with $y_T$ from a direction different from that of $\mathcal{L}_{KL}$.
Figure 5: Real images of CIFAR-10 (a) and synthetic images using MEKD with $\mathcal{L}_{IM}$ (b) and without $\mathcal{L}_{IM}$ (c).
...and 4 more figures
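The inverse-mapping relationship $f_G=f_T^{-1}$ described in the Figure 3 caption can be illustrated with a toy one-dimensional example. The sketch below is an assumption-laden stand-in: `black_box_teacher` is a hypothetical invertible function playing the role of the cloud API, and the generator is fitted by ordinary least squares on query-response pairs, whereas the paper trains a GAN-based generator.

```python
def black_box_teacher(x):
    """Hypothetical stand-in for the cloud API; treated as unknown below."""
    return 2.0 * x + 1.0


def fit_inverse(queries, responses):
    """Fit x = slope * y + intercept by least squares on (x, y) pairs."""
    n = len(queries)
    mean_x = sum(queries) / n
    mean_y = sum(responses) / n
    var_y = sum((y - mean_y) ** 2 for y in responses)
    cov = sum((y - mean_y) * (x - mean_x)
              for x, y in zip(queries, responses))
    slope = cov / var_y
    intercept = mean_x - slope * mean_y
    return lambda y: slope * y + intercept


# Query the teacher, then learn a map from responses back to inputs.
queries = [0.0, 1.0, 2.0, 3.0]
responses = [black_box_teacher(x) for x in queries]
generator = fit_inverse(queries, responses)

# If the generator emulates the inverse, teacher(generator(y)) returns y.
round_trip = black_box_teacher(generator(5.0))
print(round_trip)
```

Only query-response pairs are used to fit the generator, which mirrors the black-box setting: the teacher's parameters are never read.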

Theorems & Definitions (13)

Definition 1
Definition 2
Theorem 1
Theorem 2
Theorem 3
Definition 3
Definition 4
Theorem 4
proof
Theorem 5
...and 3 more
