Photonic Quantum-Enhanced Knowledge Distillation

Kuan-Cheng Chen, Shang Yu, Chen-Yu Liu, Samuel Yen-Chi Chen, Huan-Hsin Tseng, Yen Jui Chang, Wei-Hao Huang, Felix Burt, Esperanza Cuenca Gomez, Zohim Chandani, William Clements, Ian Walmsley, Kin K. Leung

Abstract

Photonic quantum processors naturally produce intrinsically stochastic measurement outcomes, offering a hardware-native source of structured randomness that can be exploited during machine-learning training. Here we introduce Photonic Quantum-Enhanced Knowledge Distillation (PQKD), a hybrid quantum photonic--classical framework in which a programmable photonic circuit generates a compact conditioning signal that constrains and guides a parameter-efficient student network during distillation from a high-capacity teacher. PQKD replaces fully trainable convolutional kernels with dictionary convolutions: each layer learns only a small set of shared spatial basis filters, while sample-dependent channel-mixing weights are derived from shot-limited photonic features and mapped through a fixed linear transform. Training alternates between standard gradient-based optimisation of the student and sampling-robust, gradient-free updates of photonic parameters, avoiding differentiation through photonic hardware. Across MNIST, Fashion-MNIST and CIFAR-10, PQKD traces a controllable compression--accuracy frontier, remaining close to teacher performance on simpler benchmarks under aggressive convolutional compression. Performance degrades predictably with finite sampling, consistent with shot-noise scaling, and exponential moving-average feature smoothing suppresses high-frequency shot-noise fluctuations, extending the practical operating regime at moderate shot budgets.
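
As a rough illustration of the dictionary-convolution idea described in the abstract, the NumPy sketch below assembles a convolution kernel from a few shared spatial basis filters and channel-mixing weights obtained by passing a conditioning vector through a fixed linear map. The function name `dictionary_kernel`, the arrays `basis` and `mix_matrix`, all shapes, and the choice of one mixing coefficient per channel pair and basis filter are assumptions made for illustration; the abstract does not specify the exact parameterisation, and the random vector `z` merely stands in for a photonic feature.

```python
import numpy as np

def dictionary_kernel(basis, mix_matrix, z, c_out, c_in):
    """Assemble a conv kernel from R shared spatial basis filters.

    basis:      (R, k, k) trainable spatial filters shared by all channel pairs
    mix_matrix: (c_out * c_in * R, d) fixed, non-trainable linear map (assumed form)
    z:          (d,) conditioning feature (stand-in for the photonic feature)
    """
    R = basis.shape[0]
    # Channel-mixing weights: a fixed linear transform of the conditioning vector.
    coeffs = (mix_matrix @ z).reshape(c_out, c_in, R)
    # W[o, i] = sum_r coeffs[o, i, r] * basis[r]
    return np.einsum("oir,rhw->oihw", coeffs, basis)

rng = np.random.default_rng(0)
c_out, c_in, k, R, d = 8, 4, 3, 4, 6          # toy sizes, chosen arbitrarily

basis = rng.normal(size=(R, k, k))
mix_matrix = rng.normal(size=(c_out * c_in * R, d)) / np.sqrt(d)
z = rng.normal(size=d)

W = dictionary_kernel(basis, mix_matrix, z, c_out, c_in)

# Trainable parameters in this layer: full kernel vs. basis filters only.
trainable_full = c_out * c_in * k * k
trainable_dict = R * k * k
print(W.shape, trainable_full, trainable_dict)
```

In this toy layer only the `R * k * k` basis entries would be trained, while the kernel seen by the convolution still has the full `(c_out, c_in, k, k)` shape; the saving grows with the channel counts because the basis size does not depend on them.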

Paper Structure (120 sections, 99 equations, 5 figures, 3 tables, 3 algorithms)

Figures (5)

  • Figure 1: Hardware--algorithm co-design for PQKD enabling neural network compression. A CV photonic module prepares a fixed input state $\rho_{\mathrm{in}}$ and applies a programmable unitary $U(\theta)$ implemented by an integrated interferometric mesh. The output state $\rho_\theta = U(\theta)\rho_{\mathrm{in}}U^\dagger(\theta)$ is measured using an SNSPD array with time-tagging, producing i.i.d. samples $\{\omega_s\}_{s=1}^S$ and an empirical distribution $\hat{p}_\theta$. A robust classical feature extractor $\Phi(\cdot)$ maps $\hat{p}_\theta$ to a fixed-length conditioning vector $z(\theta)=\Phi(\hat{p}_\theta)\in\mathbb{R}^{d}$, which modulates the student model $f_S$ during knowledge distillation from a pretrained teacher $f_T$. Crucially, $\theta$ is updated by a classical optimiser using a distillation objective $L_{\mathrm{KD}}(\omega,\theta)$ and requires sampling only, avoiding differentiation through photonic hardware.
  • Figure 2: Training and validation dynamics across datasets. Mean $\pm$ s.d. over five independent runs comparing the teacher (green) and PQKD student (blue) on (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10. Top row: training accuracy. Middle row: validation accuracy. Bottom row: cross-entropy (CE) loss for training (solid) and validation (dashed).
  • Figure 3: Scaling photonic-conditioned convolutional compression across network scope and teacher capacity on MNIST. Validation accuracy drop relative to the corresponding teacher in percentage points, plotted against the overall compression factor of the compressed network (bottom axis; $\texttt{compression\_x}$). Columns report the compression scope: Conv1 (left), Conv1+Conv2 (middle), and All Convs (right). Rows correspond to three teacher widths, $(c_1,c_2,c_3)\in\{(32,64,128),(48,96,128),(64,128,128)\}$, with each row aggregating all configurations for that teacher. Marker shape encodes the photonic parameter dimension $\dim(\theta)\in\{15,30,45\}$, and marker size encodes the kernel-manifold rank $R\in\{4,8,12\}$. Faint markers show individual seeds, while opaque markers and vertical bars indicate the mean and one standard deviation across seeds.
  • Figure 4: Shot-limited scaling of the photonic conditioning signal in PQKD. Increase in test accuracy due to the photonic feature, $\delta\,\mathrm{test\ acc} \equiv \mathrm{acc}(z)-\mathrm{acc}(z{=}0)$, as a function of the photonic measurement budget $S$ (shots), comparing EMA aggregation of the feature across epochs (blue) with no EMA (orange). Solid curves show the mean over random seeds and shaded bands denote $\pm 1$ s.d. The dashed curves are fits to the shot-noise model $\delta(S)=\delta_\infty-k/\sqrt{S}$, consistent with multinomial histogram fluctuations propagating through a Lipschitz feature map, which predicts a $1/\sqrt{S}$ decay of stochastic feature perturbations and a corresponding saturation of performance at large $S$.
  • Figure S1: EMA suppresses shot-noise fluctuations in the photonic conditioning signal. Feature traces recorded during PQKD training at fixed shot budget compare the raw per-epoch photonic feature (pre-EMA) with the EMA-smoothed feature used by the student. (a) A representative feature coordinate illustrates attenuation of high-frequency sampling jitter under EMA. (b) Windowed value distributions across feature dimensions for the raw and EMA-used signals; the dashed curve indicates the removed residual $(z_{\mathrm{raw}}-z_{\mathrm{EMA}})$. (c) Empirical CDF of the per-dimension variance ratio $\mathrm{Var}(z_{\mathrm{EMA}})/\mathrm{Var}(z_{\mathrm{raw}})$ with the expected attenuation level shown for reference.
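
The shot-noise and EMA statements in the captions of Figure 4 and Figure S1 can be checked on toy data. The sketch below is a minimal simulation under stated assumptions, not the paper's pipeline: the outcome distribution `p`, the shot budgets, the EMA coefficient `beta`, and the values `delta_inf_true` and `k_true` are all synthetic and chosen only for illustration. It (i) samples multinomial histograms and estimates how the histogram error scales with $S$, (ii) fits the model $\delta(S)=\delta_\infty-k/\sqrt{S}$ by ordinary least squares in the variable $1/\sqrt{S}$, and (iii) compares the variance ratio of an EMA-smoothed i.i.d. noise trace with the textbook stationary value $(1-\beta)/(1+\beta)$, which may or may not be the reference level used in Figure S1.

```python
import numpy as np

rng = np.random.default_rng(0)

# (i) Multinomial histogram error versus shot budget S.
p = np.array([0.4, 0.3, 0.2, 0.1])             # toy outcome distribution (assumed)
shot_budgets = np.array([100, 400, 1600, 6400])
rms_error = []
for S in shot_budgets:
    # Empirical distribution from S i.i.d. samples, repeated 2000 times.
    hists = rng.multinomial(S, p, size=2000) / S
    errs = np.linalg.norm(hists - p, axis=1)
    rms_error.append(np.sqrt(np.mean(errs ** 2)))
rms_error = np.array(rms_error)
# Log-log slope; a value near -1/2 corresponds to 1/sqrt(S) scaling.
slope = np.polyfit(np.log(shot_budgets), np.log(rms_error), 1)[0]

# (ii) Fit delta(S) = delta_inf - k / sqrt(S); the model is linear in x = 1/sqrt(S).
delta_inf_true, k_true = 2.0, 10.0             # synthetic values, not from the paper
x = 1.0 / np.sqrt(shot_budgets)
delta = delta_inf_true - k_true * x            # noise-free synthetic "measurements"
fit_slope, fit_intercept = np.polyfit(x, delta, 1)
k_fit, delta_inf_fit = -fit_slope, fit_intercept

# (iii) EMA smoothing of an i.i.d. noise trace.
beta = 0.9                                     # assumed EMA coefficient
raw = rng.normal(size=(20000, 8))              # per-step raw features (pure noise)
ema = np.empty_like(raw)
ema[0] = raw[0]
for t in range(1, len(raw)):
    ema[t] = beta * ema[t - 1] + (1 - beta) * raw[t]
burn = 200                                     # discard the initial transient
var_ratio = ema[burn:].var(axis=0).mean() / raw[burn:].var(axis=0).mean()
expected_ratio = (1 - beta) / (1 + beta)       # stationary value for i.i.d. input

print(slope, k_fit, delta_inf_fit, var_ratio, expected_ratio)
```

Because the model in (ii) is linear in $1/\sqrt{S}$, no nonlinear optimiser is needed; with noisy accuracy measurements the same fit would simply return least-squares estimates of $\delta_\infty$ and $k$.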