Rethinking Knowledge Transfer in Learning Using Privileged Information

Danil Provodin; Bram van den Akker; Christina Katsimerou; Maurits Kaptein; Mykola Pechenizkiy

TL;DR

This work critically reevaluates learning using privileged information (LUPI) by examining the theoretical underpinnings and empirical claims of knowledge transfer from privileged information (PI). It analyzes two main PI-transfer mechanisms, knowledge distillation and TRAM-like marginalization, and shows that reported gains are often driven by strong assumptions and dataset-specific conditions rather than by PI itself. Extensive experiments reveal no robust PI transfer across multiple real-world datasets. The authors demonstrate that improvements frequently arise from training dynamics or architectural changes rather than from PI, and that claims of faster learning rates or better sample efficiency under PI are not generally supported. They call for cautious adoption of PI and urge the development of rigorous theoretical and empirical criteria for demonstrating genuine PI-induced transfer before LUPI is applied in practice.

Abstract

In supervised machine learning, privileged information (PI) is information that is accessible during training but unavailable at inference. Research on learning using privileged information (LUPI) aims to transfer the knowledge captured in PI onto a model that can perform inference without PI. It seems that this extra bit of information ought to make the resulting model better. However, finding conclusive theoretical or empirical evidence that supports the ability to transfer knowledge using PI has been challenging. In this paper, we critically examine the assumptions underlying existing theoretical analyses and argue that there is little theoretical justification for when LUPI should work. We analyze LUPI methods and reveal that apparent improvements in empirical risk reported in existing research may not directly result from PI. Instead, these improvements often stem from dataset anomalies or modifications in model design that are mistakenly attributed to PI. Our experiments across a wide variety of application domains further demonstrate that state-of-the-art LUPI approaches fail to effectively transfer knowledge from PI. Thus, we advocate for practitioners to exercise caution when working with PI to avoid unintended inductive biases.
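To make the distillation route to PI transfer concrete, here is a minimal sketch of a per-example generalized-distillation loss in the style of Lopez-Paz et al. (2016). It is an illustration under stated assumptions, not the authors' code: the function names, the imitation weight `lam`, and the `temperature` value are all hypothetical choices, and the teacher logits are assumed to come from a model that saw PI during training.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def distillation_loss(student_logits, teacher_logits, label,
                      lam=0.5, temperature=2.0):
    """(1 - lam) * CE(hard label, student) + lam * CE(teacher soft label, student).

    Assumption: lam and temperature are illustrative hyperparameters,
    not values taken from the paper.
    """
    num_classes = len(student_logits)
    hard = [1.0 if k == label else 0.0 for k in range(num_classes)]
    soft = softmax(teacher_logits, temperature)  # teacher was trained with PI
    pred = softmax(student_logits)               # student sees regular features only
    return (1.0 - lam) * cross_entropy(hard, pred) + lam * cross_entropy(soft, pred)


# Hypothetical two-class example: a confident teacher and an undecided student.
print(distillation_loss([0.0, 0.0], [4.0, -4.0], label=0))
```

With `lam=0` the teacher term vanishes and the loss reduces to ordinary cross-entropy on the hard labels, which is the no-PI baseline. Any comparison against that baseline is only informative if both models are trained to convergence with the same architecture.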

Paper Structure (30 sections, 15 equations, 7 figures, 3 tables)

Introduction
Our contribution
Paper outline
Knowledge transfer in LUPI
Knowledge distillation
Marginalization and weight sharing
When is knowledge transfer in LUPI proven theoretically?
What does existing empirical evidence show?
Generalized distillation
Synthetic experiments from Lopez-Paz et al. (2016)
MNIST experiment from Lopez-Paz et al. (2016)
Further discussion on knowledge distillation using PI
Revisiting TRAM
Why TRAM does not leverage PI
Real-world applications
...and 15 more sections

Figures (7)

Figure 1: The effect of sufficient training epochs on the MNIST generalized distillation experiment.
Figure 2: TRAM zeros, TRAM, and no-PI for \ref{undertrained} insufficient training and \ref{sufficient} sufficient training. The numbers in the legend indicate MSE loss with respect to the noise-free function.
Figure 3: TRAM and no-PI training dynamics for the synthetic experiment from Eq. \ref{eq:tram_synthetic_regression}. (Left) presents training dynamics over 200 epochs. (Right) shows the resulting models' performances across varying sample sizes trained for 200 epochs. "Uncorrupted" corresponds to a regular model fitted to uncorrupted data $y = \sin (2 \pi x) + \epsilon$.
Figure 4: Training dynamics of No PI, TRAM, Gen. dist., and Teacher for 4 real-world datasets, averaged over 10 runs. (Top row) shows the performance metric on the test set (normalized ROC AUC score for the Repeat Buyers and Heart Disease datasets and accuracy for the NASA-NEO and Smoker or Drinker datasets). (Bottom row) shows cross-entropy loss on the test set.
Figure 5: Reproducing the SARCOS experiment with the teacher replaced with $g_t = 0$.
...and 2 more figures