Representation Transfer by Optimal Transport

Xuhong Li, Yves Grandvalet, Rémi Flamary, Nicolas Courty, Dejing Dou

TL;DR

This work tackles the problem of transferring rich representations from a fixed teacher to a student network by introducing an optimal transport–based regularizer that matches sets of neuron activations in a permutation-invariant way. By treating a layer as a neuronal ensemble and coupling activations through OT, the method integrates directly into the learning objective and supports cross-architecture transfer as well as model compression. Empirical results across transfer learning and model compression tasks show consistent gains over baselines, with notable advantages when transferring to smaller or data-scarce students. Analyses reveal that the approach captures meaningful neuron reallocation during learning and provides a practical, scalable framework for preserving functional representations in diverse learning settings.

Abstract

Learning generic representations with deep networks requires massive training samples and significant computing resources. To learn a new specific task, an important issue is to transfer the generic teacher's representation to a student network. In this paper, we propose to use a metric between representations that is based on a functional view of neurons. We use optimal transport to quantify the match between two representations, yielding a distance that embeds some invariances inherent to the representation of deep networks. This distance defines a regularizer promoting the similarity of the student's representation with that of the teacher. Our approach can be used in any learning context where representation transfer is applicable. We experiment here on two standard settings: inductive transfer learning, where the teacher's representation is transferred to a student network of the same architecture for a new related task, and knowledge distillation, where the teacher's representation is transferred to a student with a simpler architecture for the same task (model compression). Our approach also lends itself to solving new learning problems; we demonstrate this by showing how to directly transfer the teacher's representation to a student with a simpler architecture for a new related task.
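
To make the idea of a transport-based match between two representations concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the squared Euclidean cost between neurons, the uniform weights, the Sinkhorn solver, and the names `sinkhorn_plan` and `ot_representation_cost` are all assumptions chosen for illustration. Each neuron is described by its activations over a mini-batch, so reordering the neurons of a layer leaves the transport cost near zero while an index-wise comparison does not.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.02, n_iters=200):
    """Entropy-regularized OT plan between two uniform neuron distributions.

    Assumption: uniform mass on every neuron; `reg` is relative to the
    largest entry of the cost matrix.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass on each teacher neuron
    b = np.full(m, 1.0 / m)  # mass on each student neuron
    scale = max(float(cost.max()), 1e-12)
    kernel = np.exp(-cost / (reg * scale))
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def ot_representation_cost(teacher_acts, student_acts, reg=0.02):
    """Transport cost between two layers.

    Inputs have shape (batch, neurons); the two layers may differ in width.
    Assumption: cost between two neurons is the mean squared difference of
    their activations over the batch.
    """
    t = teacher_acts.T  # one row per teacher neuron
    s = student_acts.T  # one row per student neuron
    cost = ((t[:, None, :] - s[None, :, :]) ** 2).mean(axis=2)
    plan = sinkhorn_plan(cost, reg)
    return float((plan * cost).sum()), plan


rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 8))
# Same neurons as the teacher, stored at shifted positions in the layer.
student = np.roll(teacher, shift=1, axis=1)

naive_mse = float(((teacher - student) ** 2).mean())
ot_cost, plan = ot_representation_cost(teacher, student)
print(f"index-wise MSE: {naive_mse:.3f}, OT cost: {ot_cost:.6f}")
```

Because the cost matrix is rectangular whenever the two layers have different widths, the same function also accepts a narrower student, which is the situation met in model compression.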

Paper Structure

This paper contains 32 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Sketch of representation transfer by optimal transport. The trained network with frozen parameters (gray) and the network with trainable parameters (orange) process the input to compute the reference and trainable representations, $\mathbf{T}$ and $\mathbf{A}^{(t)}$ respectively. Then, the representation regularizer $\Omega_P$ is computed from these representations as in Section \ref{section:regularizers-on-representations}.
  • Figure 2: Ratios of fine-tuned neurons that match their initial activation, as estimated by the traces of the optimal transport plans, versus layer depth. The transport plans are computed, on the validation sets, between the representations formed at each layer of the pre-trained ResNet-101 and those of the model fine-tuned with $L^2$-SP.
  • Figure 3: Ratios of fine-tuned neurons that match their initial activation along training, as estimated by the traces of the optimal transport plans on each mini-batch. The plans are a by-product of the OT regularizer when fine-tuning with $\Omega_P$.
  • Figure S1: Ratios of fine-tuned neurons at the 19th layer that match their initial activation along training, as estimated by the traces of the optimal transport plans on each mini-batch.
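
The matching ratio reported in these figures can be illustrated with a small sketch. This is a toy reconstruction under stated assumptions, not the authors' code: both layers have the same width, every neuron carries uniform mass, the cost is the mean squared difference of activations, and the exact plan is obtained with SciPy's `linear_sum_assignment` (for uniform, equal-size marginals an optimal plan can be taken to be a permutation matrix scaled by 1/n). The function name `matching_ratio` is hypothetical. With total mass 1, the trace of the plan is the fraction of neurons transported onto themselves.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def matching_ratio(initial_acts, finetuned_acts):
    """Fraction of neurons that an optimal plan maps onto themselves.

    Both inputs have shape (n_samples, n_neurons) with equal n_neurons.
    Assumption: uniform mass per neuron and mean squared activation cost.
    """
    init = initial_acts.T  # one row per neuron of the initial model
    fine = finetuned_acts.T  # one row per neuron of the fine-tuned model
    cost = ((init[:, None, :] - fine[None, :, :]) ** 2).mean(axis=2)
    rows, cols = linear_sum_assignment(cost)  # exact optimal assignment
    n = cost.shape[0]
    plan = np.zeros((n, n))
    plan[rows, cols] = 1.0 / n  # transport plan with total mass 1
    return float(np.trace(plan))


rng = np.random.default_rng(0)
initial = rng.normal(size=(128, 6))
# Toy "fine-tuned" layer: small drift everywhere, neurons 0 and 1 swap roles.
finetuned = initial + 0.01 * rng.normal(size=initial.shape)
finetuned[:, [0, 1]] = finetuned[:, [1, 0]]

print(matching_ratio(initial, initial))
print(matching_ratio(initial, finetuned))
```

In this toy setting four of the six neurons keep their role, so the second ratio is lower than the first; a ratio that decreases with depth or along training would indicate that more neurons have been reallocated.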