An Information-Geometric Distance on the Space of Tasks

Yansong Gao, Pratik Chaudhari

TL;DR

This work introduces the coupled transfer distance, an information-geometric measure of task distance that jointly transports the input data distribution between source and target tasks and updates the classifier along a geodesic on the Fisher-Rao manifold. Grounded in information geometry and optimal transport, the method defines the distance as the length of the weight trajectory under the Fisher Information Metric while the task evolves via an OT map with ground metric given by the Fisher-Rao distance between conditional distributions. The paper provides an algorithm that alternates updating the OT coupling and the weight trajectory, along with practical tricks to scale to real datasets and a Rademacher-complexity perspective linking geometric length to generalization. Empirically, the coupled distance correlates with fine-tuning difficulty across MNIST, CIFAR, and DeepFashion, and benefits from higher model capacity, offering a principled, architecture-agnostic tool to assess and compare task similarity for transfer learning. Overall, the approach advances theoretical and empirical understanding of transfer difficulty by incorporating the hypothesis space into task distance and showing practical improvements in transfer outcomes.
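The coupling step described above can be illustrated with a small sketch. It is not the paper's implementation: it assumes categorical predictive distributions, uses the closed-form Fisher-Rao distance on the probability simplex, $d(p, q) = 2 \arccos \sum_i \sqrt{p_i q_i}$, as the ground cost, and swaps in entropic regularization (Sinkhorn iterations) with uniform marginals to obtain a coupling. The names `fisher_rao_distance`, `sinkhorn_coupling`, `reg`, and `n_iters` are hypothetical.

```python
import math

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return 2.0 * math.acos(min(1.0, bc))  # clamp guards against rounding above 1

def sinkhorn_coupling(cost, reg=0.1, n_iters=500):
    """Entropic OT coupling between uniform marginals (an assumed solver)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy predictive distributions p(y|x) for two source and two target samples.
source_preds = [[0.9, 0.1], [0.1, 0.9]]
target_preds = [[0.2, 0.8], [0.8, 0.2]]

# Ground cost: Fisher-Rao distance between each source/target prediction pair.
cost = [[fisher_rao_distance(p, q) for q in target_preds] for p in source_preds]
coupling = sinkhorn_coupling(cost)

# Total transport cost under the coupling.
transport_cost = sum(
    coupling[i][j] * cost[i][j] for i in range(2) for j in range(2)
)
```

In this toy case the coupling concentrates on the pairs whose predictive distributions are closest, which is the behavior one would want from a ground metric that accounts for the classifier's outputs rather than only the inputs.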

Abstract

This paper prescribes a distance between learning tasks modeled as joint distributions on data and labels. Using tools in information geometry, the distance is defined to be the length of the shortest weight trajectory on a Riemannian manifold as a classifier is fitted on an interpolated task. The interpolated task evolves from the source to the target task using an optimal transport formulation. This distance, which we call the "coupled transfer distance", can be compared across different classifier architectures. We develop an algorithm to compute the distance which iteratively transports the marginal on the data of the source task to that of the target task while updating the weights of the classifier to track this evolving data distribution. We develop theory to show that our distance captures the intuitive idea that a good transfer trajectory is the one that keeps the generalization gap small during transfer, in particular at the end on the target task. We perform thorough empirical validation and analysis across diverse image classification datasets to show that the coupled transfer distance correlates strongly with the difficulty of fine-tuning.
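As a rough illustration of "the length of a weight trajectory", the sketch below discretizes a trajectory into checkpoints and sums the lengths of the steps between them. It makes several assumptions that are not taken from the paper: the classifier is represented only through its categorical predictions on a fixed set of samples, each step is measured as the root mean square (over samples) of the Fisher-Rao distance between consecutive predictions, and the function names are hypothetical.

```python
import math

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return 2.0 * math.acos(min(1.0, bc))  # clamp guards against rounding above 1

def trajectory_length(checkpoints):
    """Discretized length of a trajectory of predictive distributions.

    checkpoints[k][i] is the predicted distribution for sample i at step k.
    Each step contributes sqrt(mean_i d_FR(p_k^i, p_{k+1}^i)^2), a
    small-step approximation of the Fisher-metric line element.
    """
    total = 0.0
    for prev, curr in zip(checkpoints, checkpoints[1:]):
        squared = [fisher_rao_distance(p, q) ** 2 for p, q in zip(prev, curr)]
        total += math.sqrt(sum(squared) / len(squared))
    return total

# Toy trajectory: one sample whose prediction moves from class 0 to class 1
# along the Fisher-Rao geodesic p(theta) = (cos^2 theta, sin^2 theta).
steps = 8
angles = [k * (math.pi / 2) / steps for k in range(steps + 1)]
checkpoints = [[[math.cos(t) ** 2, math.sin(t) ** 2]] for t in angles]

length = trajectory_length(checkpoints)
```

For this single-sample geodesic the summed step lengths agree with the direct Fisher-Rao distance between the endpoints; a trajectory that wanders off the geodesic would be longer, which is the sense in which a shortest trajectory defines a distance.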

Paper Structure

This paper contains 26 sections, 3 theorems, 32 equations, 8 figures.

Key Result

Theorem 5

Given a weight trajectory $\{w(\tau)\}_{\tau \in [0,1]}$ and a sequence $0 = \tau_0 \leq \tau_1 < \tau_2 < \dots < \tau_K \leq 1$, for all $\epsilon > 2 \sum_{k=1}^{K} (\tau_k - \tau_{k-1}) \, \mathbb{E}_{x \sim p_{\tau}} |\Delta \ell(w(\tau_{k-1}))|$, the probability that ... is greater than $\epsilon$ is upper bounded by ... We have defined $\Delta \tau_k = \tau_k - \tau_{k-1}$ ...

Figures (8)

  • Figure 1: Coupled transfer of the data and the conditional distribution. We solve an optimization problem that transports the source data distribution ${p_s}(x)$ to the target distribution ${p_t}(x)$ as $\tau \to 1$ while simultaneously updating the model using samples from the interpolated distribution ${p_\tau}(x)$. This modifies the conditional distribution $p_{w_s}(y|x)$ on the source task to the corresponding distribution on the target task $p_{w_t}(y|x)$. The "coupled transfer distance" between source and target tasks is the length of the shortest such weight trajectory under the Fisher Information Metric.
  • Figure 2: \ref{fig:CIFAR10} shows coupled transfer distance ($r$ = 0.428, $p$ = 0.13), \ref{fig:CIFAR10-Task2Vec} shows distances estimated using Task2Vec ($r$ = 0.03, $p$ = 0.98), and \ref{fig:CIFAR10_FineTune} shows fine-tuning distance ($r$ = 0.61, $p$ = 0.09 with itself). The numerical values of the distances in this figure are not comparable with each other. Coupled transfer distances satisfy certain sanity checks that Task2Vec distances do not, e.g., transferring to a subset task is easier than transferring from a subset task (CIFAR-10-vehicles/animals).
  • Figure 3: \ref{fig:CNN} shows coupled transfer distance ($r$ = 0.14, $p$ = 0.05), \ref{fig:Task2Vec} shows Task2Vec distance ($r$ = 0.07, $p$ = 0.17), \ref{fig:CNN_FineTune} shows fine-tuning distance ($r$ = 0.36, $p$ = 0.03), and \ref{fig:CNN_uncouplings} shows uncoupled transfer distance ($r$ = 0.12, $p$ = 0.47). Numerical values in the first and the last sub-plot can be compared directly. Coupled transfer broadly agrees with fine-tuning except for carnivores-flowers and herbivores-vehicles-1. For all tasks, uncoupled transfer overestimates the distances compared to \ref{fig:CNN}.
  • Figure 4: \ref{fig:Fig-4a} shows the evolution of the training and test cross-entropy loss on the interpolated distribution as a function of the transfer steps in the final iteration of coupled transfer of vehicles-1-vehicles-2. As predicted by \ref{thm:integral_of_fisher_dist_main}, the generalization gap along the trajectory is small. \ref{fig:Fig-4b} shows the convergence of the task distance with the number of iterations $k$ in \ref{eq:algorithm}; the distance typically converges in 4--5 iterations for these tasks.
  • Figure 5: \ref{fig:wrn164_ours} shows coupled transfer distance ($r$ = 0.15, $p$ = 0.01) and \ref{fig:wrn164_finetune} shows fine-tuning distance ($r$ = 0.39, $p$ = 0.01 with itself and $r$ = 0.21, $p$ = 0.20 with fine-tuning distance in \ref{fig:CNN_FineTune}). Numbers in \ref{fig:wrn164_ours} can be directly compared to those in \ref{fig:CNN}. The WRN-16-4 model has a shorter trajectory for all task pairs than the CNN in \ref{fig:CNN}, which has fewer parameters.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Remark 1: Uncoupled transfer distance entails longer weight trajectories
  • Definition 2: Coupled transfer distance
  • Remark 3: Coupled transfer distance is asymmetric
  • Remark 4: Coupled transfer distance can be compared across different architectures
  • Theorem 5
  • Theorem 6
  • Theorem 7