Neural Coherence: Find higher performance to out-of-distribution tasks from few samples

Simon Guiroy, Mats Richter, Sarath Chandar, Christopher Pal

TL;DR

Neural Coherence introduces a data-efficient, activation-trajectory-based framework for model and data selection under distribution shift. By tracking multi-layer activation statistics across values of a training hyperparameter (e.g. the number of training iterations) and contrasting the source and target trajectories, it selects checkpoints and training data using only a few unlabeled target samples. The approach is instantiated for checkpoint selection and data selection, and validated across meta-learning, zero-shot generalization, and transfer learning on large vision models, showing substantial improvements over traditional validation-based methods and several baselines. The work highlights activation dynamics as a rich signal for generalization under domain shift and offers a practical, architecture-agnostic criterion for robust pre-training and fine-tuning decisions.

Abstract

To create state-of-the-art models for many downstream tasks, it has become common practice to fine-tune a pre-trained large vision model. However, it remains an open question how best to determine which of the many possible model checkpoints resulting from a large training run to use as the starting point. This becomes especially important when data for the target task of interest is scarce, unlabeled and out-of-distribution. In such scenarios, common methods relying on in-distribution validation data become unreliable or inapplicable. This work proposes a novel approach for model selection that operates reliably on just a few unlabeled examples from the target task. Our approach is based on a novel concept, Neural Coherence, which entails characterizing a model's activation statistics for source and target domains, allowing one to define model selection methods with high data-efficiency. We provide experiments where models are pre-trained on ImageNet1K and examine target domains consisting of Food-101, PlantNet-300K and iNaturalist. We also evaluate the approach in many meta-learning settings. Our approach significantly improves generalization across these different target domains compared to established baselines. We further demonstrate the versatility of Neural Coherence as a powerful principle by showing its effectiveness in training data selection.

Paper Structure

This paper contains 23 sections, 22 equations, 23 figures, 5 tables.

Figures (23)

  • Figure 1: (component 1) For a neural network $f$ parametrized by $\theta$ and a set of inputs $\mathbf{x}$ drawn from a given data distribution $p(\mathbf{x})$, we analyze the distribution of their activations $\mathbf{z}$ across the network.
  • Figure 2: (component 2) We characterize the empirical distribution of activations $p(\mathbf{z})$ with a transformation $\psi$ that maps it to a low-dimensional point $\psi(\mathbf{z})$ in a vector space.
  • Figure 3: (component 3) By sweeping a training hyperparameter $\Omega$ (e.g. total number of training iterations, learning rate, batch size), we obtain a sequence of trained (or partially trained) models with parameter vectors $\theta_{\Omega_i}^*$, one for each value $\Omega_i$. For the same set of inputs $\mathbf{x}$, each such model yields a multivariate point $\psi(\mathbf{z}_{\Omega_i})$ characterizing its activation distribution. The sequence of these points constitutes the "neural activation trajectory" $\psi(\mathbf{z}; \Omega)$ along the dimension $\Omega$.
  • Figure 4: (component 4) Consider a model trained on source data $\mathbf{x}_{source}$ that is to be deployed on a downstream or out-of-distribution task with target data $\mathbf{x}_{target}$. Given access to a few unlabeled examples from $\mathbf{x}_{target}$, we perform a contrastive analysis of the target activation trajectory $\psi(\mathbf{z}_T; \Omega)$ by comparing it with the source activation trajectory $\psi(\mathbf{z}_S; \Omega)$.
  • Figure 5: (component 5) The value of the hyperparameter $\Omega$ that yields the optimal target loss is inferred to be $\Omega^*$, the value before which the target and source activation trajectories remain coherent and after which they diverge, i.e. go in different directions according to a metric $d$. The model trained with $\Omega^*$ is inferred to give the optimal target loss $\mathcal{L}^*_T$ for the target task. The intuition: before $\Omega^*$, the source and target loss curves $\mathcal{L}_S(\Omega)$ and $\mathcal{L}_T(\Omega)$ remain coherent, but they diverge after $\Omega^*$, and this should be reflected in their activation trajectories $\psi(\mathbf{z}_S; \Omega)$ and $\psi(\mathbf{z}_T; \Omega)$.
  • ...and 18 more figures
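
The five components described in the captions above can be illustrated with a small toy sketch. This is not the authors' implementation: the choice of $\psi$ (per-layer mean and standard deviation), the choice of $d$ (cosine similarity between consecutive trajectory steps), the divergence threshold of 0, and all function names (`psi`, `trajectory`, `select_checkpoint`) are assumptions made purely for illustration, and the activations are synthetic arrays rather than outputs of a real network.

```python
import numpy as np

def psi(layer_activations):
    """Hypothetical psi: summarize each layer's activations by (mean, std)."""
    feats = []
    for z in layer_activations:
        feats.extend([float(np.mean(z)), float(np.std(z))])
    return np.array(feats)

def trajectory(activations_per_checkpoint):
    """Stack psi(z_{Omega_i}) over checkpoints: shape (n_checkpoints, dim)."""
    return np.stack([psi(acts) for acts in activations_per_checkpoint])

def select_checkpoint(source_traj, target_traj, threshold=0.0):
    """Return the index of the last checkpoint before the source and target
    trajectories stop moving in the same direction (assumed criterion:
    cosine similarity of consecutive steps <= threshold)."""
    d_src = np.diff(source_traj, axis=0)  # step vectors between checkpoints
    d_tgt = np.diff(target_traj, axis=0)
    for i, (a, b) in enumerate(zip(d_src, d_tgt)):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        cos = float(a @ b / denom) if denom > 0 else 0.0
        if cos <= threshold:
            return i  # step i -> i+1 is the first divergent one
    return len(source_traj) - 1  # never diverged: keep the last checkpoint

# Synthetic activations for 6 checkpoints and 2 layers (toy data only).
rng = np.random.default_rng(0)
base_src = [rng.normal(size=(8, 4)), rng.normal(size=(8, 4))]
base_tgt = [rng.normal(size=(3, 4)), rng.normal(size=(3, 4))]  # few target samples

def source_acts(i):
    # Source activations drift steadily in one direction as training proceeds.
    return [base_src[0] + 0.1 * i, base_src[1] + 0.05 * i]

def target_acts(i):
    # Target activations follow the source until checkpoint 3, then reverse.
    shift = min(i, 3) - max(i - 3, 0)
    return [base_tgt[0] + 0.1 * shift, base_tgt[1] + 0.05 * shift]

src_traj = trajectory([source_acts(i) for i in range(6)])
tgt_traj = trajectory([target_acts(i) for i in range(6)])
best = select_checkpoint(src_traj, tgt_traj)
print("selected checkpoint index:", best)
```

In this toy setup the target trajectory is constructed to reverse direction after checkpoint 3, so the sketch selects index 3. With a real model, the per-checkpoint activation lists would come from forward passes on source data and on the few unlabeled target examples, and the summary statistic and divergence criterion would need to follow the definitions given in the paper itself.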