Training the Untrainable: Introducing Inductive Bias via Representational Alignment

Vighnesh Subramaniam, David Mayo, Colin Conwell, Tomaso Poggio, Boris Katz, Brian Cheung, Andrei Barbu

TL;DR

The paper tackles the problem of training architectures that are traditionally viewed as ill-suited for certain tasks by injecting inductive biases from a guide network through layerwise representational alignment. It introduces a guidance framework where a target network optimizes its task loss together with a neural-distance-based alignment term against a frozen guide, enabling transfer of architectural priors even from untrained guides. Empirically, guidance improves performance across tasks (e.g., FCNs on ImageNet, RNNs on language modeling, and Transformers on certain sequence tasks), often with untrained guides providing substantial benefits, and reveals how architectural priors and training dynamics interact with representation space. The work offers a versatile tool for probing priors, narrowing architectural gaps, and potentially guiding future neural-architecture design and search, while highlighting the distinction between architectural and learned inductive biases.

Abstract

We demonstrate that architectures traditionally considered ill-suited for a task can be trained using inductive biases from another architecture. We call a network untrainable when it overfits, underfits, or converges to poor results even after its hyperparameters are tuned. For example, fully connected networks overfit on object recognition, while deep convolutional networks without residual connections underfit. The traditional answer is to change the architecture to impose some inductive bias, although the nature of that bias is unknown. We introduce guidance, where a guide network steers a target network using a neural distance function. The target minimizes its task loss plus a layerwise representational-distance term measured against the frozen guide. If the guide is trained, this transfers both the architectural prior and the learned knowledge of the guide to the target. If the guide is untrained, this transfers only part of the guide's architectural prior. We show that guidance prevents FCN overfitting on ImageNet, narrows the vanilla RNN-Transformer gap, boosts plain CNNs toward ResNet accuracy, and aids Transformers on RNN-favored tasks. We further find that guidance-driven initialization alone can mitigate FCN overfitting. Our method provides a mathematical tool to investigate priors and architectures and, in the long term, could automate architecture design.
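
The objective described in the abstract (task loss plus a layerwise distance to a frozen guide) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes linear centered kernel alignment (CKA) as the neural distance, which the abstract does not specify, and the names `linear_cka`, `guidance_loss`, and `weight` are hypothetical. The activations are random placeholders rather than outputs of real networks.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    The feature widths may differ; only the sample dimension must match.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def guidance_loss(task_loss, target_acts, guide_acts, weight=1.0):
    """Task loss plus the mean layerwise distance (1 - CKA) to the guide.

    `weight` is an assumed trade-off coefficient, not a value from the paper.
    """
    distances = [1.0 - linear_cka(t, g) for t, g in zip(target_acts, guide_acts)]
    return task_loss + weight * float(np.mean(distances))


# Placeholder activations for two paired layers on a batch of 64 inputs.
rng = np.random.default_rng(0)
guide_acts = [rng.normal(size=(64, 32)), rng.normal(size=(64, 16))]
target_acts = [rng.normal(size=(64, 128)), rng.normal(size=(64, 8))]

total = guidance_loss(task_loss=2.3, target_acts=target_acts, guide_acts=guide_acts)
print(f"combined loss: {total:.3f}")
```

In actual training the same computation would live inside an automatic-differentiation framework, with gradients flowing only into the target network while the guide's parameters stay fixed. A similarity-based distance such as CKA is convenient here because it compares layers of different widths without requiring a learned projection.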

Paper Structure

This paper contains 35 sections, 19 equations, 18 figures, 6 tables, 1 algorithm.

Figures (18)

  • Figure 1: Guidance makes untrainable networks trainable via representational similarity. Given a target which cannot be trained effectively on a task, we train this target with a layerwise representational-alignment term against a fixed guide (trained or random), which remains unchanged during task training. This transfers only the guide’s architectural bias, turning a network that would otherwise overfit or underfit into one that learns effectively (e.g., a deep FCN guided by a random ResNet for image classification).
  • Figure 2: Training and validation under guidance for all experiments reported in Table \ref{tab:networks}. For every result in Table \ref{tab:image-class} and Table \ref{tab:seq-model}, we show the training and validation loss with error bars across multiple runs, although these are often too small to see. Note that the best results often occur with the untrained guide.
  • Figure 3: Guidance aligns error consistency. The relationship between the guide networks is mirrored in that of the guided networks, even when the target is entirely unlike the guides initially. This is additional evidence that guidance doesn't just improve performance arbitrarily; the target becomes more like the guide.
  • Figure 4: Initializing fully connected networks with guidance can overcome overfitting. First, we align a Deep FCN to a random ResNet-18 on noise for 300 steps, then train normally. This two-stage scheme mirrors full guidance, and leads to a similar performance improvement. This suggests that FCNs have guidance-inspired initializations that avoid overfitting.
  • Figure 5: Guidance outperforms distillation: We include a comparison between guidance and distillation for all settings with trained and untrained guide networks/teacher networks. We find that guidance outperforms distillation in all settings, highlighting that, unlike guidance, distillation fails in settings with an untrained teacher.
  • ...and 13 more figures
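
The error-consistency comparison mentioned in the Figure 3 caption can be illustrated with a short sketch. The caption does not define the metric, so this assumes one common definition: Cohen's kappa computed over per-sample correctness of two classifiers. The function name `error_consistency` and the toy correctness vectors are hypothetical.

```python
import numpy as np


def error_consistency(correct_a, correct_b) -> float:
    """Cohen's kappa over per-sample correctness of two classifiers.

    Inputs are boolean sequences where True means the model classified that
    sample correctly. Returns NaN when chance agreement is already 1.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("both models must be evaluated on the same samples")
    # Observed agreement: both right or both wrong on the same sample.
    observed = float(np.mean(a == b))
    # Agreement expected by chance from the two accuracies alone.
    acc_a, acc_b = float(a.mean()), float(b.mean())
    expected = acc_a * acc_b + (1.0 - acc_a) * (1.0 - acc_b)
    if np.isclose(expected, 1.0):
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Toy example: two models with equal accuracy but different error patterns.
model_a = [True, True, False, False]
same_errors = [True, True, False, False]
unrelated_errors = [True, False, True, False]

print(error_consistency(model_a, same_errors))       # identical error pattern
print(error_consistency(model_a, unrelated_errors))  # chance-level agreement
```

Under this definition, two networks with the same accuracy can still score near zero if they fail on different inputs, which is why the metric is useful for checking whether a guided target behaves like its guide rather than merely matching its accuracy.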