Training the Untrainable: Introducing Inductive Bias via Representational Alignment
Vighnesh Subramaniam, David Mayo, Colin Conwell, Tomaso Poggio, Boris Katz, Brian Cheung, Andrei Barbu
TL;DR
The paper tackles the problem of training architectures that are traditionally viewed as ill-suited for certain tasks by injecting inductive biases from a guide network through layerwise representational alignment. It introduces a guidance framework where a target network optimizes its task loss together with a neural-distance-based alignment term against a frozen guide, enabling transfer of architectural priors even from untrained guides. Empirically, guidance improves performance across tasks (e.g., FCNs on ImageNet, RNNs on language modeling, and Transformers on certain sequence tasks), often with untrained guides providing substantial benefits, and reveals how architectural priors and training dynamics interact with representation space. The work offers a versatile tool for probing priors, narrowing architectural gaps, and potentially guiding future neural-architecture design and search, while highlighting the distinction between architectural and learned inductive biases.
Abstract
We demonstrate that architectures traditionally considered ill-suited for a task can be trained using inductive biases from another architecture. We call a network untrainable when it overfits, underfits, or converges to poor results even when its hyperparameters are tuned. For example, fully connected networks overfit on object recognition, while deep convolutional networks without residual connections underfit. The traditional answer is to change the architecture to impose some inductive bias, although the nature of that bias is unknown. We introduce guidance, where a guide network steers a target network using a neural distance function. The target minimizes its task loss plus a layerwise representational distance to the frozen guide. If the guide is trained, this transfers both the architectural prior and the learned knowledge of the guide to the target. If the guide is untrained, this transfers only part of the architectural prior of the guide. We show that guidance prevents FCN overfitting on ImageNet, narrows the vanilla RNN-Transformer gap, boosts plain CNNs toward ResNet accuracy, and aids Transformers on RNN-favored tasks. We further identify that guidance-driven initialization alone can mitigate FCN overfitting. Our method provides a mathematical tool to investigate priors and architectures, and in the long term, could automate architecture design.
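The guidance objective described in the abstract (task loss plus a layerwise alignment term against a frozen guide) can be illustrated with a minimal sketch. It assumes linear centered kernel alignment (CKA) as the neural distance function, which is chosen here for illustration and may differ from the exact measure used in the paper. The names `linear_cka`, `guidance_loss`, and `weight` are hypothetical, and the activations are plain NumPy arrays standing in for the outputs of paired layers.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x: (n_samples, d1) activations from a target layer.
    y: (n_samples, d2) activations from the paired guide layer.
    Returns a similarity in [0, 1]; 1 means the representations
    match up to rotation and isotropic scaling.
    """
    # Center each feature over the batch.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

def guidance_loss(task_loss, target_acts, guide_acts, weight=1.0):
    """Task loss plus a summed layerwise distance (1 - CKA).

    target_acts, guide_acts: lists of activation matrices for the
    paired layers. The guide is frozen, so its activations are
    treated as constants. `weight` is an assumed scaling factor.
    """
    distances = [1.0 - linear_cka(t, g)
                 for t, g in zip(target_acts, guide_acts)]
    return task_loss + weight * sum(distances)

# Example with random stand-in activations for two paired layers.
rng = np.random.default_rng(0)
target = [rng.normal(size=(32, 8)), rng.normal(size=(32, 16))]
guide = [rng.normal(size=(32, 4)), rng.normal(size=(32, 16))]
total = guidance_loss(task_loss=0.5, target_acts=target, guide_acts=guide)
```

Because CKA compares batch-level similarity structure rather than individual units, the paired layers can have different widths, which is what allows a guide and a target of different architectures to be aligned. In an automatic differentiation framework, gradients would flow only through the target's activations, consistent with the guide being frozen.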
