Teasing Apart Architecture and Initial Weights as Sources of Inductive Bias in Neural Networks

Gianluca Bencomo, Max Gupta, Ioana Marinescu, R. Thomas McCoy, Thomas L. Griffiths

TL;DR

This work examines how much of a neural network's inductive bias comes from its architecture and how much from its initial weights. By meta-training initial weights for MLPs, CNNs, LSTMs, and Transformers on three tasks (concept learning, modular arithmetic, and Omniglot few-shot learning), the authors show that meta-learned biases can substantially reduce cross-architecture performance differences and align learning trajectories, even across different data representations. However, robust generalization to tasks outside the meta-training distribution remains challenging: models fail catastrophically on extrapolative modular arithmetic tasks, and performance still degrades on held-out Omniglot alphabets. Overall, the results suggest that initial-weight biases can be as influential as architecture, or more so, for many tasks, but that architectural priors remain valuable for hard generalization, indicating that inductive biases are shaped by both learned initializations and structural constraints.

Abstract

Artificial neural networks can acquire many aspects of human knowledge from data, making them promising as models of human learning. But what those networks can learn depends upon their inductive biases -- the factors other than the data that influence the solutions they discover -- and the inductive biases of neural networks remain poorly understood, limiting our ability to draw conclusions about human learning from the performance of these systems. Cognitive scientists and machine learning researchers often focus on the architecture of a neural network as a source of inductive bias. In this paper we explore the impact of another source of inductive bias -- the initial weights of the network -- using meta-learning as a tool for finding initial weights that are adapted for specific problems. We evaluate four widely-used architectures -- MLPs, CNNs, LSTMs, and Transformers -- by meta-training 430 different models across three tasks requiring different biases and forms of generalization. We find that meta-learning can substantially reduce or entirely eliminate performance differences across architectures and data representations, suggesting that these factors may be less important as sources of inductive bias than is typically assumed. When differences are present, architectures and data representations that perform well without meta-learning tend to meta-train more effectively. Moreover, all architectures generalize poorly on problems that are far from their meta-training experience, underscoring the need for stronger inductive biases for robust generalization.
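
To make the idea of meta-learning initial weights concrete, the following is a minimal sketch, not the authors' implementation. It uses a first-order, MAML-style update on a toy family of linear regression tasks. The task family, the linear model, the learning rates, and all function names are assumptions chosen for illustration; the paper's actual tasks, architectures, and meta-learning settings are not reproduced here.

```python
import numpy as np

def sample_task(rng):
    """Hypothetical task family: y = a*x + b with (a, b) clustered near (3, -1)."""
    a = 3.0 + 0.1 * rng.standard_normal()
    b = -1.0 + 0.1 * rng.standard_normal()
    return a, b

def sample_points(rng, task, n):
    """Draw n labelled points from one task."""
    a, b = task
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, a * x + b

def mse_and_grad(theta, x, y):
    """Mean squared error of the model theta[0]*x + theta[1], and its gradient."""
    err = theta[0] * x + theta[1] - y
    loss = np.mean(err ** 2)
    grad = np.array([np.mean(2 * err * x), np.mean(2 * err)])
    return loss, grad

def adapt(theta, x_support, y_support, inner_lr=0.1):
    """Inner loop: one step of gradient descent on the support set."""
    _, grad = mse_and_grad(theta, x_support, y_support)
    return theta - inner_lr * grad

def meta_train(rng, steps=2000, outer_lr=0.05, n_support=5, n_query=20):
    """Outer loop: move the initial weights so that one inner step works well."""
    theta0 = np.zeros(2)
    for _ in range(steps):
        task = sample_task(rng)
        xs, ys = sample_points(rng, task, n_support)
        xq, yq = sample_points(rng, task, n_query)
        adapted = adapt(theta0, xs, ys)
        _, grad_query = mse_and_grad(adapted, xq, yq)
        # First-order approximation: ignore how `adapted` depends on theta0
        # and apply the query-set gradient directly to the initial weights.
        theta0 = theta0 - outer_lr * grad_query
    return theta0

def post_adaptation_loss(rng, theta0, n_tasks=200, n_support=5, n_query=20):
    """Average query loss after one inner step, over freshly sampled tasks."""
    losses = []
    for _ in range(n_tasks):
        task = sample_task(rng)
        xs, ys = sample_points(rng, task, n_support)
        xq, yq = sample_points(rng, task, n_query)
        loss, _ = mse_and_grad(adapt(theta0, xs, ys), xq, yq)
        losses.append(loss)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
meta_init = meta_train(rng)
loss_meta = post_adaptation_loss(rng, meta_init)
loss_zero = post_adaptation_loss(rng, np.zeros(2))
print("meta-learned init:", meta_init)
print("post-adaptation loss (meta-learned vs. zero init):", loss_meta, loss_zero)
```

In this toy setting the architecture is held fixed and only the starting point changes, so any gap between the two reported losses reflects a bias carried by the initial weights alone. That is the kind of comparison the abstract describes, here at a much smaller scale.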

Paper Structure

This paper contains 23 sections, 1 equation, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Input data for all 16 objects used in concept-learning with their bitstring and image representations.
  • Figure 2: Input data for modular arithmetic for 4 example numbers, with number, image, and bitstring representations.
  • Figure 3: Visualization of Meta-Validation curve fitting for the Odd-Even task using a meta-trained LSTM with image inputs and 20 support points. LSTMs were meta-trained on odd moduli (shown above) and meta-tested on even moduli. Step 0 (green) denotes the function before observing the support set. Step 1 (red) shows the function after one step of gradient descent. True function (blue) denotes the ground-truth modular function.
  • Figure 4: Visualization of Meta-Test curve fitting for the Odd-Even task using a meta-trained LSTM with image inputs and 20 support points. LSTMs were meta-trained on odd moduli and meta-tested on even moduli (shown above).
  • Figure 5: Visualization of Meta-Validation curve fitting for the 20-20 task using a meta-trained LSTM with image inputs and 20 support points. LSTMs were meta-trained on moduli 1-20 (shown above) and meta-tested on moduli 21-40.
  • ...and 1 more figure