Towards Exact Computation of Inductive Bias

Akhilan Boopathy, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete

TL;DR

The proposed inductive bias metric provides an information-theoretic interpretation of the benefits of specific model architectures for certain tasks and provides a quantitative guide to developing tasks requiring greater inductive bias, thereby encouraging the development of more powerful inductive biases.

Abstract

Much research in machine learning involves finding appropriate inductive biases (e.g. convolutional neural networks, momentum-based optimizers, transformers) to promote generalization on tasks. However, quantification of the amount of inductive bias associated with these architectures and hyperparameters has been limited. We propose a novel method for efficiently computing the inductive bias required for generalization on a task with a fixed training data budget; formally, this corresponds to the amount of information required to specify well-generalizing models within a specific hypothesis space of models. Our approach involves modeling the loss distribution of random hypotheses drawn from a hypothesis space to estimate the required inductive bias for a task relative to these hypotheses. Unlike prior work, our method provides a direct estimate of inductive bias without using bounds and is applicable to diverse hypothesis spaces. Moreover, we derive approximation error bounds for our estimation approach in terms of the number of sampled hypotheses. Consistent with prior results, our empirical results demonstrate that higher dimensional tasks require greater inductive bias. We show that relative to other expressive model classes, neural networks as a model class encode large amounts of inductive bias. Furthermore, our measure quantifies the relative difference in inductive bias between different neural network architectures. Our proposed inductive bias metric provides an information-theoretic interpretation of the benefits of specific model architectures for certain tasks and provides a quantitative guide to developing tasks requiring greater inductive bias, thereby encouraging the development of more powerful inductive biases.
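
The definition in the abstract can be illustrated with a direct Monte Carlo count: sample hypotheses, measure each one's test loss, and take the negative log of the fraction that reach the target error. This is a minimal sketch, not the authors' estimator. The toy regression task, the Gaussian hypothesis distribution, the threshold `eps=0.01`, the use of natural logs (nats), and the helper name `empirical_inductive_bias` are all assumptions made for illustration.

```python
import numpy as np

def empirical_inductive_bias(losses, eps):
    """Return -log of the fraction of sampled hypotheses with loss <= eps (in nats)."""
    losses = np.asarray(losses, dtype=float)
    frac = np.mean(losses <= eps)
    if frac == 0.0:
        # No sampled hypothesis reached the target error, so counting gives no finite estimate.
        return float("inf")
    return float(-np.log(frac))

rng = np.random.default_rng(0)

# Toy task (assumption): the target is y = 2x on inputs drawn uniformly from [-1, 1].
x_test = rng.uniform(-1.0, 1.0, size=500)
y_test = 2.0 * x_test

# Toy hypothesis distribution (assumption): h_w(x) = w * x with w ~ N(0, 1).
w = rng.normal(size=10_000)

# Mean squared test error of every sampled hypothesis.
losses = np.mean((w[:, None] * x_test[None, :] - y_test[None, :]) ** 2, axis=1)

nats = empirical_inductive_bias(losses, eps=0.01)
print(f"estimated inductive bias: {nats:.2f} nats")
```

Direct counting returns infinity once no sampled hypothesis reaches the target error, which happens quickly as the well-generalizing fraction shrinks. This is one reason to model the loss distribution, as the abstract describes, rather than count hits.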

Paper Structure (26 sections, 1 theorem, 51 equations, 5 figures, 2 tables, 1 algorithm)

This paper contains 26 sections, 1 theorem, 51 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 2

Suppose we are given a hypothesis distribution $p_h$, input distribution $p_x$, loss function $L$, and desired error rate $\varepsilon$. Suppose we estimate $I(\varepsilon, p_h, p_x, L)$ by first sampling $n$ hypotheses ($h^1, h^2, \ldots, h^n$) i.i.d. from $q_h$, which is close to $p_h$ in the sense that for all $h$. We then compute the test losses of each hypothesis $\mathbb{E}_{x \sim p_x}[L(h^1, x)]$ ...

Figures (5)

  • Figure 1: An illustration of example hypothesis spaces, model classes, and specific models for a particular learning problem. Red circles indicate training points and black curves indicate hypotheses. A hypothesis space sets the broad set of models we wish to consider. In this illustration, we consider the hypothesis space of all functions and a smaller hypothesis space of band-limited functions (i.e. functions with limited maximum frequency). A model class is a set of models associated with a particular set of inductive biases. We measure the required amount of inductive bias to solve a task based on the size of the well-generalizing region within the context of a particular hypothesis space.
  • Figure 2: Illustration of how the required inductive bias for a task can be computed from the hypothesis space and the region of well-generalizing hypotheses. Black boxes indicate hypothesis spaces; $p_h$ is a uniform distribution over each box. Purple indicates regions of well-generalizing hypotheses. Inductive bias is the negative log of the fraction of hypothesis space that generalizes well: $I = -\log \frac{\text{Hypothesis space} \,\cap\, \text{Well-generalizing hypotheses}}{\text{Hypothesis space}}$. It depends on both the size of the hypothesis space and how much the hypothesis space overlaps with well-generalizing hypotheses. Different hypothesis spaces may yield different inductive bias estimates even on the same task (i.e. the same set of well-generalizing hypotheses).
  • Figure 3: Fitting a scaled non-central Chi-squared distribution to an empirical distribution of mean squared errors of models drawn from a kernel-based Gaussian RBF hypothesis space on a restricted version of MNIST. Observe that the distribution closely models the empirical distribution.
  • Figure 4: Distribution of hypothesis losses for MNIST after 5, 10, and 15 epochs of gradient descent. Note that the distribution changes very little across epochs.
  • Figure 5: Fitted scaled non-central Chi-squared distributions for the test set errors on MNIST, CIFAR-10, Omniglot, and Inverted Pendulum tasks under a Gaussian RBF kernel hypothesis space.
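
The captions for Figures 3 and 5 describe fitting a scaled non-central Chi-squared distribution to sampled losses. The sketch below shows one way such a fit could feed the $-\log$ fraction from Figure 2: fit the distribution, then evaluate its log-CDF at the target error. It is an illustration under stated assumptions, not the paper's procedure. The degrees of freedom `df` are treated as known, the remaining two parameters are matched to the sample mean and variance, the losses are synthetic, and the helper names `fit_scaled_ncx2` and `inductive_bias_from_fit` are hypothetical.

```python
import numpy as np
from scipy.stats import ncx2

def fit_scaled_ncx2(losses, df):
    """Moment-match scale * ncx2(df, nc) to the sample mean and variance.

    Uses mean = scale * (df + nc) and var = 2 * scale**2 * (df + 2 * nc),
    with df assumed known (an assumption of this sketch).
    """
    m, v = np.mean(losses), np.var(losses)
    disc = m * m - df * v / 2.0
    if disc < 0:
        raise ValueError("sample variance is too large for this df")
    scale = (m - np.sqrt(disc)) / df   # smaller root, which keeps nc >= 0
    nc = m / scale - df
    return float(nc), float(scale)

def inductive_bias_from_fit(eps, df, nc, scale):
    """Return -log P(loss <= eps) under the fitted distribution (in nats)."""
    return float(-ncx2.logcdf(eps, df, nc, scale=scale))

# Synthetic losses standing in for the test errors of sampled hypotheses (assumption).
rng = np.random.default_rng(0)
df_known = 4
losses = ncx2.rvs(df_known, 10.0, scale=0.05, size=20_000, random_state=rng)

nc_hat, scale_hat = fit_scaled_ncx2(losses, df_known)
nats = inductive_bias_from_fit(0.2, df_known, nc_hat, scale_hat)
print(f"nc={nc_hat:.2f}, scale={scale_hat:.4f}, inductive bias={nats:.2f} nats")
```

Unlike counting sampled hypotheses below the threshold, the fitted CDF can be evaluated at error rates that none of the sampled hypotheses reached, although the result is only as good as the fit in that tail.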

Theorems & Definitions (3)

  • Definition 1
  • Theorem 2
  • Proof