Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics

Yi Ren, Danica J. Sutherland

TL;DR

Compositional mappings are shown to be the simplest bijections in terms of coding length (i.e., an upper bound on their Kolmogorov complexity), which explains why models that learn such mappings can generalize well.

Abstract

Learning compositional mappings is important for a model to generalize well compositionally. To better understand when and how to encourage a model to learn such mappings, we study their uniqueness from different perspectives. Specifically, we first show that compositional mappings are the simplest bijections in terms of coding length (i.e., an upper bound on their Kolmogorov complexity). This property explains why models with such mappings can generalize well. We further show that the simplicity bias is usually an intrinsic property of neural network training via gradient descent, which partially explains why some models spontaneously generalize well when they are trained appropriately.
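To make the coding-length intuition concrete, the sketch below counts how many lookup-table entries are needed to write down a bijection over attribute tuples. This is only a crude proxy chosen for illustration, not the coding scheme used in the paper: the names `factor_tables` and `table_entries` are hypothetical, and the check assumes the simplified case where output position i depends only on input position i.

```python
from itertools import product


def factor_tables(mapping, num_attrs):
    """Return one lookup table per position if the mapping factorizes, else None.

    Assumption (simplification): output position i may depend only on
    input position i. Re-ordered positions are not considered here.
    """
    tables = [{} for _ in range(num_attrs)]
    for x, z in mapping.items():
        for i in range(num_attrs):
            # setdefault stores z[i] the first time x[i] is seen and
            # returns the stored value afterwards, so a mismatch means
            # position i cannot be described by a single small table.
            if tables[i].setdefault(x[i], z[i]) != z[i]:
                return None
    return tables


def table_entries(mapping, num_attrs):
    """Crude description-length proxy: number of table entries needed."""
    tables = factor_tables(mapping, num_attrs)
    if tables is not None:
        return sum(len(t) for t in tables)  # one small table per attribute
    return len(mapping)  # fall back to listing every input/output pair


num_attrs, num_values = 3, 4
objects = list(product(range(num_values), repeat=num_attrs))

# Compositional bijection: every attribute value is remapped independently.
compositional = {x: tuple((v + 1) % num_values for v in x) for x in objects}

# Holistic bijection: the same mapping with two outputs swapped, which
# breaks the per-attribute structure while keeping it a bijection.
holistic = dict(compositional)
a, b = objects[0], objects[-1]
holistic[a], holistic[b] = holistic[b], holistic[a]

print(table_entries(compositional, num_attrs))  # 12
print(table_entries(holistic, num_attrs))       # 64
```

Under this proxy the compositional mapping costs one entry per attribute value (3 × 4 = 12), while the holistic one must list all 4³ = 64 pairs, and the gap grows exponentially with the number of attributes.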

Paper Structure

This paper contains 11 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: The compositional generalization problem (a, b) and two types of bijections (c, d).
  • Figure 2: Evidence for, and explanations of, the claim that simpler mappings are learned faster. The blue arrows in the last panel indicate that, when learning the given example, the model increases its confidence in the corresponding prediction. The increase is stronger for the darker arrows than for the lighter ones, because the confidence changes for the lighter ones are caused indirectly by the "elasticity" of the neural network [He2020elasticity]. For the holistic mapping, the yellow arrows pointing down indicate that the corresponding confidence decreases after learning the example.
  • Figure 3: Four typical mappings studied in this paper and their coding strings.
  • Figure 4: Experiments for the one-hot vector inputs.
  • Figure 5: Experiments for the vision inputs.