Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics

Yi Ren; Danica J. Sutherland

TL;DR

Compositional mappings are shown to be the simplest bijections as measured by coding length (an upper bound on their Kolmogorov complexity), which explains why models that learn such mappings can generalize well.

Abstract

Obtaining compositional mappings is important for a model to generalize well compositionally. To better understand when and how to encourage a model to learn such mappings, we study their uniqueness from several perspectives. Specifically, we first show that compositional mappings are the simplest bijections through the lens of coding length (i.e., an upper bound on their Kolmogorov complexity). This property explains why models with such mappings can generalize well. We further show that this simplicity bias is usually an intrinsic property of neural network training via gradient descent, which partially explains why some models spontaneously generalize well when they are trained appropriately.
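As a rough illustration of the coding-length claim, the sketch below counts how many bits are needed to single out one bijection from each family under a uniform code. This is a simple counting argument, not the paper's own coding scheme: the function names, the two-attribute setup, and the choice of 4 values per attribute are all assumptions made for illustration.

```python
import math


def compositional_bits(num_attrs: int, num_values: int) -> float:
    """Bits to pick one compositional bijection (illustrative assumption).

    Each attribute gets its own permutation of `num_values` symbols, and the
    attributes may be assigned to message positions in any order.
    """
    per_attribute = math.log2(math.factorial(num_values))
    position_order = math.log2(math.factorial(num_attrs))
    return num_attrs * per_attribute + position_order


def holistic_bits(num_attrs: int, num_values: int) -> float:
    """Bits to pick one arbitrary bijection over all objects.

    There are num_values ** num_attrs objects, and any permutation of them
    is allowed, so the table must be listed entry by entry.
    """
    num_objects = num_values ** num_attrs
    return math.log2(math.factorial(num_objects))


if __name__ == "__main__":
    # Hypothetical toy setting: 2 attributes, 4 values each (16 objects).
    comp = compositional_bits(2, 4)
    holi = holistic_bits(2, 4)
    print(f"compositional: {comp:.2f} bits, holistic: {holi:.2f} bits")
```

Under these assumptions the compositional family is far cheaper to describe, and the gap widens as the number of attributes or values grows, because the holistic count scales with the factorial of the total number of objects.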

Paper Structure (11 sections, 4 equations, 5 figures, 1 table)

Introduction
Compositional Mappings are the Simplest Bijections
Simpler Mappings are Learned Faster
Conclusion
Compositional Representation and Platonic Representation Hypothesis
The Underlying Assumption of the Ground-truth Generating Mechanism
Measuring Metrics: Kernel Alignment, Disentanglement, and Topological Similarity
The Converging Pressures
Coding Length and Topological Similarity for the Mappings in Toy256
Experimental Settings
More Experimental Results

Figures (5)

Figure 1: The compositional generalization problem (a, b) and two types of bijections (c, d).
Figure 2: Evidence for, and explanations of, the claim that simpler mappings are learned faster. In the last panel, the blue arrows indicate that learning the given example increases the model's confidence in the corresponding prediction. Darker arrows mark stronger increases than lighter ones, because the confidence changes for the lighter arrows are caused indirectly by the "elasticity" of the neural network (He2020elasticity). For the holistic mapping, the yellow downward arrows indicate that the corresponding confidence decreases after learning the example.
Figure 3: Four typical mappings studied in this paper and their coding strings.
Figure 4: Experiments for the one-hot vector inputs.
Figure 5: Experiments for the vision inputs.
