
A Percolation Model of Emergence: Analyzing Transformers Trained on a Formal Language

Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert P. Dick, Hidenori Tanaka

TL;DR

This work studies an experimental system grounded in a context-sensitive formal language and finds that Transformers trained to perform tasks on strings from this language exhibit emergent capabilities. Once the model learns the language's underlying grammar and the structures that induce context-sensitivity, performance on narrower tasks suddenly begins to improve.

Abstract

An increase in data, size, or compute can lead to sudden learning of specific capabilities by a neural network, a phenomenon often called "emergence". Beyond scientific understanding, establishing the causal factors underlying such emergent capabilities is crucial to enable risk regulation frameworks for AI. In this work, we seek inspiration from the study of emergent properties in other fields and propose a phenomenological definition for the concept in the context of neural networks. Our definition implicates the acquisition of general structures underlying the data-generating process as a cause of sudden performance growth for specific, narrower tasks. We empirically investigate this definition by proposing an experimental system grounded in a context-sensitive formal language and find that Transformers trained to perform tasks on top of strings from this language indeed exhibit emergent capabilities. Specifically, we show that once the language's underlying grammar and context-sensitivity inducing structures are learned by the model, performance on narrower tasks suddenly begins to improve. We then analogize our network's learning dynamics with the process of percolation on a bipartite graph, establishing a formal phase transition model that predicts the shift in the point of emergence observed in our experiments when changing the data structure. Overall, our experimental and theoretical frameworks yield a step towards better defining, characterizing, and predicting emergence in neural networks.
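
To give a concrete feel for the percolation analogy mentioned in the abstract, the following is a minimal sketch, not the paper's model. It assumes a random bipartite graph in which each left-right edge is present independently with probability `p`, and measures the fraction of nodes in the largest connected component. The function name `largest_component_fraction` and all parameters are hypothetical. Under this random-graph assumption, the classical threshold for a giant component is around `p ≈ 1 / sqrt(n_left * n_right)`, and sweeping `p` across that value shows the sudden growth that characterizes a percolation transition.

```python
import random


def largest_component_fraction(n_left, n_right, p, seed=0):
    """Fraction of nodes in the largest connected component of a random
    bipartite graph where each left-right edge exists with probability p.
    (Illustrative assumption: edges are independent and identically likely.)"""
    rng = random.Random(seed)
    total = n_left + n_right
    parent = list(range(total))  # union-find forest over all nodes
    size = [1] * total           # component size, valid at root nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        size[ra] += size[rb]

    # Left nodes are 0..n_left-1; right nodes are offset by n_left.
    for i in range(n_left):
        for j in range(n_right):
            if rng.random() < p:
                union(i, n_left + j)

    return max(size[find(x)] for x in range(total)) / total


if __name__ == "__main__":
    n = 200
    threshold = 1.0 / n  # 1 / sqrt(n * n) for equal-sized sides
    for multiple in (0.25, 0.5, 1.0, 2.0, 4.0):
        frac = largest_component_fraction(n, n, multiple * threshold)
        print(f"p = {multiple:>4} x threshold -> largest component fraction {frac:.3f}")
```

Union-find is used here only because it keeps the sketch short; any connected-components routine would serve the same purpose.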

Paper Structure

This paper contains 62 sections, 16 equations, 63 figures, 2 algorithms.

Figures (63)

  • Figure 1: Emergence as phases of learning. Emergence is a well-characterized phenomenon in the natural sciences (Anderson, 1972; Newman et al., 2001; Newman, 2003) and deeply entangled with the notion of phase changes, i.e., when a change in some control variable (e.g., temperature) yields systematic changes in a system's underlying structure (e.g., formation of hexagonal configurations in a crystal) and simultaneously affects several of its properties. We argue for a similar characterization of emergence in machine learning: identifying systematic changes in a model's behavior that influence its downstream abilities and lead to sudden performance improvements. For example, learning a language's syntax will affect all downstream capabilities where coherent, grammatically correct generations are necessary.
  • Figure 2: Grammar and type constraints to define our formal language. (a) We use a PCFG to define our language's grammar (shown rules are examples; see App. \ref{app:llhoods} for precise details). The grammar's terminals are parts-of-speech from English and yield symbolic sentences that can be populated by tokens from the language's vocabulary. (b) Akin to natural language, wherein properties of an entity constrain sentences seen in a dataset corresponding to that entity, we define constraints (called type constraints) on our language that restrict which tokens can be seen together in a sentence. These constraints map entities to descriptive or relative properties, hence restricting which descriptive adjectives and verbs are valid for an entity. (c) Once a symbolic sentence is sampled from the grammar, we populate it with tokens from the language while respecting the type constraints. Training on strings from this language shows that the model deems sentences that do not respect type constraints to be extremely unlikely (see App. \ref{app:language}).
  • Figure 3: Task definitions. Our model is trained and evaluated on three types of tasks. (i) Free generation: the model generates sentences with correct grammar. (ii) Unscrambling: the model is provided with a set of words and must reorder them to form valid sentences. (iii) Conditional generation: the model is given a set of entities or properties and must generate valid sentences using them. Note that examples in the figure are merely indicative. See App. \ref{app:pcfgs} for details.
  • Figure 4: Learning of structures drives emergent capabilities. For a detailed discussion, see main text. (a) Grammaticality and Type Check evaluations as a function of iterations or data (iterations$\times$batch-size). We see phases in the learning dynamics corresponding to emergent acquisition of structures underlying our language: grammar (black), relative type constraints (pink), and descriptive type constraints (green), shaded gray, pink, and green respectively. (b, c) Performance on Unscrambling and Conditional Generation. After a slight delay from phase boundaries, we see sudden improvements in the performance of individual tasks. (d) Learning curves. Loss also shows sudden changes at phase boundaries corresponding to the acquisition of structures. (e) Performance on descriptive/relative sentences. Decomposing by sentence type, we find that sublinear growth in descriptive type checks drives the performance boost on descriptive sentences for the unscrambling task.
  • Figure 5: Effect of scaling number of descriptive properties. Scaling the number of descriptive properties in our language (see legend), we find relative type checks and unscrambling performance for relative sentences are essentially unaffected by the number of properties. Meanwhile, both descriptive type checks and unscrambling performance for descriptive sentences show a change in performance and a delay in transition points. Despite these effects, we find the geometry of performance curves is extremely consistent; for descriptive type checks, this geometry indicates a memorization-to-generalization picture.
  • ...and 58 more figures
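
The sampling pipeline described in the Figure 2 caption (sample a symbolic sentence from a PCFG, then populate it with tokens while respecting type constraints) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's grammar or vocabulary: the rules in `GRAMMAR`, the entities and properties in `TYPE_CONSTRAINTS`, and all function names are hypothetical, and for simplicity only the sentence's subject constrains which adjectives and verbs may appear.

```python
import random

# Toy PCFG (hypothetical rules): nonterminal -> list of (probability, expansion).
# Symbols without an entry are terminals, i.e. parts of speech.
GRAMMAR = {
    "S": [(0.5, ["NP", "COP", "ADJ"]),   # descriptive sentence
          (0.5, ["NP", "VERB", "NP"])],  # relative sentence
    "NP": [(1.0, ["NOUN"])],
}

# Toy type constraints (hypothetical): entity -> properties valid for it.
TYPE_CONSTRAINTS = {
    "dog":  {"ADJ": {"furry", "loud"}, "VERB": {"chases", "sees"}},
    "bird": {"ADJ": {"small", "loud"}, "VERB": {"sees"}},
    "rock": {"ADJ": {"heavy", "small"}, "VERB": {"hits"}},
}
ENTITIES = sorted(TYPE_CONSTRAINTS)
ALL_ADJ = set().union(*(c["ADJ"] for c in TYPE_CONSTRAINTS.values()))
ALL_VERB = set().union(*(c["VERB"] for c in TYPE_CONSTRAINTS.values()))


def sample_symbols(symbol, rng):
    """Recursively expand a symbol into a list of parts of speech."""
    if symbol not in GRAMMAR:
        return [symbol]
    probs, expansions = zip(*GRAMMAR[symbol])
    chosen = rng.choices(expansions, weights=probs)[0]
    return [pos for child in chosen for pos in sample_symbols(child, rng)]


def populate(symbols, rng):
    """Fill a symbolic sentence with tokens allowed for the chosen subject."""
    subject = rng.choice(ENTITIES)
    tokens, seen_subject = [], False
    for pos in symbols:
        if pos == "NOUN" and not seen_subject:
            tokens.append(subject)
            seen_subject = True
        elif pos == "NOUN":
            tokens.append(rng.choice(ENTITIES))  # object: any entity
        elif pos in ("ADJ", "VERB"):
            tokens.append(rng.choice(sorted(TYPE_CONSTRAINTS[subject][pos])))
        else:
            tokens.append("is")  # the only other terminal here is COP
    return tokens


def respects_type_constraints(tokens):
    """Check that every adjective and verb is valid for the subject."""
    allowed = TYPE_CONSTRAINTS.get(tokens[0])
    if allowed is None:
        return False
    for tok in tokens[1:]:
        if tok in ALL_ADJ and tok not in allowed["ADJ"]:
            return False
        if tok in ALL_VERB and tok not in allowed["VERB"]:
            return False
    return True


def sample_sentence(seed):
    rng = random.Random(seed)
    return populate(sample_symbols("S", rng), rng)


if __name__ == "__main__":
    for seed in range(5):
        sentence = sample_sentence(seed)
        print(" ".join(sentence), respects_type_constraints(sentence))
```

A checker like `respects_type_constraints` is in the spirit of the type-check evaluations the captions mention, though the paper's actual evaluation procedure may differ.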

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5