Discovering modular solutions that generalize compositionally

Simon Schug, Seijin Kobayashi, Yassir Akram, Maciej Wołczyk, Alexandra Proca, Johannes von Oswald, Razvan Pascanu, João Sacramento, Angelika Steger

TL;DR

The paper studies a teacher-student setting with a modular teacher in which the composition of ground truth modules is fully controlled. It shows theoretically that the modules can be identified up to a linear transformation purely from demonstrations, without having to learn an exponential number of module combinations.

Abstract

Many complex tasks can be decomposed into simpler, independent parts. Discovering such underlying compositional structure has the potential to enable compositional generalization. Despite progress, our most powerful systems struggle to compose flexibly. It therefore seems natural to make models more modular to help capture the compositional nature of many tasks. However, it is unclear under which circumstances modular systems can discover hidden compositional structure. To shed light on this question, we study a teacher-student setting with a modular teacher where we have full control over the composition of ground truth modules. This allows us to relate the problem of compositional generalization to that of identification of the underlying modules. In particular we study modularity in hypernetworks representing a general class of multiplicative interactions. We show theoretically that identification up to linear transformation purely from demonstrations is possible without having to learn an exponential number of module combinations. We further demonstrate empirically that under the theoretically identified conditions, meta-learning from finite data can discover modular policies that generalize compositionally in a number of complex environments.
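
To make the "hypernetworks representing multiplicative interactions" of the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a two-layer ReLU network whose hidden weights are a linear combination of per-module weight matrices, weighted by a task latent `z`, with a readout `a` shared across tasks. The function name `hypernetwork_forward`, the shapes, and the choice of ReLU are illustrative assumptions.

```python
import numpy as np

def hypernetwork_forward(x, z, theta, a):
    """Two-layer network whose hidden weights are a z-weighted sum of modules.

    x:     (n,)       input
    z:     (M,)       task latent that selects and mixes modules
    theta: (M, h, n)  one hidden-layer weight matrix per module (assumed layout)
    a:     (h,)       readout shared across tasks (assumed)
    """
    w = np.tensordot(z, theta, axes=1)   # (h, n): modules composed in parameter space
    hidden = np.maximum(w @ x, 0.0)      # ReLU hidden layer (assumed nonlinearity)
    return a @ hidden

rng = np.random.default_rng(0)
M, h, n = 3, 4, 5                        # illustrative sizes with M <= n
theta = rng.normal(size=(M, h, n))
a = rng.normal(size=h)
x = rng.normal(size=n)

# One task that uses only module 0, and one that mixes modules 0 and 2.
z_single = np.array([1.0, 0.0, 0.0])
z_mixed = np.array([1.0, 0.0, 1.0])
y_single = hypernetwork_forward(x, z_single, theta, a)
y_mixed = hypernetwork_forward(x, z_mixed, theta, a)
```

Because `z` multiplies the module weights before they act on `x`, the output depends on the product of task latent and input, which is the sense in which the interaction is multiplicative.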

Paper Structure (90 sections, 11 theorems, 55 equations, 15 figures, 12 tables, 2 algorithms)

Key Result

Theorem 1

Assuming that $\mathcal{P}_{\bm{z}}$ has compositional and connected support, $\mathcal{P}_{\bm{x}}$ has full support in the input space, and the student dimensions match those of the teacher, i.e. $\hat{M}=M\leq n, \hat{h}=h$, then under an additional smoothness condition on $\mathcal{P}_{\bm{z}}$ and [...], i.e. if the student optimizes $(\hat{\Theta}, \hat{\bm{a}})$ to fit the teacher on $\mathcal{P}_{

Figures (15)

  • Figure 1: A Diagram contrasting a hypernetwork as a modular architecture to a linear combination of features as a monolithic architecture. B Toy example illustrating the modular teacher-student setting where tasks are drawn from parameter module compositions of a teacher hypernetwork.
  • Figure 2: A Schematic of the multi-task teacher-student setup. B Visualization of connected and disconnected task distributions in a teacher with $M=6$, $K=2$. Without the central node, the support would be disconnected. C Module alignment between the student and teacher is high for both continuous and discrete task distributions with connected support, but not with disconnected support. D Module alignment is sensitive to overparameterization. Numbers denote the factor by which the student dimension is larger than that of the teacher. Error bars in C,D denote the standard error of the mean over 3 seeds.
  • Figure 3: A In the hyperteacher, we pick between 1 and $K$ of the $M$ teacher modules and add them in parameter space to create a task. B ANIL and MAML fail to generalize to OOD tasks regardless of the support of the training distribution, while hypernetworks achieve good OOD accuracy only when the task support is compositional. C+D When the task distribution is compositional and connected, the teacher modules can be linearly decoded from the student task embeddings. E Hypernetworks have high OOD accuracy for $K > 1$ but low OOD accuracy for $K=1$. F OOD performance of hypernetworks is sensitive to overparameterization in both the hidden dimension and module dimension. Error bars in B-F denote the standard error of the mean over 3 seeds.
  • Figure 4: A In the compositional preference grid world, the agent has modular preferences over colors and gets a reward corresponding to the current preference for the color of an object. B Disconnecting the task support increases the OOD loss of hypernetworks. C Hypernetworks achieve better OOD loss than ANIL and MAML when the task support is compositional. D In the compositional goal grid world, an agent has to walk to a target object and perform the correct target action. Goals are a composition of the maze, target object, target action and goal quadrant. E Hypernetworks achieve better OOD accuracy with respect to the optimal policy than ANIL and MAML. F When holding out one of the goal quadrants, the OOD accuracy decreases more strongly for hypernetworks than for ANIL and MAML. Error bars in B,C,F denote the standard error of the mean over 3 seeds.
  • Figure A1: A No identification. Toy example of how correct identification of individual modules can lead to inconsistent neuron permutations, preventing compositional generalization to unseen module compositions: Consider the simplified setting of a teacher and student network with two input neurons and three hidden neurons. Both the teacher and the student have $M=3$ modules. The upper right defines the teacher weights for each module. For instance, the weights denoted by $W_b$ correspond to the weights connecting neuron 2 to the input for module 1. We now assume that the student encounters three tasks during training. For each task, exactly one of the teacher modules is used. Since the ordering of neurons in MLPs is permutation invariant, the student can perfectly match the teacher weights even when it uses a different ordering of the neurons. As a result, the student modules can perfectly fit all three tasks despite permuting the neuron indices. For instance, neuron 2 in module 3 of the student contains the weights $W_g$ whereas the corresponding neuron in the teacher contains the weights $W_h$. When we now present a new task during out-of-distribution evaluation that mixes two of the modules in the teacher, the student is required to mix the weights of each neuron across these two modules as well. Since the neuron permutations in the student are inconsistent with those of the teacher, the student is unable to create the correct neuron weights and fails to generalize to the OOD task. B Linear identification. Toy example of how connected support helps ensure that neuron permutations are consistent across modules, allowing for compositional generalization to unseen module compositions: The teacher is set up identically to A. Unlike before, the training distribution now has connected support, i.e. the binary masks defining the task families share a non-zero entry. After learning, the neurons of the student are still permuted compared to the neurons of the teacher, but this time the permutation is consistent across modules, i.e. compared to the teacher only rows are permuted. As a result, when presented with a novel task from a task family that mixes modules, the student is able to match each of the teacher neurons and therefore generalizes compositionally. While this example only shows a permutation of the learned student neurons that is consistent across modules, in general the student modules will be a linear transformation of the teacher modules, hence the name linear identification. Importantly, this linear transformation is consistent across modules provided the conditions of Theorem 2 are satisfied.
  • ...and 10 more figures
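
The permutation argument in the caption of Figure A1 can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses random teacher weights with two inputs, three hidden neurons and $M=3$ modules, a ReLU hidden layer, and an all-ones readout (assumed here so that any neuron ordering fits single-module tasks). The names `forward`, `max_gap` and the specific permutations are hypothetical.

```python
import numpy as np

def forward(x, z, theta):
    """Sum of ReLU hidden units; hidden weights are a z-weighted sum of modules."""
    w = np.tensordot(z, theta, axes=1)        # (h, n)
    return np.maximum(w @ x, 0.0).sum()       # all-ones readout (assumption)

rng = np.random.default_rng(0)
M, h, n = 3, 3, 2
teacher = rng.normal(size=(M, h, n))

# Student A: each module permutes the hidden neurons differently (inconsistent).
perms = [np.array([0, 1, 2]), np.array([1, 2, 0]), np.array([2, 0, 1])]
student_inconsistent = np.stack([teacher[m][p] for m, p in enumerate(perms)])

# Student B: one permutation shared by all modules (consistent).
shared = np.array([2, 0, 1])
student_consistent = teacher[:, shared, :]

def max_gap(student, z, xs):
    """Largest absolute output difference between teacher and student on inputs xs."""
    return max(abs(forward(x, z, teacher) - forward(x, z, student)) for x in xs)

xs = rng.normal(size=(200, n))
single_tasks = np.eye(M)                      # training: exactly one module per task
mixed_task = np.array([1.0, 1.0, 0.0])        # OOD: mixes modules 0 and 1

gap_train_inconsistent = max(max_gap(student_inconsistent, z, xs) for z in single_tasks)
gap_ood_inconsistent = max_gap(student_inconsistent, mixed_task, xs)
gap_ood_consistent = max_gap(student_consistent, mixed_task, xs)
```

In this sketch both students fit every single-module task, but only the student with a shared permutation reproduces the teacher on the mixed task; the inconsistently permuted student adds together weights that belong to different teacher neurons.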

Theorems & Definitions (21)

  • Definition 3.1: Compositional support
  • Definition 3.2: Connected support
  • Theorem 1: Compositional generalization, informal
  • Theorem 2: Linear identification, informal
  • Definition A.1
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • Theorem 5
  • ...and 11 more
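
The definitions of compositional and connected support are only listed by name here. As a rough sketch of one plausible reading, based on the Figure A1 caption ("the binary masks defining the task families share a non-zero entry"), the check below treats each task family as a binary mask over modules. It assumes that support is compositional when every module appears in some mask, and connected when the graph linking masks that share a module is connected. Both function names and this reading are assumptions and may differ from Definitions 3.1 and 3.2.

```python
import numpy as np

def is_compositional(masks):
    """Assumed reading: every module is active in at least one task family."""
    masks = np.asarray(masks, dtype=bool)
    return bool(masks.any(axis=0).all())

def is_connected(masks):
    """Assumed reading: masks form a connected graph, with an edge between
    two masks whenever they share at least one active module."""
    masks = np.asarray(masks, dtype=int)
    k = len(masks)
    if k == 0:
        return False
    overlap = (masks @ masks.T) > 0           # (k, k) shared-module adjacency
    seen, stack = {0}, [0]
    while stack:                              # depth-first search from mask 0
        i = stack.pop()
        for j in np.flatnonzero(overlap[i]):
            j = int(j)
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == k

# Illustrative masks for M=6 modules with at most K=2 active modules each.
disconnected = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
])
# Adding two bridging task families links the three groups.
connected = np.vstack([disconnected, [[0, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 0]]])
```

Under these assumptions, `disconnected` covers all six modules but splits into three isolated groups, while `connected` covers all modules and forms a single component.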