The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization

Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

TL;DR

The paper introduces the Neural Data Router (NDR), a Transformer-based architecture augmented with a copy gate and geometric attention to enable adaptive control flow and data routing across Transformer columns. The copy gate lets a column pass its input through a layer unchanged until its data dependencies are satisfied, and geometric attention biases each position toward the closest matching source. With both modifications, NDR achieves 100% length generalization accuracy on compositional table lookup (CTL) and near-perfect accuracy on simple arithmetic and on a ListOps variant that tests generalization across computational depths. Its gating and attention patterns tend to be interpretable as an intuitive form of neural routing. The work argues for a bottom-up, architecture-driven approach to generalization and releases code for replication and further exploration of data routing in Transformer networks.

Abstract

Despite progress across a broad range of applications, Transformers have limited success in systematic generalization. The situation is especially frustrating in the case of algorithmic tasks, where they often fail to find intuitive solutions that route relevant information to the right node/operation at the right time in the grid represented by Transformer columns. To facilitate the learning of useful control flow, we propose two modifications to the Transformer architecture, copy gate and geometric attention. Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task, as well as near-perfect accuracy on the simple arithmetic task and a new variant of ListOps testing for generalization across computational depths. NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing. Our code is public.
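
To make "attending to the closest match" concrete, here is a minimal sketch of one way such a weighting can be computed. It is an illustration, not the authors' implementation. It assumes that a matrix of match probabilities `p[i][j]` (target `i`, source `j`) is already given, that a position does not attend to itself, and that ties between equally distant sources go to the left one first. The function name `geometric_attention` and all three choices are assumptions made for this example.

```python
def geometric_attention(p):
    """Turn match probabilities into closest-match attention weights.

    p[i][j] is a probability in [0, 1] that source j matches target i.
    The weight of source j is p[i][j] times the probability that no
    closer source has matched already.
    """
    n = len(p)
    attn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Visit sources from nearest to farthest. Skipping j == i and
        # breaking ties toward the left are assumptions of this sketch.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (abs(i - j), j))
        remaining = 1.0  # probability that nothing closer has matched
        for j in order:
            attn[i][j] = p[i][j] * remaining
            remaining *= 1.0 - p[i][j]
    return attn


if __name__ == "__main__":
    probs = [
        [0.0, 0.5, 0.5],
        [0.9, 0.0, 0.9],
        [1.0, 0.2, 0.0],
    ]
    for row in geometric_attention(probs):
        print([round(w, 3) for w in row])
```

Because each weight is discounted by everything closer, a distant source receives attention only when nearer sources fail to match. Each row sums to at most 1, unlike softmax attention, where each row sums to exactly 1.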

Paper Structure

This paper contains 42 sections, 13 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Left: an ideal sequence of computations in a Transformer for an arithmetic expression. Right: ordering (numbers in the grid) of source positions used in geometric attention (Eq. \ref{eq:order}; $N=5$).
  • Figure 2: Example visualization of NDR. For other models, see Appendix \ref{app:visual}. Top: Attention map for different steps. The x/y-axis corresponds to source/target positions, respectively. Each position focuses on the column to the right, except the last one where the result is read from, which focuses on the last operation. The focus becomes clear only once the result is available. Bottom: gate activations for different steps/layers. The gates remain closed until the data dependencies are satisfied.
  • Figure 3: Example visualization of NDR on ListOps. The top row shows head 13 in different steps, which controls which arguments are used in which step. The bottom row shows different heads in different key steps. Please refer to Sec. \ref{sec:analysis} for the step-by-step description. More visualizations are provided in the appendix: Fig. \ref{appendix:fig:all_attention_geom_listops} shows the max of attention over all heads for all steps, Fig. \ref{appendix:fig:all_attention_geom_listops_h13} shows all steps of head 13, and Fig. \ref{appendix:fig:all_gates_geom_listops} shows the corresponding gates.
  • Figure 4: Average number of steps/layers for different sequence lengths on the compositional table lookup task for the Transformer with relative positional encodings and the ACT variant described in Appendix \ref{app:abla}. The red line shows $T_\text{max}=14$. Note that the sequence length shown here includes the begin and end tokens. Thus, the sequence length of 4 corresponds to one function application (3 for the identity function, i.e., no function is applied).
  • Figure 5: Structure of Transformer/NDR layer with a copy gate (Sec. \ref{sec:gating}). The blue part corresponds to the standard Transformer, except for the missing residual connection around the feedforward block ("FF: Update"). The gray part is the copy gate. The feedforward part corresponding to the gate is usually significantly smaller than the one used for the update.
  • ...and 9 more figures
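
The copy gate described in the Figure 5 caption can be illustrated with a small sketch of the gating step alone. This is a hedged illustration, not the paper's code. It assumes the candidate update and the gate logits for one column have already been produced by the "FF: Update" block and the smaller gate feedforward block. The names `copy_gate_step`, `update`, and `gate_logits` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def copy_gate_step(h, update, gate_logits):
    """Blend a column's previous state with its candidate update.

    h, update, and gate_logits are equal-length lists of floats for one
    column. A gate near 0 copies the old state (the layer is effectively
    skipped); a gate near 1 replaces it with the update.
    """
    out = []
    for old, new, logit in zip(h, update, gate_logits):
        g = sigmoid(logit)
        out.append(g * new + (1.0 - g) * old)
    return out


if __name__ == "__main__":
    state = [1.0, 2.0]
    candidate = [5.0, 6.0]
    print(copy_gate_step(state, candidate, [-50.0, -50.0]))  # gate closed
    print(copy_gate_step(state, candidate, [50.0, 50.0]))    # gate open
```

With the gate closed, the column carries its state forward unchanged, which is consistent with the Figure 2 observation that gates remain closed until the data dependencies are satisfied.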