Brain-Like Processing Pathways Form in Models With Heterogeneous Experts

Jack Cook, Danyal Akarca, Rui Ponte Costa, Jascha Achterberg

TL;DR

This work addresses how brain-like processing pathways emerge among heterogeneous neural regions by extending the Heterogeneous Mixture-of-Experts (HMoE) architecture with three inductive biases: a routing cost that penalizes the use of large experts, a scaling of that penalty by task performance, and random expert dropout. These biases yield a Mixture-of-Pathways (MoP) model whose pathways are stable, self-sufficient, and task-discriminative, and whose dynamics resemble cortical–subcortical interactions observed in the brain, including the multiple-demand system during difficult tasks. Evaluating on 82 Mod-Cog time-series cognitive tasks, the authors show that the MoP model exhibits difficulty-dependent pathway usage, dynamic reallocation during learning, and brain-like learning trajectories, none of which appear in the baseline HMoE. The findings offer a framework for neuroscience investigations into pathway formation and a pathway-aware, resource-efficient approach for future MoE-based machine learning systems.

Abstract

The brain is made up of a vast set of heterogeneous regions that dynamically organize into pathways as a function of task demands. Examples of such pathways can be found in the interactions between cortical and subcortical networks during learning, or in sub-networks specializing for task characteristics such as difficulty or modality. Despite the large role these pathways play in cognition, the mechanisms through which brain regions organize into pathways remain unclear. In this work, we use an extension of the Heterogeneous Mixture-of-Experts architecture to show that heterogeneous regions do not form processing pathways by themselves, implying that the brain likely implements specific constraints which result in the reliable formation of pathways. We identify three biologically relevant inductive biases that encourage pathway formation: a routing cost imposed on the use of more complex regions, a scaling factor that reduces this cost when task performance is low, and randomized expert dropout. When comparing our resulting Mixture-of-Pathways model with the brain, we observe that the artificial pathways in our model match how the brain uses cortical and subcortical systems to learn and solve tasks of varying difficulty. In summary, we introduce a novel framework for investigating how the brain forms task-specific pathways through inductive biases, and the effects these biases have on the behavior of Mixture-of-Experts models.
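
The three inductive biases named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the per-expert cost vector, the coefficient `lam`, the choice to multiply the penalty by task accuracy, and the uniform sampling of the dropout rate are all assumptions made here for illustration. The only properties taken from the text are that more complex experts cost more to use, that the cost shrinks when task performance is low, and that experts are dropped at random.

```python
import numpy as np

def routing_cost(weights, expert_costs):
    """Mean expected cost of a batch of routing weights.

    weights: (batch, n_experts), each row sums to 1.
    expert_costs: (n_experts,), larger for more complex experts (assumed values).
    """
    return float(np.mean(weights @ expert_costs))

def scaled_routing_penalty(weights, expert_costs, task_accuracy, lam=0.1):
    """Routing penalty scaled by task performance.

    Assumption: the penalty is multiplied by accuracy in [0, 1], so it is
    small while the task is still unsolved and grows as performance improves.
    `lam` is a hypothetical penalty coefficient.
    """
    return lam * task_accuracy * routing_cost(weights, expert_costs)

def expert_dropout(weights, beta, rng):
    """Randomly zero out experts, then renormalize the surviving weights.

    Assumption: the drop probability is sampled uniformly from [0, beta],
    treating beta as a maximum dropout value.
    """
    p = rng.uniform(0.0, beta)
    keep = rng.random(weights.shape) >= p
    # Guarantee at least one surviving expert per row (keep the strongest one).
    rows = np.flatnonzero(~keep.any(axis=-1))
    keep[rows, np.argmax(weights[rows], axis=-1)] = True
    masked = weights * keep
    return masked / masked.sum(axis=-1, keepdims=True)

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Three routes per layer: skip connection, simple expert, complex expert.
costs = np.array([0.0, 1.0, 4.0])  # hypothetical relative costs
w = softmax(rng.standard_normal((8, 3)))
penalty_early = scaled_routing_penalty(w, costs, task_accuracy=0.2)
penalty_late = scaled_routing_penalty(w, costs, task_accuracy=0.9)
w_dropped = expert_dropout(w, beta=0.8, rng=rng)
```

Under these assumptions, `penalty_early` is smaller than `penalty_late` for the same routing weights, which is one way to let a model rely on complex experts while a task is still being learned and push it toward cheaper routes once the task is solved.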

Paper Structure

This paper contains 30 sections, 9 equations, 16 figures, 1 table, 2 algorithms.

Figures (16)

  • Figure 1: Schematic of our baseline model architecture. Information is passed through three layers, each of which can dynamically route information to experts of different computational complexity.
  • Figure 2: Models trained with routing cost and task-based scaling exhibit more stable routing. Correlations are calculated between the routing patterns across 20 training runs of each model setup.
  • Figure 3: Model accuracy after removing low-weighted experts across different training dropout levels. If models trained without dropout ($\beta=0$) are prevented from using experts that contribute very little to the output, accuracy drops precipitously, from 98.2% to 15.1%. By comparison, the accuracy of models trained with a maximum dropout value of $\beta=0.8$ only drops from 86.5% to 77.7%.
  • Figure 4: Task clustering derived from expert usage patterns. Clusters are averaged over three phases of each task: the pre-stimulus phase, during which the task is known but no input data is presented, the stimulus phase, during which input data is presented, and the response phase. Left: Sizes of clusters across training runs. Right: Average routing weights across training runs by cluster (y-axis) and task phase. In each phase, expert usages within each layer (across each layer's skip connection, simple expert, and complex expert) sum to 100%. The complex expert is rarely used in the first two phases, during which the model outputs zero, but has an average usage as high as 11% in the most complex cluster (see the expert-usage appendix). L1 / L2 / L3 = Layer 1 / Layer 2 / Layer 3.
  • Figure 5: Our model allocates less computation toward solving simpler tasks. Additionally, when the most complex experts in each layer are disabled, our model is still able to solve the simplest tasks with high accuracy.
  • ...and 11 more figures
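
The routing-stability comparison described in the Figure 2 caption can be sketched as a mean pairwise correlation between runs. This is a minimal illustration, not the authors' analysis code: the function name `routing_stability`, the array layout (runs × tasks × experts), and the use of Pearson correlation averaged over all pairs of runs are assumptions made here.

```python
import numpy as np

def routing_stability(routing):
    """Mean pairwise Pearson correlation between routing patterns of runs.

    routing: (n_runs, n_tasks, n_experts) array of average routing weights
    (hypothetical layout). Each run is flattened to one vector and compared
    with every other run.
    """
    flat = routing.reshape(routing.shape[0], -1)
    corr = np.corrcoef(flat)                   # (n_runs, n_runs)
    upper = np.triu_indices(corr.shape[0], 1)  # each pair of runs once
    return float(corr[upper].mean())

rng = np.random.default_rng(0)
base = rng.random((20, 3))
# Runs that share one routing pattern up to small noise.
stable_runs = np.stack([base + 0.01 * rng.standard_normal(base.shape)
                        for _ in range(4)])
# Runs with unrelated routing patterns.
unstable_runs = rng.random((4, 20, 3))
stable_score = routing_stability(stable_runs)
unstable_score = routing_stability(unstable_runs)
```

A score near 1 means different training runs converge on the same assignment of tasks to experts; a score near 0 means the assignment changes from run to run.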