Conditional computation in neural networks: principles and research trends

Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi

TL;DR

The paper tackles how neural networks can adapt computation to inputs by enabling dynamic activation of tokens, layers, and sub-modules through a unifying sparse-modularity framework. It presents three concrete implementations—mixture-of-experts, token selection, and early-exit networks—grounded in a differentiable sampling approach via the Gumbel-Softmax. It analyzes benefits in efficiency, specialization, and explainability, and discusses emerging applications such as scientific discovery and semantic communication. The work further outlines open challenges in global routing, adaptive budget control, and the development of benchmarks to evaluate specialization and transfer, shaping a roadmap for future research in conditional computation.

Abstract

This article summarizes principles and ideas from the emerging area of applying conditional computation methods to the design of neural networks. In particular, we focus on neural networks that can dynamically activate or deactivate parts of their computational graph conditionally on their input. Examples include the dynamic selection of input tokens, layers (or sets of layers), and sub-modules inside each layer (e.g., channels in a convolutional filter). We first provide a general formalism to describe these techniques in a uniform way. Then, we introduce three notable implementations of these principles: mixture-of-experts (MoE) networks, token selection mechanisms, and early-exit neural networks. The paper aims to provide a tutorial-like introduction to this growing field. To this end, we analyze the benefits of these modular designs in terms of efficiency, explainability, and transfer learning, with a focus on emerging application areas ranging from automated scientific discovery to semantic communication.

Paper Structure (15 sections, 20 equations, 5 figures, 1 table)


Figures (5)

  • Figure 1: Input sparsity: A differentiable mechanism subsamples the input tokens to be processed by the later parts of the network (the figure shows original image patches, but the tokens can equivalently be replaced by their latent representations if we consider an intermediate layer of the architecture).
  • Figure 2: Width sparsity: Different parts of a layer (e.g., experts) can be activated based on the conditioning value.
  • Figure 3: Depth sparsity: A subset of the layers can be deactivated by the sampling mechanism, like in early-exit networks.
  • Figure 4: Overview of the Gumbel-Softmax trick. We show in orange differentiable operations (the argmax's gradient being zero almost everywhere) and with a dashed arrow the relaxed backward path.
  • Figure 5: Example of the accuracy-FLOPs trade-off for inference with A-ViT [yin2022vit]. Specifically, the architecture is a DeiT trained on Imagenette.
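
The Gumbel-Softmax trick named in Figure 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`gumbel_softmax`, `relaxed_backward`), the temperature `tau`, and the example logits are all assumptions made for illustration. The forward pass takes a hard argmax over Gumbel-perturbed logits, and the backward pass uses the gradient of the softmax relaxation instead of the argmax, whose gradient is zero almost everywhere.

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    # Gumbel(0, 1) noise by inverse transform sampling: -log(-log(U)).
    # U is clamped away from 0 and 1 to keep both logarithms finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    # Hypothetical helper: returns (hard one-hot sample, relaxed sample).
    noisy = [l + sample_gumbel(rng) for l in logits]
    index = max(range(len(noisy)), key=noisy.__getitem__)
    hard = [1.0 if i == index else 0.0 for i in range(len(noisy))]
    soft = softmax([n / tau for n in noisy])
    return hard, soft


def relaxed_backward(soft, upstream, tau):
    # Gradient of the loss w.r.t. the logits through the softmax relaxation:
    # d soft_i / d logit_j = soft_i * (delta_ij - soft_j) / tau.
    dot = sum(u * s for u, s in zip(upstream, soft))
    return [s * (u - dot) / tau for s, u in zip(soft, upstream)]


# Usage: the hard sample is used in the forward pass, while the gradient
# is taken from the relaxed sample (a straight-through style estimator).
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]          # assumed example scores for 3 options
hard, soft = gumbel_softmax(logits, tau=0.5, rng=rng)
upstream = [1.0, 0.0, 0.0]         # assumed dLoss/dOutput for illustration
grad = relaxed_backward(soft, upstream, tau=0.5)
print("hard:", hard)
print("soft:", soft)
print("grad wrt logits:", grad)
```

Lower values of `tau` push the relaxed sample closer to the one-hot sample but make the relaxed gradient less smooth; the choice of 0.5 here is arbitrary.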