STAR: Synthesis of Tailored Architectures

Armin W. Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli, Michael Poli

TL;DR

This work proposes a new approach for the synthesis of tailored architectures (STAR). It builds on a novel search space, based on the theory of linear input-varying systems, that supports a hierarchical numerical encoding into architecture genomes.

Abstract

Iterative improvement of model architectures is fundamental to deep learning: Transformers first enabled scaling, and recent advances in model hybridization have pushed the quality-efficiency frontier. However, optimizing architectures remains challenging and expensive. Current automated or manual approaches fall short, largely due to limited progress in the design of search spaces and due to the simplicity of resulting patterns and heuristics. In this work, we propose a new approach for the synthesis of tailored architectures (STAR). Our approach builds on a novel search space based on the theory of linear input-varying systems, which supports a hierarchical numerical encoding into architecture genomes. STAR genomes are automatically refined and recombined with gradient-free, evolutionary algorithms to optimize for multiple model quality and efficiency metrics. Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling.
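
To make the abstract's description of gradient-free, multi-objective evolution concrete, the following is a minimal sketch of such a loop, not the STAR implementation. Genomes here are flat lists of integers, and `toy_objectives` is a hypothetical stand-in for real quality and size measurements. The selection rule (keep the Pareto front), the one-point crossover, the mutation rate, and every function name are assumptions made for illustration.

```python
import random

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(scored):
    """Keep the (genome, objectives) pairs that no other pair dominates."""
    return [p for p in scored if not any(dominates(q[1], p[1]) for q in scored)]

def toy_objectives(genome):
    # Hypothetical stand-ins: larger gene values lower the "loss" but raise the "size".
    loss = sum(1.0 / (1 + g) for g in genome)
    size = sum(genome)
    return (loss, size)

def recombine(a, b, rng):
    # One-point crossover between two parent genomes (an assumed operator).
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome, rng, n_values=8, rate=0.1):
    # Resample each gene with probability `rate` (an assumed operator).
    return [rng.randrange(n_values) if rng.random() < rate else g for g in genome]

def evolve(pop_size=16, genome_len=6, generations=10, seed=0):
    rng = random.Random(seed)
    population = [[rng.randrange(8) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Assessment: score every genome on both objectives.
        scored = [(g, toy_objectives(g)) for g in population]
        parents = [g for g, _ in pareto_front(scored)]
        # Recombination and mutation refill the population.
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.choice(parents), rng.choice(parents)
            children.append(mutate(recombine(a, b, rng), rng))
        population = parents + children
    return pareto_front([(g, toy_objectives(g)) for g in population])

if __name__ == "__main__":
    for genome, (loss, size) in evolve():
        print(genome, round(loss, 3), size)
```

In a real setting, the assessment step would train and evaluate each candidate architecture, which is where nearly all of the cost lies. The toy objectives only keep the sketch self-contained.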

Paper Structure

This paper contains 66 sections, 2 equations, 46 figures, 6 tables.

Figures (46)

  • Figure 1.1: [Top Left]: Population of architectures undergoing iterative STAR evolution to minimize number of parameters and maximize quality. [Top Right]: Baseline Transformer++, hybrid model, and representative architecture found via STAR. [Bottom]: STAR evolution optimizes architectures using principles of evolutionary optimization, including assessment, recombination, and mutation.
  • Figure 3.1: Hierarchical structure of the STAR genome. Each sequence at lower levels is summarized into a single value at higher levels, enabling its treatment as a discrete variable. We leverage this property extensively when optimizing backbones directly.
  • Figure 4.1: Fundamental operations of STAR evolution (akin to other evolutionary optimization algorithms).
  • Figure 4.2: Training perplexity for all runs during STAR evolution of a population.
  • Figure 5.1: Evolutionary algorithms: Final populations evolved with the Firefly Algorithm (FA), Genetic Algorithm (GA), and Non-dominated Sorting Genetic Algorithm II (NSGA-2).
  • ...and 41 more figures
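
The caption of Figure 3.1 says that each lower-level sequence is summarized into a single value at a higher level, so it can be treated as a discrete variable. One way to picture this, purely as an illustrative assumption, is a mixed-radix encoding that packs a short sequence of discrete choices into one integer and unpacks it again. The page does not specify the encoding STAR uses, so the functions `encode` and `decode` and the example radices are hypothetical.

```python
def encode(digits, radices):
    """Pack a sequence of discrete choices into one integer (mixed radix)."""
    value = 0
    for d, r in zip(digits, radices):
        if not 0 <= d < r:
            raise ValueError(f"digit {d} out of range for radix {r}")
        value = value * r + d
    return value

def decode(value, radices):
    """Unpack an integer produced by `encode` back into its choices."""
    digits = []
    for r in reversed(radices):
        value, d = divmod(value, r)
        digits.append(d)
    return digits[::-1]

if __name__ == "__main__":
    # Hypothetical lower level: three choices with 2, 3, and 4 options each.
    radices = [2, 3, 4]
    low_level = [1, 0, 2]
    summary = encode(low_level, radices)   # a single higher-level value
    print(summary, decode(summary, radices))
```

Under this assumption, a higher-level genome can list one such integer per unit, and an optimizer can mutate or swap whole units by changing a single value.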