Optical Transformers

Maxwell G. Anderson; Shi-Yuan Ma; Tianyu Wang; Logan G. Wright; Peter L. McMahon

TL;DR

This work investigates the feasibility of running Transformer operations on optical hardware to dramatically reduce energy consumption for large-scale models. By combining real optical experiments with calibrated simulations, the authors show that the linear operations in Transformers can run on optical hardware despite noise and imprecision, and they derive scaling laws indicating that the optical energy cost per MAC scales as $\frac{1}{d}$ with model width $d$. Their results suggest substantial energy advantages over digital processors, potentially reaching $100\times$ for some of the largest current models and growing by further orders of magnitude for future quadrillion-parameter models, contingent on hardware scaling and low-precision strategies. The study also analyzes design implications for optical neural-network accelerators, including data-access costs, memory strategies, and architecture trends, and discusses a roadmap towards scalable, optics-based deep learning with Transformers. Overall, optical accelerators emerge as a plausible path to far more energy-efficient inference for very large Transformers, provided future hardware can achieve the necessary density, speed, and memory integration.

Abstract

The rapidly increasing size of deep-learning models has caused renewed and growing interest in alternatives to digital computers to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for optical computing. To test this idea, we performed small-scale optical experiments with a prototype accelerator to demonstrate that Transformer operations can run on optical hardware despite noise and errors. Using simulations, validated by our experiments, we then explored the energy efficiency of optical implementations of Transformers and identified scaling laws for model performance with respect to optical energy usage. We found that the optical energy per multiply-accumulate (MAC) scales as $\frac{1}{d}$ where $d$ is the Transformer width, an asymptotic advantage over digital systems. We conclude that with well-engineered, large-scale optical hardware, it may be possible to achieve a $100 \times$ energy-efficiency advantage for running some of the largest current Transformer models, and that if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical computers could have a $>8,000\times$ energy-efficiency advantage over state-of-the-art digital-electronic processors that achieve 300 fJ/MAC. We analyzed how these results motivate and inform the construction of future optical accelerators along with optics-amenable deep-learning approaches. With assumptions about future improvements to electronics and Transformer quantization techniques (5$\times$ cheaper memory access, double the digital--analog conversion efficiency, and 4-bit precision), we estimated that optical computers' advantage against current 300-fJ/MAC digital processors could grow to $>100,000\times$.
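The scaling claim can be made concrete with a little arithmetic. The sketch below is only an illustration, not the authors' energy model: it assumes the optical cost per MAC is exactly `k / d`, where the constant `K_FJ` is a hypothetical placeholder chosen for readability, and it compares that cost against the 300 fJ/MAC digital figure quoted in the abstract. The function names are likewise hypothetical.

```python
DIGITAL_FJ_PER_MAC = 300.0  # digital baseline quoted in the abstract

# Hypothetical proportionality constant (fJ); NOT a value from the paper.
K_FJ = 3000.0


def optical_fj_per_mac(d, k_fj=K_FJ):
    """Energy per MAC under the assumed pure 1/d scaling law."""
    if d <= 0:
        raise ValueError("width d must be positive")
    return k_fj / d


def advantage(d, k_fj=K_FJ, digital_fj=DIGITAL_FJ_PER_MAC):
    """Ratio of digital to optical energy per MAC at width d."""
    return digital_fj / optical_fj_per_mac(d, k_fj)


if __name__ == "__main__":
    # Print width, assumed optical fJ/MAC, and the resulting advantage.
    for d in (1024, 4096, 16384):
        print(d, optical_fj_per_mac(d), advantage(d))
```

Under these assumptions the advantage grows linearly with $d$, so doubling the width doubles the ratio. The digital cost per MAC is treated as constant here, and real systems add electronic overheads (memory access, conversion) that this toy model omits.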

Paper Structure (42 sections, 6 equations, 13 figures, 7 tables)

Introduction
Background and Related Work
Transformer Models
Optical Accelerators
Acceleration of Linear Operations
Device Imprecision and Optical Shot Noise
Efficient Photon Usage
Optical Neural Network Energy Costs
Streaming Weights Versus Weights-In-Place
Previous Optical Neural Network Architectures
Optical Transformers
Architecture and Task
Transformer Computations on Optical Hardware
Simulation of Transformers on Optical Hardware
Hybrid Scheme
...and 27 more sections

Figures (13)

Figure 1: General scheme of an optical neural network (ONN) accelerator. Data is encoded and fed into the network, and the output is subject to shot noise. There are many experimental realizations of ONN accelerators, such as Mach-Zehnder interferometer meshes [shen2017deep; bogaerts2020programmable], crossbar arrays [mrr_crossbars; feldmann2020parallel], and wavelength-multiplexed micro-ring weight banks [Tait:15]. In this work, we adopt the free-space multiplier [Wang2022; spall2020fully; hayasaki1992optical] (top right) to demonstrate Transformer operations in optical experiments and as an example for our simulations, but our findings about Transformers on optical systems apply broadly to many optical-accelerator architectures, including those depicted in the inset.
Figure 2: Optical Transformer evaluation: prototype hardware; simulator model; Transformer architecture. Bottom: typical Transformer architecture, but with ReLU6 activation. Top left: experimental spatial light modulator (SLM)-based accelerator setup. From some layers (marked with a laser icon) we sampled dot products to run on real hardware. Top middle: linear operations, in light blue, run on a simulated accelerator with noise/error. Lookup tables (LUT) allow simulation using our setup's supported weight/activation values. Top right: our model of energy consumption for optical accelerators, based on assumptions and results from our experiment/simulations. The model accelerator system consists of random-access memory (RAM), a digital--analog converter (DAC), light modulation (MOD), amplification (AMP), an analog--digital converter (ADC), and an optical component that performs the computations.
Figure 3: Comparison of experimental noise and simulated Optical Transformer noise tolerance. Top: simulated performance (Wikitext-103 validation perplexity (PPL), shown as contours) versus percent mean-relative noise in feed-forward (FF) and attention (Attn) layers. Noise levels from experimental data are marked with a star for dot products sampled from the first and last Transformer encoder layers. Bottom: comparison of the simulated noise model to the error from experimental data. The Gaussian shape of the simulated noise models the experimental errors accurately.
Figure 4: Simulations of Optical Transformer behavior with varying photon usage. Left: Wikitext-103 validation-set perplexity (PPL) versus embedding dimension $d$ and total photons used for a single inference (predicting next token in language modelling, or processing one sequence in a classification task). 8-bit digital model performance is shown with dashed lines. With sufficiently large numbers of photons, optical hardware can achieve the same perplexity as digital-electronic hardware, under the assumption that the optical hardware's precision is limited by photon shot noise. Middle: Percent change in perplexity from the perplexity achieved when using $10^4$ photons per multiply-accumulate (MAC), versus photons-per-MAC. At $10^4$ photons-per-MAC, the perplexity is approximately as good as one can achieve, so this plot shows how the perplexity degrades from ideal as one uses fewer photons-per-MAC; the plot exhibits truncated power-law scaling. Right: Scaling of number of photons needed for an Optical Transformer to achieve the same perplexity as an 8-bit digital-electronic processor, versus model size.
Figure 5: Estimated energy usage of Transformer models on optical hardware for a single inference (predicting the next token in language modelling, or processing one sequence in a classification task). Hypothetical future model designs are labelled FUTURE-*. Estimated energy/MAC for digital systems is based on [reuthersurvey2020]. The trend for energy usage in optical systems (blue) is computed based on real models only. Inset: energy advantage of running on optics over estimated NVIDIA A100 usage. The advantage grows with the model compute. $\mathrm{M}=10^6$, $\mathrm{G}=10^9$, $\mathrm{T}=10^{12}$, $\mathrm{q}=10^{15}$ parameters.
...and 8 more figures
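
Figures 3 and 4 rest on the idea that shot noise limits optical precision and that spending more photons buys back accuracy. A minimal sketch of that trade-off follows; it is not the authors' simulator. It assumes non-negative inputs and weights in $[0, 1]$, a detector whose photon count has a mean proportional to the dot product, and a Gaussian approximation to Poisson shot noise (reasonable only for large counts). The names `noisy_dot`, `mean_relative_error`, and `photons_per_mac` are hypothetical.

```python
import math
import random


def noisy_dot(x, w, photons_per_mac, rng):
    """Dot product read out through a shot-noise-limited detector.

    Assumed model: each MAC is given a budget of `photons_per_mac`
    photons, so the mean detected count is
    photons_per_mac * sum(x_i * w_i). Shot noise is approximated as
    Gaussian with variance equal to the mean count.
    """
    exact = sum(xi * wi for xi, wi in zip(x, w))
    mean_count = photons_per_mac * exact
    count = rng.gauss(mean_count, math.sqrt(mean_count))
    # Clamp negative counts and rescale back to value units.
    return max(count, 0.0) / photons_per_mac


def mean_relative_error(d, photons_per_mac, trials=2000, seed=0):
    """Average |noisy - exact| / exact over random non-negative vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.random() for _ in range(d)]
        w = [rng.random() for _ in range(d)]
        exact = sum(xi * wi for xi, wi in zip(x, w))
        total += abs(noisy_dot(x, w, photons_per_mac, rng) - exact) / exact
    return total / trials


if __name__ == "__main__":
    # Sweep vector length and photon budget to see the trend.
    for d in (16, 256):
        for photons in (1.0, 100.0):
            print(d, photons, mean_relative_error(d, photons, trials=500))
```

In this toy model the error depends on the total number of photons per dot product, so a longer vector reaches the same relative error with fewer photons per MAC. That behaviour is consistent with the $\frac{1}{d}$ trend reported in the abstract, although the paper's own analysis is based on full Transformer simulations rather than isolated dot products.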
