Streamlined optical training of large-scale modern deep learning architectures with direct feedback alignment

Ziao Wang, Kilian Müller, Matthew Filipovich, Julien Launay, Ruben Ohana, Gustave Pariente, Safa Mokaadi, Charles Brossollet, Fabien Moreau, Alessandro Cappelli, Iacopo Poli, Igor Carron, Laurent Daudet, Florent Krzakala, Sylvain Gigan

TL;DR

This work tackles the training bottleneck of backpropagation by combining direct feedback alignment (DFA) with an optical processing unit (OPU) that performs large-scale random projections. The central approach, ODFA, encodes error signals optically via $\boldsymbol{s}^{(l)} = \boldsymbol{T}^{(l)} \boldsymbol{e}$ and updates network parameters in parallel, achieving energy-efficient operations up to $\text{O}(10^3)$ TeraOPS under low power. The authors demonstrate ODFA across language, vision, and diffusion tasks, including Transformers with over $10^9$ parameters, climate models, and diffusion models, showing competitive performance and clear scalability advantages at extreme model sizes, especially when memory offloading is used. Overall, the results reveal a promising route to sustain AI growth beyond traditional von Neumann limits by pairing physics-informed hardware with compatible learning algorithms, potentially enabling ultra-deep models with lower energy footprints.

Abstract

Modern deep learning relies nearly exclusively on dedicated electronic hardware accelerators. Photonic approaches, with low consumption and high operation speed, are increasingly considered for inference but, to date, remain mostly limited to relatively basic tasks. Simultaneously, the problem of training deep and complex neural networks, overwhelmingly performed through backpropagation, remains a significant limitation to the size and, consequently, the performance of current architectures and a major compute and energy bottleneck. Here, we experimentally implement a versatile and scalable training algorithm, called direct feedback alignment, on a hybrid electronic-photonic platform. An optical processing unit performs large-scale random matrix multiplications, which is the central operation of this algorithm, at speeds up to 1500 TeraOPS under 30 Watts of power. We perform optical training of modern deep learning architectures, including Transformers, with more than 1B parameters, and obtain good performances on language, vision, and diffusion-based generative tasks. We study the scaling of the training time, and demonstrate a potential advantage of our hybrid opto-electronic approach for ultra-deep and wide neural networks, thus opening a promising route to sustain the exponential growth of modern artificial intelligence beyond traditional von Neumann approaches.
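
To make the "central operation" named in the abstract concrete, the following is a minimal NumPy sketch of direct feedback alignment on a toy fully connected classifier, run entirely in software. The network sizes, synthetic data, learning rate, and the names `train_dfa`, `T1`, and `T2` are illustrative assumptions, not the paper's configuration. On the hybrid platform, the products with the fixed random matrices would be carried out by the optical processing unit rather than in NumPy.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def cross_entropy(p, Y):
    return float(-np.mean(np.sum(Y * np.log(p + 1e-12), axis=1)))


def train_dfa(steps=300, lr=0.2, seed=0):
    """Train a small tanh network with DFA; returns the loss at every step."""
    rng = np.random.default_rng(seed)
    n_in, n_hidden, n_out, n_samples = 20, 64, 3, 512  # assumed toy sizes

    # Synthetic classification data (illustrative only).
    X = rng.standard_normal((n_samples, n_in))
    labels = np.argmax(X @ rng.standard_normal((n_in, n_out)), axis=1)
    Y = np.eye(n_out)[labels]

    # Trainable forward weights.
    W1 = 0.1 * rng.standard_normal((n_in, n_hidden))
    W2 = 0.1 * rng.standard_normal((n_hidden, n_hidden))
    W3 = 0.1 * rng.standard_normal((n_hidden, n_out))

    # Fixed random feedback matrices, one per hidden layer; never updated.
    # With row-vector samples, `e @ T1` plays the role of T^(l) e.
    T1 = rng.standard_normal((n_out, n_hidden)) / np.sqrt(n_out)
    T2 = rng.standard_normal((n_out, n_hidden)) / np.sqrt(n_out)

    losses = []
    for _ in range(steps):
        # Forward pass.
        h1 = np.tanh(X @ W1)
        h2 = np.tanh(h1 @ W2)
        p = softmax(h2 @ W3)
        losses.append(cross_entropy(p, Y))

        # Output error: gradient of the mean cross-entropy w.r.t. the logits.
        e = (p - Y) / n_samples

        # DFA: each hidden layer receives its own random projection of e.
        # The two signals do not depend on each other, so they could be
        # computed in parallel; no weight transposes are used.
        d1 = (e @ T1) * (1.0 - h1 ** 2)
        d2 = (e @ T2) * (1.0 - h2 ** 2)

        W3 -= lr * (h2.T @ e)   # the output layer uses its true gradient
        W2 -= lr * (h1.T @ d2)
        W1 -= lr * (X.T @ d1)
    return losses


if __name__ == "__main__":
    history = train_dfa()
    print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

Unlike backpropagation, the hidden-layer signals here never pass through `W2.T` or `W3.T`, which is why the error can be broadcast to all layers at once.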

Paper Structure

This paper contains 33 sections, 11 equations, and 27 figures.

Figures (27)

  • Figure 1: Overview of the direct feedback alignment (DFA) algorithm and Optical Processing Unit (OPU). a, Concept of the direct feedback error propagation. Back-propagation (BP) transmits the error $\vec{e}$ sequentially from the final to the first layer, while DFA distributes error signals in parallel via random projections. b, Illustration of the OPU. Coherent laser light illuminates a DMD, then propagates through a strongly scattering medium before being captured by a camera. The error vector $\vec{e}$ is ternarized and encoded as binary pixels on the DMD, subsequently propagating through the complex diffusive medium to effectively perform $\mathbf{T}\vec{e}$, where $\mathbf{T}$ is a fixed, large, complex Gaussian random matrix. While the camera records an intensity pattern $|\mathbf{T}\vec{e}|^2$ that depends non-linearly on $\vec{e}$, linear random projections can be recovered thanks to the encoding strategy (see ref. ohana2023linear and SI Note \ref{sec:dfaodfa}). c, Rack-mount cabinet for OPUs and computers. A custom enclosure houses OPUs and a computer for simultaneous operations, designed as a plug-and-play solution. Blueprint inset shows the OPU's internal layout with a $1\,\text{cm}$ scale. The OPU interfaces with the computer through Python libraries and is compatible with NumPy and PyTorch (see SI Note \ref{sec:expexp}).
  • Figure 2: Optical training of a generative Transformer on the Movie-Dialogs dataset. a, Schematic representation of our Optical DFA (ODFA) training algorithm. The gradient vector $\vec{e}_p$ from the last layer is multiplied by a random matrix $\mathbf{T}$ and sent in parallel to every decoder block except the last one, while local backprop is applied within each block, with no gradient communication among blocks. b, Loss trajectories of the generative Transformers trained using various methods. All methods used the same architecture; the first four shared the training configuration adopted for ODFA, whereas BP$^*$ used a separate configuration with which backpropagation reaches its best performance. SHLW trained only the last decoder layer with backpropagation, keeping all other parameters frozen (see SI Note \ref{sec:llmconfig}). c, Examples of text generated by the ODFA-trained Transformer at $0\%$, $20\%$, and $100\%$ of training, respectively. A bent arrow represents the generated newline token. d, Mean causal attention for a given prompt (at $100\%$ training). We used causal attention, meaning that each token in the sequence attends only to itself and the preceding tokens. The translucent words indicate tokens that the Transformer has not yet processed. Shades of orange represent the attention weight on certain words.
  • Figure 3: ODFA training of ViT and FCNN on the high-dimensional climate projection dataset. a, Schematic illustration of BP/ODFA-trained ViT/FCNN on the climate projection dataset. The inputs are the global distributions of four forcing factors (e.g., carbon dioxide) from 2015 to 2100, with a dimension of $55.3$k. The outputs are the global distributions of surface air temperature at specific years, with a dimension of $13.8$k. The dataset contains both historical recordings for past years and simulation results based on Earth system models for future years. One ViT with 134M parameters and one FCNN with 1.3B parameters were each trained with both BP and ODFA. b, Ground truth and predictions of ODFA-trained ViT/FCNN for the temperature change in the year 2100. c, Performance of BP/ODFA-trained models on two RMSE-based metrics. The BP results of ViT(Reproduce) were obtained using the same configuration as ViT(ODFA), and the results of ConvNN are from Watson-Parris2022 (see SI Note \ref{sec:vitmetric}). The ViT benchmark (dashed red line) corresponds to the optimal training of the ViT using a much larger dataset (Nguyen2023). d, Absolute error between the targets and the predictions of the BP-trained and ODFA-trained ViT, respectively. Both show a similar error level.
  • Figure 4: Scaling of the training time per sample for ODFA-trained extreme-scale FCNNs. a, Training time of FCNNs versus hidden layer size ($100$ to $3080$ neurons per hidden layer). Dot-solid lines represent training using ODFA; cross-dashed lines are for BP. Both follow the expected quadratic behavior. Curve colors indicate hidden layer count: yellow for shallow, purple for deep FCNNs. At the rightmost configuration ($96$ layers, $3080$ neurons each), where the GPU's memory limit is reached, BP ($13.39$ ms/sample) is slower than ODFA ($13.09$ ms/sample). Inset highlights more data around this configuration. b, Training time versus number of layers ($1$ to $96$). Curve colors indicate hidden layer size: yellow for narrow, purple for wide FCNNs. Training time increases linearly with layer count. c, Extended comparison using an offloading technique to overcome GPU memory limitations, across hidden layer sizes from $100$ to $5200$, up to 2.7 billion parameters. Curve colors indicate hidden layer count: red for shallow, blue for deep FCNNs. Both BP and ODFA employ the same offloading strategy. ODFA shows a sustained speed advantage over BP at extra-large scales. Inset indicates the time difference. d, Ratio of training time: ODFA (GPU-OPU) vs. DFA (GPU only). The time difference narrows at higher dimensions until the GPU memory limit is reached.
  • Figure S1: Performance of DFA and TDFA on MNIST. DFA and TDFA were used to train a three-layer neural network on the full MNIST dataset with the following layer dimensions: $[784, 100, 10]$. The batch size was $100$, and the model was trained over $50$ epochs. TDFA encodes the gradient vectors into three values, $[-1, 0, 1]$. The results were obtained from $20$ runs with the same configuration. Left, Validation loss during training with DFA and TDFA. The mean and the range are computed over the $20$ models trained with the same configuration. Right, Test accuracy during training with DFA and TDFA. A zoom-in inset shows the test accuracy after $30$ epochs.
  • ...and 22 more figures
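
The captions of Figure 1 and Figure S1 mention two ingredients that a short simulation can illustrate: ternarizing the error vector into $[-1, 0, 1]$, and recovering a linear projection from intensity-only camera frames. The NumPy sketch below is an illustration under stated assumptions, not the OPU's documented procedure. It places an anchor pattern on a reserved block of pixels and combines three intensity frames, which is one way to linearize such measurements; the encoding strategy actually used is the one described in ref. ohana2023linear and the SI. The threshold rule, the anchor layout, and all function names are hypothetical.

```python
import numpy as np


def ternarize(e, threshold):
    """Map a real vector to values in {-1, 0, +1} (hypothetical threshold rule)."""
    return np.sign(e) * (np.abs(e) > threshold)


def camera(T, pixels):
    """Simulated camera frame: intensity |T x|^2 for a binary DMD pattern."""
    return np.abs(T @ pixels) ** 2


def linear_frame(T, x_bin, anchor):
    """Combine three intensity frames into a quantity that is linear in x_bin."""
    zeros_x = np.zeros_like(x_bin)
    zeros_a = np.zeros_like(anchor)
    both = camera(T, np.concatenate([x_bin, anchor]))
    only_x = camera(T, np.concatenate([x_bin, zeros_a]))
    only_a = camera(T, np.concatenate([zeros_x, anchor]))
    # With u = T_x x and v = T_a a: |u + v|^2 - |u|^2 - |v|^2 = 2 Re(conj(v) u).
    return both - only_x - only_a


def optical_projection(T, e, anchor, threshold):
    """Project a ternarized vector using binary patterns only."""
    t = ternarize(e, threshold)
    pos = (t > 0).astype(float)  # pixels switched on for the +1 entries
    neg = (t < 0).astype(float)  # pixels switched on for the -1 entries
    return linear_frame(T, pos, anchor) - linear_frame(T, neg, anchor)


def effective_matrix(T, n, anchor):
    """Real matrix M such that optical_projection(...) equals M @ ternarize(e)."""
    T_x, T_a = T[:, :n], T[:, n:]
    v = T_a @ anchor
    return 2.0 * np.real(np.conj(v)[:, None] * T_x)


def demo(seed=0, n=32, n_anchor=8, m=64, threshold=0.5):
    rng = np.random.default_rng(seed)
    shape = (m, n + n_anchor)
    # Complex Gaussian matrix standing in for the scattering medium.
    T = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)
    e = rng.standard_normal(n)
    anchor = np.ones(n_anchor)  # fixed binary anchor pattern (assumed layout)
    measured = optical_projection(T, e, anchor, threshold)
    expected = effective_matrix(T, n, anchor) @ ternarize(e, threshold)
    return measured, expected


if __name__ == "__main__":
    measured, expected = demo()
    print(np.allclose(measured, expected))
```

In this sketch the anchor-only frame is identical for the positive and negative halves, so it could be measured once and reused. The effective matrix is fixed by the simulated medium and the anchor, which matches the requirement that the feedback matrix stay constant during training.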