On-Chip Learning via Transformer In-Context Learning

Jan Finkbeiner, Emre Neftci

TL;DR

This work reframes self-attention in autoregressive decoder-only transformers as a local plasticity process and implements it on the Loihi 2 neuromorphic chip to enable on-chip, inference-time adaptation. By treating KV-cache construction as two- and three-factor local learning rules, the approach achieves on-chip weight updates via Loihi's learning engine while performing token-by-token autoregressive inference. The authors demonstrate few-shot in-context learning on Omniglot with a simple decoder-only transformer across multiple hardware variants (Float, Quant, Lava, Loihi), showing competitive performance relative to gradient-based methods and highlighting the potential for lifelong on-device learning. The results support a closer integration of scalable transformer architectures with neuromorphic hardware to enable efficient, hardware-friendly on-chip learning and adaptation.

Abstract

Autoregressive decoder-only transformers have become key components of scalable sequence processing and generation models. However, the transformer's self-attention mechanism requires transferring prior token projections from main memory at each time step (token), which severely limits performance on conventional processors. Self-attention can be viewed as a dynamic feed-forward layer whose weight matrix depends on the input sequence, much like the result of local synaptic plasticity. Using this insight, we present a neuromorphic decoder-only transformer model that uses an on-chip plasticity processor to compute self-attention. Interestingly, the training of transformers enables them to "learn" the input context during inference. We demonstrate this in-context learning ability of transformers on the Loihi 2 processor by solving a few-shot classification problem. With this, we emphasize the importance of pretrained models, especially their ability to find simple, local, backpropagation-free learning rules that enable on-chip learning and adaptation in a hardware-friendly manner.
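The "dynamic feed-forward layer" view in the abstract can be made concrete with a small NumPy sketch. It is an illustration only, not the authors' Loihi 2 implementation: the class name `PlasticAttention`, the dimensions, and the random projections are all hypothetical, and the mapping onto Loihi's two- and three-factor learning rules is not reproduced here. The sketch assumes a single attention head with standard scaled dot-product softmax attention. Each incoming token writes one row into two "plastic" matrices `K` and `V` (the KV cache) using only that token's own projections, and the output is then a two-layer feed-forward pass through those matrices. The result is checked against ordinary batch causal attention.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


class PlasticAttention:
    """Single-head causal self-attention with the KV cache held as weights.

    Hypothetical illustration: K and V play the role of synaptic matrices
    that are written token by token instead of being trained by backprop.
    """

    def __init__(self, d_model, d_head, max_tokens, rng):
        scale = 1.0 / np.sqrt(d_model)
        # Fixed ("pretrained") projections; random here for illustration.
        self.Wq = rng.normal(0.0, scale, (d_head, d_model))
        self.Wk = rng.normal(0.0, scale, (d_head, d_model))
        self.Wv = rng.normal(0.0, scale, (d_head, d_model))
        # Plastic matrices: one row per token seen so far.
        self.K = np.zeros((max_tokens, d_head))
        self.V = np.zeros((max_tokens, d_head))
        self.t = 0
        self.d_head = d_head

    def step(self, x):
        """Process one token: local write, then a feed-forward read."""
        q, k, v = self.Wq @ x, self.Wk @ x, self.Wv @ x
        # Write phase: row t depends only on the current token's activity.
        self.K[self.t] = k
        self.V[self.t] = v
        self.t += 1
        # Read phase: two-layer feed-forward pass through the written weights.
        K, V = self.K[: self.t], self.V[: self.t]
        hidden = softmax(K @ q / np.sqrt(self.d_head))
        return V.T @ hidden


def batch_causal_attention(X, Wq, Wk, Wv):
    """Reference: standard causal softmax attention over the whole sequence."""
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V


rng = np.random.default_rng(0)
layer = PlasticAttention(d_model=8, d_head=4, max_tokens=6, rng=rng)
X = rng.normal(size=(6, 8))  # six tokens of a toy input sequence

streamed = np.stack([layer.step(x) for x in X])
reference = batch_causal_attention(X, layer.Wq, layer.Wk, layer.Wv)
print(np.allclose(streamed, reference))
```

In this sketch the write step never needs a global error signal or information from other tokens, which is the sense in which the abstract calls the rule local and backpropagation-free. The read step only touches the rows written so far.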

Paper Structure

This paper contains 8 sections, 2 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Illustration of an Autoregressive Decoder-Only Transformer layer.
  • Figure 2: Illustration of the problem setup for the $N$-way $K$-shot few-shot learning experiment. Following prior work (Vinyals et al., 2016), a support set of $N$ image samples is associated with $K$ arbitrary labels. In our experiments, we used the Omniglot dataset (Lake et al., 2015) instead of dog breeds.
  • Figure 3: Illustration demonstrating the interpretation of the self-attention mechanism as a local plasticity rule.