Scaled Signed Averaging Improves In-Context and Early Learning Benchmark Performance in Small Transformers

Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher

TL;DR

The paper investigates why in-context learning in small transformers struggles with semantic and mathematical tasks, attributing this in part to the Softmax scoring function in attention. It proposes scaled signed averaging (SSA) as a flexible, saturation-aware alternative that can emulate exponential-like weighting while avoiding Softmax saturation. Across controlled ICL tasks involving quantifiers and linear functions and on standard NLP benchmarks, SSA improves generalization for both decoder-only and encoder-only models, outperforming Softmax baselines in many settings. While not a universal fix and constrained by hardware and data, SSA offers a practical, drop-in improvement for small transformers, expanding reliable ICL and NLP capabilities in resource-limited regimes.

Abstract

While the in-context learning (ICL) abilities of large language models have drawn much attention, we examine some of their limitations on semantic tasks involving quantifiers like "all" and "some", as well as on tasks with linear functions. We identify Softmax, the scoring function in the attention mechanism, as a contributing factor to these limitations. We propose scaled signed averaging (SSA), a novel alternative to Softmax, to mitigate these problems. We show that SSA significantly improves performance on our ICL tasks. In addition, SSA outperforms transformer models with Softmax on several early learning NLP benchmarks and linguistic probing tasks in zero- and few-shot settings.

Paper Structure

This paper contains 26 sections, 5 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Plots showing examples of boundary values for different models. (Left) Full transformer 12L8AH model tested on $f(x) = 9x$ and (Right) Transformer 12L8AH without MLP model trained on $D_{\cal I}= D_{\cal F}= {\cal N}(0,1)$ and tested on $f(x) = 10x$.
  • Figure 2: Attention maps for an ICL example for the task "every" of type $(x_1,f(x_1),x_2,f(x_2),...,x_n)$, where the query $x_n$ is a big value.
  • Figure 3: (Left) Comparison plot showing the evolution of MSE for SSA and Softmax-based models (12L8AH) with $D_{\cal F} , D_{\cal I} , D^{test}_{\cal I} \sim {\mathcal{N}}(0,1)$ and varying $D_{\cal F}^{test} \sim {\mathcal{N}}(0, \sigma)$. The heatmaps show the evolution of the logarithm of MSE for the Softmax (Middle) and SSA (Right) models when varying both $D^{test}_{\cal I}$ and $D^{test}_{\cal F}$.
  • Figure 4: Heatmaps showing the evolution of errors for the 12L8AH model with Softmax (Left) and SSA (Right) on the "every" task. Model was trained on data in $D_{\cal I}={\mathcal{N}}(0,1)$ for lengths from 11 to 40 and tested in $D^{test}_{\cal I}={\mathcal{N}}(0,\sigma)$ for $\sigma \in \{1,...,10\}$ and lengths from 10 to 200 for each task. Yellow represents a much higher error rate than purple.
  • Figure 5: Plots of the base function $(1+b|x|)^{\mathrm{sgn}(x)\,n}$ for SSA with $b=1$ and $n \in \{1.1, 1.5, 2\}$
  • ...and 6 more figures
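
To make the contrast between Softmax and SSA concrete, the sketch below evaluates the base function from the Figure 5 caption, $(1+b|x|)^{sgn(x)n}$, and turns it into attention-style weights. Only the base function comes from this page. Normalizing by the sum over positions (by analogy with Softmax), the function names `ssa_base` and `ssa_weights`, and the default values of `b` and `n` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(scores):
    """Standard Softmax over the last axis (max-shifted for numerical stability)."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)


def ssa_base(x, b=1.0, n=2.0):
    """Base function from the Figure 5 caption: (1 + b|x|)^(sgn(x) * n).

    It is positive and increasing in x, but grows polynomially rather than
    exponentially. The defaults b=1, n=2 are illustrative choices.
    """
    return (1.0 + b * np.abs(x)) ** (np.sign(x) * n)


def ssa_weights(scores, b=1.0, n=2.0):
    """Hypothetical SSA-style weights.

    ASSUMPTION: the base values are normalized by their sum over the last
    axis, mirroring how Softmax normalizes exp(x). The paper may define the
    normalization differently.
    """
    base = ssa_base(scores, b, n)
    return base / base.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    # One large score among small ones, as when a query value is large.
    scores = np.array([0.0, 1.0, 10.0])
    print("softmax:", softmax(scores))
    print("ssa    :", ssa_weights(scores))
```

Under these assumptions, the largest score still receives the largest weight, but the remaining positions keep a larger share of the mass than under Softmax, which is one way to read the "avoiding Softmax saturation" claim in the TL;DR.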