Scaled Signed Averaging Improves In-Context and Early Learning Benchmark Performance in Small Transformers
Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher
TL;DR
The paper investigates why in-context learning (ICL) in small transformers struggles with semantic and mathematical tasks, and identifies the Softmax scoring function in attention as a contributing factor. It proposes scaled signed averaging (SSA), a flexible, saturation-aware alternative scoring function that can emulate exponential-like weighting while avoiding Softmax saturation. Across controlled ICL tasks involving quantifiers and linear functions, and on standard NLP benchmarks, SSA improves generalization for both decoder-only and encoder-only models, outperforming Softmax baselines in many settings. While it is not a universal fix and the study is constrained by hardware and data, SSA offers a practical, drop-in improvement for small transformers, expanding reliable ICL and NLP capabilities in resource-limited regimes.
Abstract
While the in-context learning (ICL) abilities of large language models have drawn much attention, we examine some of their limitations on semantic tasks involving quantifiers like "all" and "some", as well as on tasks with linear functions. We identify Softmax, the scoring function in the attention mechanism, as a contributing factor to these limitations. We propose scaled signed averaging (SSA), a novel alternative to Softmax, to mitigate these problems. We show that SSA significantly improves performance on our ICL tasks. In addition, models using SSA outperform transformer models with Softmax on several early learning NLP benchmarks and linguistic probing tasks in zero- and few-shot settings.
