Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li, Davoud Ataee Tarzanagh, Ankit Singh Rawat, Maryam Fazel, Samet Oymak
TL;DR
This work studies how gating in Gated Linear Attention (GLA) shapes in-context learning (ICL) by connecting gating-induced token weights to Weighted Preconditioned Gradient Descent (WPGD). It proves that a multilayer GLA can implement WPGD with data-dependent weights, and it introduces a data model with multitask prompts to analyze the optimization landscape of learning a WPGD algorithm, establishing the existence and uniqueness (up to scaling) of a global minimum under mild conditions. The results show that gating enables context-aware weighting that can outperform vanilla linear attention in suitable regimes, including with vector gating, and that deeper GLA architectures correspond to more WPGD steps, which improves performance in multitask ICL. Together, these theoretical and empirical findings offer a principled explanation of gating as a mechanism for efficient, context-sensitive learning in linear-attention architectures, with practical implications for designing scalable, gate-aware attention for recurrent decoding and ICL systems.
Abstract
Linear attention methods offer a compelling alternative to softmax attention due to their efficiency in recurrent decoding. Recent research has focused on enhancing standard linear attention by incorporating gating while retaining its computational benefits. Such Gated Linear Attention (GLA) architectures include competitive models such as Mamba and RWKV. In this work, we investigate the in-context learning capabilities of the GLA model and make the following contributions. We show that a multilayer GLA can implement a general class of Weighted Preconditioned Gradient Descent (WPGD) algorithms with data-dependent weights. These weights are induced by the gating mechanism and the input, enabling the model to control the contribution of individual tokens to prediction. To further understand the mechanics of this weighting, we introduce a novel data model with multitask prompts and characterize the optimization landscape of learning a WPGD algorithm. Under mild conditions, we establish the existence and uniqueness (up to scaling) of a global minimum, corresponding to a unique WPGD solution. Finally, we translate these findings to explore the optimization landscape of GLA and shed light on how gating facilitates context-aware learning and when it is provably better than vanilla linear attention.
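To make the "gating is weighting" claim concrete, the following is a minimal sketch, not the paper's construction. It assumes a single layer, scalar gates, scalar labels, keys equal to the input features, and a fixed preconditioner `P`; all function names are hypothetical. Under these assumptions, unrolling the gated recurrence shows that each token's contribution is scaled by the product of the gates applied after it, so the readout coincides with one step of weighted preconditioned gradient descent from a zero initialization.

```python
import numpy as np

def gated_linear_attention_readout(X, y, gates, x_query, P):
    """Scalar-gated linear-attention recurrence followed by a linear readout.

    Assumed simplification: state_i = g_i * state_{i-1} + y_i * x_i,
    and the prediction is x_query^T P state_n.
    """
    state = np.zeros(X.shape[1])
    for x_i, y_i, g_i in zip(X, y, gates):
        state = g_i * state + y_i * x_i  # decay the old state, add the new token
    return x_query @ P @ state

def gating_weights(gates):
    """Weights induced by the gates: omega_i is the product of all gates after token i."""
    n = len(gates)
    omega = np.ones(n)
    for i in range(n - 2, -1, -1):
        omega[i] = omega[i + 1] * gates[i + 1]
    return omega

def wpgd_one_step(X, y, omega, x_query, P):
    """One WPGD step from beta = 0 on the weighted squared loss.

    The gradient of 0.5 * sum_i omega_i * (y_i - x_i^T beta)^2 at beta = 0
    is -X^T (omega * y), so the preconditioned step gives beta = P X^T (omega * y).
    """
    beta = P @ X.T @ (omega * y)
    return x_query @ beta

# Illustrative data; the sizes and gate range are arbitrary choices.
rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
gates = rng.uniform(0.5, 1.0, size=n)
P = np.eye(d) / n
x_query = rng.normal(size=d)

pred_gla = gated_linear_attention_readout(X, y, gates, x_query, P)
pred_wpgd = wpgd_one_step(X, y, gating_weights(gates), x_query, P)
print(np.allclose(pred_gla, pred_wpgd))
```

With all gates equal to 1 the weights are uniform and the sketch reduces to ungated linear attention, i.e. a plain preconditioned gradient step. In this simplified picture, data-dependent gates are what allow the model to down-weight tokens that are less relevant to the query.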
