Minimalist Softmax Attention Provably Learns Constrained Boolean Functions

Jerry Yao-Chieh Hu; Xiwen Zhang; Maojiang Su; Zhao Song; Han Liu

TL;DR

The paper analyzes a minimalist one-head softmax attention model for learning monotone $k$-bit Boolean functions with $k=\Theta(d)$. It proves an upper bound under teacher forcing: with $n=\Omega(d^{\varepsilon})$ samples, a single gradient update identifies the relevant $k$ bits and computes the AND/OR with vanishing error. It also establishes a matching lower bound showing that, without intermediate supervision, any polynomial-time learner fails to recover the $k$-bit subset even with exponentially many samples, highlighting a sharp supervision gap. The results reveal that architectural depth is not the bottleneck; rather, the training regime and auxiliary signals determine learnability, with practical implications for curriculum design and inductive biases in simple attention models. Overall, the work delineates when minimal attention can reason about high-arity Boolean tasks and why carefully designed supervision can unlock its latent capabilities.

Abstract

We study the computational limits of learning $k$-bit Boolean functions (specifically, $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants), using a minimalist single-head softmax-attention mechanism, where $k=\Theta(d)$ relevant bits are selected from $d$ inputs. We show that these simple $\mathrm{AND}$ and $\mathrm{OR}$ functions are unsolvable with a single-head softmax-attention mechanism alone. However, with teacher forcing, the same minimalist attention is capable of solving them. These findings offer two key insights: Architecturally, solving these Boolean tasks requires only minimalist attention, without deep Transformer blocks or FFNs. Methodologically, one gradient descent update with supervision suffices and replaces the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving Boolean problems. Together, the bounds expose a fundamental gap between what this minimal architecture achieves under ideal supervision and what is provably impossible under standard training.
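To make the task concrete, the following is a minimal sketch of how a single softmax-attention readout could evaluate a $k$-bit $\mathrm{AND}$ or $\mathrm{OR}$ once its scores already concentrate on the relevant bits. It is an illustration under stated assumptions, not the paper's construction or training procedure: the scalar-token encoding, the helper names (`sample_task`, `make_scores`, `attend`, `predict_and`, `predict_or`), the score gap of `30.0`, and the thresholds are all hypothetical choices made for this example. Learning the scores is not shown.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_task(d, k, rng):
    """Pick a hidden subset of k relevant positions out of d (hypothetical helper)."""
    return set(rng.sample(range(d), k))


def make_scores(d, relevant, gap=30.0):
    """Assumed 'already learned' attention scores: large on relevant bits, zero elsewhere."""
    return [gap if j in relevant else 0.0 for j in range(d)]


def attend(scores, bits):
    """Single-head softmax attention over d scalar tokens: a weighted average of the bits."""
    weights = softmax(scores)
    return sum(w * b for w, b in zip(weights, bits))


def predict_and(scores, bits, k):
    # With weights near 1/k on each relevant bit, the output is near 1 only if
    # every relevant bit is 1; a single 0 drops it to about 1 - 1/k.
    return int(attend(scores, bits) > 1.0 - 0.5 / k)


def predict_or(scores, bits, k):
    # A single relevant 1 lifts the output to about 1/k; otherwise it stays near 0.
    return int(attend(scores, bits) > 0.5 / k)
```

In this sketch the attention output approximates the mean of the relevant bits, so a threshold halfway between the two nearest achievable means separates the $\mathrm{AND}$ and $\mathrm{OR}$ cases. The hard part that the paper addresses, identifying the relevant subset from data, is replaced here by `make_scores`, which is given the subset directly.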
