Minimalist Softmax Attention Provably Learns Constrained Boolean Functions

Jerry Yao-Chieh Hu, Xiwen Zhang, Maojiang Su, Zhao Song, Han Liu

TL;DR

The paper analyzes a minimalist one-head softmax attention model for learning monotone $k$-bit Boolean functions with $k=\Theta(d)$. It proves an upper bound under teacher forcing: with $n=\Omega(d^{\varepsilon})$ samples, a single gradient update identifies the relevant $k$ bits and computes the AND/OR with vanishing error. It also establishes a matching lower bound showing that, without intermediate supervision, any polynomial-time learner fails to recover the $k$-bit subset even with exponentially many samples, highlighting a sharp supervision gap. The results reveal that architectural depth is not the bottleneck; rather, the training regime and auxiliary signals determine learnability, with practical implications for curriculum design and inductive biases in simple attention models. Overall, the work delineates when minimal attention can reason about high-arity Boolean tasks and why carefully designed supervision can unlock its latent capabilities.

Abstract

We study the computational limits of learning $k$-bit Boolean functions (specifically, $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants), using a minimalist single-head softmax-attention mechanism, where $k=\Theta(d)$ relevant bits are selected from $d$ inputs. We show that these simple $\mathrm{AND}$ and $\mathrm{OR}$ functions are unsolvable with a single-head softmax-attention mechanism alone. However, with teacher forcing, the same minimalist attention is capable of solving them. These findings offer two key insights: Architecturally, solving these Boolean tasks requires only minimalist attention, without deep Transformer blocks or FFNs. Methodologically, one gradient descent update with supervision suffices and replaces the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving Boolean problems. Together, the bounds expose a fundamental gap between what this minimal architecture achieves under ideal supervision and what is provably impossible under standard training.
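To make the task concrete, the following is a minimal sketch of the learning problem described in the abstract: a hidden support of $k$ out of $d$ positions determines an $\mathrm{AND}$ or $\mathrm{OR}$ label. The function name `sample_task`, the uniform input distribution, and the small sizes are illustrative assumptions, not taken from the paper.

```python
import random

def sample_task(d, k, n, seed=0):
    """Draw a hidden support of size k and n labelled examples (hypothetical helper)."""
    rng = random.Random(seed)
    # Hidden subset of relevant positions; the learner does not see it.
    support = sorted(rng.sample(range(d), k))
    # ASSUMPTION: inputs are i.i.d. uniform bits; the paper may use another distribution.
    xs = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
    # Labels depend only on the k relevant bits.
    y_and = [int(all(x[i] for i in support)) for x in xs]
    y_or = [int(any(x[i] for i in support)) for x in xs]
    return support, xs, y_and, y_or

support, xs, y_and, y_or = sample_task(d=8, k=4, n=5)
for x, a, o in zip(xs, y_and, y_or):
    print(x, "AND:", a, "OR:", o)
```

Flipping any bit outside `support` leaves both labels unchanged, which is why recovering the support is the core difficulty of the problem.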

Paper Structure

This paper contains 48 sections, 14 theorems, 97 equations.

Key Result

Theorem 1.1

With intermediate supervision that exposes the Boolean label during training, the initial gradient already aligns with the indicator of the true feature subset. A single gradient update is enough to drive the model’s attention weights to the correct $k$ positions, yielding vanishing classification e
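The idea that one gradient step can already point at the relevant positions can be illustrated with a toy experiment. This is a sketch under strong assumptions and not the paper's model or proof: the attention is reduced to a score vector `w` over the $d$ positions, the loss is a squared error, and the intermediate signal is assumed to be the mean of the relevant bits. All names (`one_step_support_recovery`, `lr`, `target`) are hypothetical.

```python
import random

def one_step_support_recovery(d=16, k=4, n=20000, lr=1.0, seed=0):
    """Toy check: does one gradient step from w = 0 rank the true support first?"""
    rng = random.Random(seed)
    support = set(rng.sample(range(d), k))
    p = 1.0 / d  # softmax(w) is uniform at w = 0
    grad = [0.0] * d
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in range(d)]
        pred = p * sum(x)  # softmax(w) . x evaluated at w = 0
        # ASSUMPTION: the intermediate supervision is the mean of the relevant bits.
        target = sum(x[i] for i in support) / k
        err = pred - target
        for j in range(d):
            # Gradient of 0.5 * (softmax(w) . x - target)^2 w.r.t. w_j at w = 0.
            grad[j] += err * p * (x[j] - pred) / n
    # Single gradient-descent update starting from w = 0.
    w = [-lr * g for g in grad]
    top_k = set(sorted(range(d), key=lambda j: w[j], reverse=True)[:k])
    return support, top_k, w

support, top_k, w = one_step_support_recovery()
print("true support:", sorted(support))
print("top-k scores:", sorted(top_k))
```

In this toy setting the expected update is positive on the relevant coordinates and negative elsewhere, so the largest $k$ scores mark the support once $n$ is large enough to suppress sampling noise. Without the intermediate target there is no such per-coordinate signal in this sketch, which loosely mirrors the supervision gap the theorems formalize.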

Theorems & Definitions (32)

  • Theorem 1.1: Upper bound (Efficient Learnability with Teacher Forcing), Informal Version of Theorem 4.1
  • Theorem 1.2: Lower bound (Intractability under End-to-End Training), Informal Version of Theorem 4.2
  • Remark 3.1
  • Definition 3.2: Learning $k$-bit Boolean Functions
  • Remark 3.3: Learning Support vs. Learning Output
  • Theorem 4.1: Upper Bound: Softmax Attention Provably Solves Definition 3.2 with Teacher Forcing
  • proof : Proof Sketch
  • Theorem 4.2: Hardness of Finite-Sample Boolean
  • proof
  • Claim 4.3
  • ...and 22 more