Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms

Baran Hashemi, Kurt Pasque, Chris Teska, Ruriko Yoshida

TL;DR

The paper addresses the mismatch between Softmax attention and the polyhedral, DP-style reasoning central to combinatorial optimization. It introduces Tropical Attention, mapping queries/keys/values into tropical projective space and using the tropical Hilbert metric to perform max-plus style aggregation, yielding a polyhedral, 1-Lipschitz attention core. The authors prove that Multi-Head Tropical Attention (MHTA) universally approximates max-plus dynamic programs and realizes tropical transitive closure with polynomial resources, while empirically achieving strong out-of-distribution generalization, robustness to perturbations, and faster inference across NP-hard/complete problems. This work advances neural algorithmic reasoning by enabling sharper, more expressive Large Reasoning Models capable of tackling discrete optimization tasks across domains such as cryptography, phylogenetics, and physics.

Abstract

Can algebraic geometry enhance the sharpness, robustness, and interpretability of modern neural reasoning models by equipping them with a mathematically grounded inductive bias? To answer this, we introduce Tropical Attention, an attention mechanism grounded in tropical geometry that lifts the attention kernel into tropical projective space, where reasoning is piecewise-linear and 1-Lipschitz, thus preserving the polyhedral decision structure inherent to combinatorial reasoning. We prove that Multi-Head Tropical Attention (MHTA) stacks universally approximate tropical circuits and realize tropical transitive closure through composition, achieving polynomial resource bounds without invoking recurrent mechanisms. These guarantees explain why the induced polyhedral decision boundaries remain sharp and scale-invariant, rather than smoothed by Softmax. Empirically, we show that Tropical Attention delivers stronger out-of-distribution generalization in both length and value, with high robustness against perturbative noise, and substantially faster inference with fewer parameters compared to Softmax-based and recurrent attention baselines. For the first time, we extend neural algorithmic reasoning beyond PTIME problems to NP-hard and NP-complete problems, paving the way toward sharper and more expressive Large Reasoning Models (LRMs) capable of tackling complex combinatorial challenges in phylogenetics, cryptography, particle physics, and mathematical discovery.
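
To make the abstract's "max-plus style aggregation" concrete, here is a minimal pure-Python sketch. It uses the standard Hilbert projective metric on real vectors, $d_H(x,y)=\max_i(x_i-y_i)-\min_i(x_i-y_i)$, which is unchanged when a constant is added to every coordinate. Scoring keys by negative distance and aggregating values with max and plus is an assumption made for illustration, as are the function names `hilbert_dist` and `tropical_attention`; this is not the authors' implementation.

```python
def hilbert_dist(x, y):
    """Hilbert projective distance: max_i(x_i - y_i) - min_i(x_i - y_i)."""
    diffs = [a - b for a, b in zip(x, y)]
    return max(diffs) - min(diffs)


def tropical_attention(queries, keys, values):
    """Illustrative max-plus aggregation (hypothetical helper).

    Each key is scored by the negative Hilbert distance to the query, and
    each output coordinate is max_j(score_j + value_j[c]): 'max' plays the
    role of a sum and '+' the role of a product.
    """
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [-hilbert_dist(q, k) for k in keys]
        out.append([max(s + v[c] for s, v in zip(scores, values))
                    for c in range(dim)])
    return out


# Adding a constant to every coordinate leaves the distance unchanged.
print(hilbert_dist([1, 2, 3], [0, 0, 0]), hilbert_dist([6, 7, 8], [0, 0, 0]))

# One query, two keys, one-dimensional values.
print(tropical_attention([[0, 1]], [[0, 1], [3, 0]], [[10], [20]]))
```

Because the output is a maximum of shifted values rather than a weighted average, it is piecewise linear in its inputs, and perturbing every value by at most $\epsilon$ moves each output coordinate by at most $\epsilon$.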

Paper Structure

This paper contains 28 sections, 9 theorems, 24 equations, 3 figures, 5 tables, 1 algorithm.

Key Result

Lemma 3.1

For every embedded coordinate $i\in[N]$, the function where $\phi_\lambda$ is a (projective) valuation map. Hence the shifted map $\widetilde{v}(x)=v_\lambda(x)+\lambda=\log(\max(0,x))$ is an Archimedean valuation in the classical sense, and $\Phi$ is a matrix‑valued valuation modulo tropical scalars; its image lies in the tropical simplex.
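
The shifted map in Lemma 3.1 can be checked numerically. The sketch below takes $v_\lambda(x)=\log(\max(0,x))-\lambda$, which is the form implied by the identity $v_\lambda(x)+\lambda=\log(\max(0,x))$ stated in the lemma, and sends non-positive inputs to $-\infty$. The two properties it checks (products become sums, and sums are bounded by a maximum plus $\log 2$) are the usual behavior of a logarithmic absolute value; treating them as the properties the lemma relies on is an assumption, and the function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def v_lambda(x, lam):
    """log(max(0, x)) - lam, with non-positive x mapped to -infinity."""
    return math.log(x) - lam if x > 0 else NEG_INF


def v_shifted(x, lam):
    """The shifted map v_lambda(x) + lam, which should equal log(max(0, x))."""
    return v_lambda(x, lam) + lam


lam = 1.5
x, y = 2.0, 3.0

# The shift cancels lambda (up to floating-point rounding).
print(math.isclose(v_shifted(x, lam), math.log(x)))

# Multiplication becomes addition.
print(math.isclose(v_shifted(x * y, lam), v_shifted(x, lam) + v_shifted(y, lam)))

# Addition is bounded by the maximum plus log(2).
print(v_shifted(x + y, lam) <= max(v_shifted(x, lam), v_shifted(y, lam)) + math.log(2))
```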

Figures (3)

  • Figure 1: (top) Tropical Attention produces sharp attention maps when learning the notorious Quickselect algorithm, showing size-invariant behavior and OOD length generalization far beyond the training length ($8 \rightarrow 1024$). In contrast, both (middle) adaptive-softmax and (bottom) vanilla-softmax heads dilute and disperse uniformly as sequence length grows, failing to generalize. Each column evaluates the models on a new batch of independently drawn inputs of increasing length. Because the position of the target $k$-th element differs in each batch, the attention pattern changes to reflect the new data.
  • Figure 2: Stacked attention head representations for Quickselect under (a) Vanilla, (b) Adaptive, and (c) Tropical models. Each model was trained on length-8 sequences and evaluated, from left to right, on sequences of length 16 to 1024. Each image was generated from a batch of 32 inputs. The columns are the 8 largest keys by $\ell_2$-norm. Heatmap values give the attention of the row item to the column key.
  • Figure 3: (top) Tropical Attention produces sharp attention maps when learning the Knapsack algorithm, showing size-invariant behavior and OOD length generalization far beyond the training length ($16 \rightarrow 1024$). In contrast, both (middle) adaptive-softmax and (bottom) vanilla-softmax heads dilute and disperse as sequence length grows, failing to generalize.

Theorems & Definitions (23)

  • Lemma 3.1
  • Definition 3.1: Multi‑head Tropical Attention (MHTA)
  • Theorem 3.2: Simulation of max–plus dynamic programs
  • Remark 3.1
  • Definition B.1
  • Definition B.2
  • Definition B.3
  • Definition B.4
  • Proof: Proof of Lemma 3.1
  • Lemma C.1: Head–level Weighted $\oplus$ gate
  • ...and 13 more
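
For readers unfamiliar with the objects named in Theorem 3.2 and in the abstract, the following sketch shows a classical max-plus dynamic program: the tropical transitive closure of a weight matrix, computed by repeated max-plus squaring. On a weighted DAG, entry $(i,j)$ of the closure is the weight of the heaviest path from $i$ to $j$. This is the textbook algorithm and not the MHTA construction from the paper; the names `maxplus_matmul` and `tropical_closure` are hypothetical, and the sketch assumes the graph has no positive-weight cycles.

```python
NEG_INF = float("-inf")  # tropical "zero": the identity element for max


def maxplus_matmul(A, B):
    """Max-plus matrix product: C[i][j] = max_k (A[i][k] + B[k][j])."""
    inner = len(B)
    return [[max(A[i][k] + B[k][j] for k in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def tropical_closure(A):
    """Best path weights over paths with at most n - 1 edges.

    Starts from I (+) A, where the tropical identity I has 0 on the
    diagonal and -inf elsewhere, then squares until all such paths
    are covered.
    """
    n = len(A)
    C = [[max(A[i][j], 0.0 if i == j else NEG_INF) for j in range(n)]
         for i in range(n)]
    covered = 1  # C currently accounts for paths with <= 1 edge
    while covered < n - 1:
        C = maxplus_matmul(C, C)
        covered *= 2
    return C


# DAG with edges 0->1 (weight 2), 1->2 (weight 3), 0->2 (weight 4).
A = [[NEG_INF, 2.0, 4.0],
     [NEG_INF, NEG_INF, 3.0],
     [NEG_INF, NEG_INF, NEG_INF]]
closure = tropical_closure(A)
print(closure[0][2])  # heaviest 0->2 path: max(4, 2 + 3) = 5.0
```

Every step is a maximum of sums, so the whole computation is a piecewise-linear function of the edge weights, which is the kind of polyhedral structure the abstract refers to.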