CAT: Curvature-Adaptive Transformers for Geometry-Aware Learning

Ryan Y. Lin, Siddhartha Ojha, Nicholas Bai

TL;DR

The paper addresses the implicit Euclidean assumption in Transformer attention by introducing the Curvature-Adaptive Transformer (CAT), a model that learns per-token routing across Euclidean, hyperbolic, and spherical attention branches. Each branch uses manifold operations specific to its geometry, and the branch outputs are fused through a differentiable routing mechanism, which allows end-to-end optimization. On the knowledge graph completion benchmarks FB15k-237 and WN18RR, CAT achieves approximately 10% relative improvements in MRR and Hits@10 with only about 5% additional parameters and comparable inference time. These results indicate that learned mixtures of geometries can outperform any single fixed geometry, and they provide an interpretable foundation for geometry-aware learning across modalities.

Abstract

Transformers achieve strong performance across diverse domains but implicitly assume Euclidean geometry in their attention mechanisms, limiting their effectiveness on data with non-Euclidean structure. While recent extensions to hyperbolic and spherical spaces show promise for hierarchical and cyclical patterns, respectively, they require committing to a single geometry a priori, reducing flexibility when data exhibits mixed geometric properties. We introduce the Curvature-Adaptive Transformer (CAT), a novel architecture that dynamically learns per-token routing across three geometric attention branches through a lightweight, differentiable gating mechanism. Unlike fixed-geometry approaches, CAT enables adaptive geometric specialization, routing tokens to the appropriate curvature based on their local relational structure. The routing network provides interpretable curvature preferences while each branch employs geometry-specific operations optimized for its respective manifold. On knowledge graph completion benchmarks (FB15k-237, WN18RR), CAT achieves approximately 10% improvements in MRR and Hits@10 over fixed-geometry baselines with minimal overhead (5% parameter increase, comparable inference time). These results demonstrate that learned geometric adaptation outperforms any single fixed geometry for complex relational reasoning, establishing CAT as a scalable and interpretable foundation for mixture-of-geometry architectures across language, vision, and multimodal domains.
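
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how MRR, Hits@10, and a relative improvement are conventionally computed from the rank of the correct entity in each test query. The function names and the example ranks and scores are hypothetical and chosen only for illustration; they are not taken from the paper's results or evaluation code.

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all queries (rank 1 is best)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.10 means 10% relative."""
    return (new - baseline) / baseline


# Illustrative ranks for four queries (made-up values, not from the paper).
example_ranks = [1, 2, 4, 20]
example_mrr = mrr(example_ranks)             # (1 + 1/2 + 1/4 + 1/20) / 4, about 0.45
example_hits10 = hits_at_k(example_ranks)    # three of four ranks are <= 10

# A made-up baseline MRR of 0.30 and a new MRR of 0.33 give a 10% relative gain.
example_gain = relative_improvement(0.33, 0.30)
```

Under this convention, a "10% relative improvement" means the metric is multiplied by about 1.10, not raised by 10 absolute points.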

Paper Structure

This paper contains 27 sections, 13 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: CATBlock architecture: Input flows through routing MLP to three parallel geometry-specific branches, combined via learned per-token weights.
  • Figure 2: Ternary heatmap visualizing routing weights across geometries on FB15k-237. Tokens concentrate towards Euclidean, with noticeable Euclidean–Hyperbolic mixtures. Spherical contributions remain negligible for this particular dataset.
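
The fusion step described in the Figure 1 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes the routing MLP emits three logits per token, that the gate is a softmax over those logits, and that the three branch outputs have already been mapped into a common vector space before mixing. The branch outputs here are placeholder vectors; the geometry-specific attention itself is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(routing_logits, euclidean_out, hyperbolic_out, spherical_out):
    """Mix three per-token branch outputs with per-token softmax weights.

    routing_logits: one [euclidean, hyperbolic, spherical] logit triple per token
                    (assumed output of a hypothetical routing MLP).
    *_out:          one feature vector per token from each branch, assumed to
                    live in a shared space so they can be added.
    Returns (fused, weights), both indexed by token.
    """
    fused, weights = [], []
    for logits, e, h, s in zip(routing_logits, euclidean_out,
                               hyperbolic_out, spherical_out):
        w = softmax(logits)  # convex weights, sum to 1 for this token
        weights.append(w)
        fused.append([w[0] * ei + w[1] * hi + w[2] * si
                      for ei, hi, si in zip(e, h, s)])
    return fused, weights


# Two tokens with 2-dimensional placeholder branch outputs (made-up values).
logits = [[0.0, 0.0, 0.0],    # token 0: no preference, uniform mixture
          [10.0, 0.0, 0.0]]   # token 1: strongly prefers the Euclidean branch
euc = [[3.0, 0.0], [1.0, 1.0]]
hyp = [[0.0, 3.0], [5.0, 5.0]]
sph = [[0.0, 0.0], [9.0, 9.0]]
fused, weights = fuse_branches(logits, euc, hyp, sph)
```

Each row of `weights` is a point on the probability simplex over the three geometries, which is the kind of quantity a ternary heatmap such as Figure 2 can display.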