HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying

TL;DR

This work tackles the misalignment between standard Euclidean geometry and the hierarchical structure of language by proposing HELM, a family of fully hyperbolic large language models built in the Lorentz space with curvature $K<0$. Key innovations include MiCE, a mixture of curvature experts; HoPE for hyperbolic rotary positional encoding; HMLA for memory-efficient attention; and Hyperbolic RMSNorm, enabling stable large-scale training. Trained on about $5$ billion tokens, HELM variants (HELM-D and HELM-MiCE) outperform comparable Euclidean models on benchmarks such as MMLU and ARC, with improvements up to around $4\%$, and ablations confirm the benefit of distinct-curvature experts. Overall, the results demonstrate that hyperbolic geometry can yield stronger reasoning and hierarchical representation in scalable LLM pretraining, with practical benefits in efficiency and performance across STEM, knowledge, and commonsense tasks.

Abstract

Large language models (LLMs) have shown great success in text modeling tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations. Recent studies have also shown that not respecting the geometry of token embeddings leads to training instabilities and degradation of generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align language models with the underlying geometry of text. We thus propose to operate fully in hyperbolic space, known for its expansive, scale-free, and low-distortion properties. We introduce HELM, a family of HypErbolic Large Language Models, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing set of necessary operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a Mixture-of-Curvature Experts model, HELM-MiCE, where each expert operates in a distinct curvature space to encode more fine-grained geometric structure from text, as well as a dense model, HELM-D. For HELM-MiCE, we further develop hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference. For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures -- up to 4% -- over popular Euclidean architectures used in LLaMA and DeepSeek, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale LM pretraining.

Paper Structure

This paper contains 38 sections, 10 theorems, 38 equations, 4 figures, 9 tables.

Key Result

Proposition 4.1

Let $\mathbf{X}$ be $T$ tokens with $\mathbf{x}_i\in\mathbb{L}^{K,d}$. Let $\mathbf{Q}, \mathbf{K}$ be queries and keys as in Eq:hyp_att. Then $-d^2_\mathcal{L}\left(\mathrm{HoPE}\left(\mathbf{q}_a\right), \mathrm{HoPE}\left(\mathbf{k}_b\right)\right) = g(\mathbf{x}_a, \mathbf{x}_b; a-b)$ for some function $g$; that is, the attention score depends on the positions $a$ and $b$ only through their offset $a-b$.
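
The relative-position property in Proposition 4.1 can be checked numerically with a small sketch. This is not the authors' implementation: it assumes that HoPE applies a standard RoPE-style rotation to the space-like coordinates of a Lorentz vector and then recomputes the time-like coordinate, and it assumes the common convention $d^2_\mathcal{L}(\mathbf{x},\mathbf{y}) = 2/K - 2\langle\mathbf{x},\mathbf{y}\rangle_\mathcal{L}$. The helper names (`lift`, `hope`, `score`) and the frequency base are hypothetical choices for illustration.

```python
import math

K = -1.0  # assumed negative curvature of the Lorentz model


def lift(space, K=K):
    """Prepend the time-like coordinate so that <x, x>_L = 1/K."""
    t = math.sqrt(sum(s * s for s in space) - 1.0 / K)
    return [t] + list(space)


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def sq_lorentz_dist(x, y, K=K):
    """Squared Lorentzian distance (assumed convention, see lead-in)."""
    return 2.0 / K - 2.0 * lorentz_inner(x, y)


def hope(x, pos, base=10000.0, K=K):
    """Hypothetical HoPE: rotate space-like pairs by position-dependent
    angles (as in RoPE), then recompute the time-like coordinate."""
    space = x[1:]
    d = len(space)  # must be even
    rotated = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        u, v = space[i], space[i + 1]
        rotated += [c * u - s * v, s * u + c * v]
    return lift(rotated, K)


def score(q, k, a, b):
    """Attention score: negative squared Lorentzian distance after HoPE."""
    return -sq_lorentz_dist(hope(q, a), hope(k, b))


q = lift([0.3, -1.2, 0.7, 0.5])
k = lift([1.1, 0.4, -0.6, 0.2])

# Same offset a - b = 3 at two different absolute positions.
s_near = score(q, k, 5, 2)
s_far = score(q, k, 13, 10)
print(abs(s_near - s_far) < 1e-9)
```

Under these assumptions the rotation preserves the Euclidean norm of the space-like part, so the time-like coordinate is unchanged and only the relative rotation between query and key enters the score.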

Figures (4)

  • Figure 1: Ricci curvature distribution of token embeddings from decoder-only LLMs showing substantial variation of negative curvature, implying higher local hyperbolicity.
  • Figure 2: MiCE module architecture. Routed experts are selected through a gating module. The tokens are projected from the input manifold to each expert's manifold and then passed through that expert. The outputs of the experts are then projected back to the input manifold and merged through the Lorentzian centroid. This module lets experts learn in distinct curvature spaces, allowing finer-grained geometric representations.
  • Figure 3: HMLA framework. The embeddings are projected into latent space and then upward projected into queries, keys, and values. Additional decoupled queries and a shared key are created for hyperbolic positional encoding through HoPE. The queries and keys are concatenated together before performing hyperbolic self-attention.
  • Figure 5: HELM architecture. The input tokens are mapped to hyperbolic word embeddings before being processed by a series of $L$ decoder blocks, each comprising an attention block and an FFN block. The attention block (blue) can be either hyperbolic self-attention or HMLA, while the FFN block (yellow) can be either an HFFN or a MiCE layer. The output of the decoder blocks is mapped to logits. Residual connections are omitted for brevity.
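
The MiCE data flow described in the Figure 2 caption (project to the expert manifold, apply the expert, project back, merge by Lorentzian centroid) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it assumes that moving a point between Lorentz manifolds of curvature $K_1$ and $K_2$ is a rescaling by $\sqrt{K_1/K_2}$, that the centroid is the weighted sum renormalized onto the manifold, and that the gate weights are already given. The function names and the toy experts are hypothetical, and routing (selecting which experts are active) is omitted.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift(space, K):
    """Prepend the time-like coordinate so that <x, x>_L = 1/K."""
    return [math.sqrt(sum(s * s for s in space) - 1.0 / K)] + list(space)


def change_curvature(x, K_from, K_to):
    """Assumed projection between manifolds: rescale so <x, x>_L = 1/K_to."""
    scale = math.sqrt(K_from / K_to)
    return [scale * v for v in x]


def lorentz_centroid(points, weights, K):
    """Weighted sum of points, renormalized back onto curvature K."""
    dim = len(points[0])
    s = [sum(w * p[i] for w, p in zip(weights, points)) for i in range(dim)]
    norm = math.sqrt(abs(lorentz_inner(s, s)))
    return [v / (math.sqrt(-K) * norm) for v in s]


def mice_layer(x, K_in, expert_curvatures, gate_weights, experts):
    """Hypothetical MiCE forward pass for one token (no routing)."""
    outputs = []
    for K_e, expert in zip(expert_curvatures, experts):
        z = change_curvature(x, K_in, K_e)                # to expert manifold
        z = expert(z, K_e)                                # expert transformation
        outputs.append(change_curvature(z, K_e, K_in))    # back to input manifold
    return lorentz_centroid(outputs, gate_weights, K_in)


# Toy experts: shrink or stretch the space-like part, then re-lift.
def shrink(z, K):
    return lift([0.5 * v for v in z[1:]], K)


def stretch(z, K):
    return lift([1.5 * v for v in z[1:]], K)


K_in = -1.0
x = lift([0.2, -0.4, 0.9], K_in)
y = mice_layer(x, K_in, [-0.5, -2.0], [0.7, 0.3], [shrink, stretch])
print(abs(lorentz_inner(y, y) - 1.0 / K_in) < 1e-9)
```

In this sketch the merged output lies on the input manifold again, so it can be passed on to the next block without a further projection.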

Theorems & Definitions (15)

  • Proposition 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Proposition 4.4
  • Proposition 4.5
  • Proposition
  • proof
  • Proposition
  • proof
  • Proposition
  • ...and 5 more