Hyperbolic Fine-Tuning for Large Language Models

Menglin Yang; Ram Samarth B B; Aosong Feng; Bo Xiong; Jihong Liu; Irwin King; Rex Ying

TL;DR

This work investigates whether Euclidean token spaces are optimal for large language models and uncovers strong hyperbolic, tree-like structures in token embeddings, with high-frequency tokens clustering near the origin and low-frequency tokens lying farther out. Building on this insight, the authors introduce HypLoRA, a hyperbolic, parameter-efficient fine-tuning method that performs low-rank adaptation directly on the hyperbolic manifold via a Direct Lorentz Low-Rank Transformation, preserving geometric properties while remaining computationally efficient. The paper provides both global (power-law exponent $\gamma$ of the token frequency distribution) and local ($\delta$-hyperbolicity) analyses and establishes a theoretical link between token frequency distributions and hyperbolic curvature. Extensive experiments on arithmetic and commonsense reasoning across multiple base models demonstrate that HypLoRA yields consistent gains over Euclidean LoRA and other adapters, validating the practical value of incorporating hyperbolic inductive biases into PEFT. Overall, the work offers a principled approach to aligning fine-tuning with the intrinsic geometry of language, enabling more effective reasoning with modest additional computational cost ($O(r(d+k))$) and a similar memory footprint.
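The pipeline named in the TL;DR (map into hyperbolic space, apply a low-rank transformation on the manifold, map back) can be sketched in plain Python as follows. This is a minimal sketch under assumptions not stated in this summary: the Lorentz model with curvature $-1/K$, the standard exponential and logarithmic maps at the origin, and a low-rank map that acts on the space-like coordinates and then recomputes the time-like coordinate. All function names are hypothetical, and the authors' implementation may differ.

```python
import math

def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def lorentz_inner(p, q):
    """Lorentzian inner product: -p0*q0 + <p_s, q_s>."""
    return -p[0] * q[0] + sum(a * b for a, b in zip(p[1:], q[1:]))

def exp_map_origin(x, K=1.0):
    """Assumed: map a Euclidean vector onto the hyperboloid <p, p>_L = -K."""
    n, sk = norm(x), math.sqrt(K)
    if n == 0.0:
        return [sk] + [0.0] * len(x)
    scale = sk * math.sinh(n / sk) / n
    return [sk * math.cosh(n / sk)] + [scale * xi for xi in x]

def lorentz_low_rank(A, B, p, K=1.0):
    """Assumed form of the low-rank step: transform the space-like part
    with B @ A, then recompute the time-like coordinate so the result
    stays on the hyperboloid."""
    ys = matvec(B, matvec(A, p[1:]))
    y0 = math.sqrt(sum(y * y for y in ys) + K)
    return [y0] + ys

def log_map_origin(p, K=1.0):
    """Assumed: map a hyperboloid point back to a Euclidean vector."""
    ps = p[1:]
    n, sk = norm(ps), math.sqrt(K)
    if n == 0.0:
        return [0.0] * len(ps)
    scale = sk * math.asinh(n / sk) / n
    return [scale * v for v in ps]

def hyperbolic_low_rank_update(x, A, B, K=1.0):
    """Euclidean in, Euclidean out; the low-rank map runs on the manifold."""
    return log_map_origin(lorentz_low_rank(A, B, exp_map_origin(x, K), K), K)
```

With `A` of shape $r \times d$ and `B` of shape $k \times r$, the two matrix-vector products cost $O(r(d+k))$ per token, which is consistent with the cost quoted above; the maps themselves add only $O(d+k)$ work.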

Abstract

Large language models (LLMs) have demonstrated remarkable performance across various tasks. However, it remains an open question whether the default Euclidean space is the most suitable choice for LLMs. In this study, we investigate the geometric characteristics of LLMs, focusing specifically on tokens and their embeddings. Our findings reveal that token frequency follows a power-law distribution, where high-frequency tokens (e.g., the, that) constitute the minority, while low-frequency tokens (e.g., apple, dog) constitute the majority. Furthermore, high-frequency tokens cluster near the origin, whereas low-frequency tokens are positioned farther away in the embedding space. Additionally, token embeddings exhibit hyperbolic characteristics, indicating a latent tree-like structure within the embedding space. Motivated by these observations, we propose HypLoRA, an efficient fine-tuning approach that operates in hyperbolic space to better exploit these underlying hierarchical structures. HypLoRA performs low-rank adaptation directly in hyperbolic space, thereby preserving hyperbolic modeling capabilities throughout the fine-tuning process. Extensive experiments across various base models and reasoning benchmarks, specifically arithmetic and commonsense reasoning tasks, demonstrate that HypLoRA substantially improves LLM performance.
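One common way to quantify how tree-like a set of embeddings is uses Gromov's four-point condition. The sketch below is a hedged illustration, not the authors' measurement code: it assumes Euclidean distances between embedding vectors, a brute-force scan over all quadruples (practical only for small samples), and the relative normalization $\delta_{rel} = 2\delta / \mathrm{diam}$, which is an assumption here. The function names are hypothetical.

```python
import math
from itertools import combinations

def four_point_delta(points):
    """Gromov delta via the four-point condition: for every quadruple,
    take the three pairwise distance sums and halve the gap between
    the two largest. Returns the maximum over all quadruples."""
    delta = 0.0
    for x, y, z, w in combinations(points, 4):
        sums = sorted([
            math.dist(x, y) + math.dist(z, w),
            math.dist(x, z) + math.dist(y, w),
            math.dist(x, w) + math.dist(y, z),
        ])
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return delta

def relative_delta(points):
    """Scale-free variant (assumed normalization): 2 * delta / diameter."""
    diam = max(math.dist(p, q) for p, q in combinations(points, 2))
    return 2.0 * four_point_delta(points) / diam
```

A value of 0 means the points embed exactly in a tree metric; points on a line give 0, while the four corners of a unit square give $\delta = \sqrt{2} - 1$. Smaller values for sampled token embeddings would support the tree-like structure the abstract describes.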

Paper Structure (30 sections, 1 theorem, 35 equations, 6 figures, 13 tables)

Introduction
Related Work
Preliminary
Investigation
Global Token Statistics
$\delta$-Hyperbolicity of Local Token Embeddings
Connection between Power-law Distribution and Hyperbolic Geometry
Hyperbolic Fine-Tuning for LLMs
Experimental Settings
Experimental Results
Conclusion
More Investigation Results
Token Frequency and Norm Distribution on Mathematical Reasoning
Token Frequency and Norm Distribution on Commonsense Reasoning
Hyperbolicity in the Final Hidden Layer of LLMs
...and 15 more sections

Key Result

Proposition 1

Let $\mathbf{x} \in \mathbb{R}^d$ denote the input token embeddings. The HypLoRA adaptation, applied to $\mathbf{x}$, involves a sequence of projection into hyperbolic space, a Direct Lorentz Low-Rank Transformation (LLR), and projection back to Euclidean space. Due to the non-linear nature of these

Figures (6)

Figure 1: Token frequency distribution and token frequency vs. norm analysis for GSM8K (Group 1) and AQuA (Group 2) datasets in LLaMA3-8B. For each group, the left panels show the token frequency distributions (power-law distribution), while the right panels illustrate the relationship between token frequency and the corresponding norms. This visualization reveals the underlying geometric structure of the token embeddings. For additional data analysis and visualizations, please refer to Appendix \ref{sec:appendix_more_investigation}.
Figure 2: GPU (A100) usage during inference
Figure 3: Token frequency distribution (top row) and token frequency vs. norm (bottom row) across different mathematical reasoning datasets in LLaMA3. The top row shows the power-law distribution of token frequencies with the decay rate ($\gamma$) annotated for each dataset. The bottom row illustrates the relationship between token frequency and token norm, binned and colored by frequency, where higher token norms correspond to lower frequencies.
Figure 4: Token frequency distribution (top two rows) and token frequency vs. norm (bottom two rows) across different commonsense reasoning datasets in LLaMA3. The top two rows show the power-law distribution of token frequencies with the decay rate ($\gamma$) annotated for each dataset. The bottom two rows illustrate the relationship between token frequency and token norm, binned and colored by frequency, where higher token norms correspond to lower frequencies.
Figure 5: Results for varying curvature $K$ on the Gemma3-4B model
...and 1 more figure

Theorems & Definitions (3)

Proposition 1
Proof
Remark 1
