Reasoning in Large Language Models: A Geometric Perspective

Romain Cosentino, Sarath Shekkizhar

TL;DR

This work establishes a connection between the expressive power of LLMs and the density of their self-attention graphs, and demonstrates that the density of these graphs defines the intrinsic dimension of the inputs to the MLP blocks.

Abstract

The advancement of large language models (LLMs) for real-world applications hinges critically on enhancing their reasoning capabilities. In this work, we explore the reasoning abilities of LLMs from a geometric perspective. We establish a connection between the expressive power of LLMs and the density of their self-attention graphs. Our analysis demonstrates that the density of these graphs defines the intrinsic dimension of the inputs to the MLP blocks. We demonstrate through theoretical analysis and toy examples that a higher intrinsic dimension implies a greater expressive capacity of the LLM. We further provide empirical evidence linking this geometric framework to recent advancements in methods aimed at enhancing the reasoning capabilities of LLMs.

Paper Structure

This paper contains 12 sections, 1 theorem, 5 equations, and 8 figures.

Key Result

Theorem 2.1

The $i^{\rm th}$ row of the MHA mapping output (eq:multihead) lives in the Minkowski sum of single-head convex hulls as $({\rm MHA}^{(\ell)}({\bm{X}})_{i,.})^\top \in {\mathbb{H}}^{(\ell)}_1(i) + \dots + {\mathbb{H}}^{(\ell)}_H(i)$ where ${\mathbb{H}}^{(\ell)}_h(i) \triangleq {\rm Hull}\left\{({\bm

Figures (8)

  • Figure 1: Continuous Piece-wise Affine view of MLP. 2-dimensional visualization of the input space partitioning induced by a one-hidden-layer MLP randomly initialized using a standard initialization, with bias (Left) and with zero bias (Right). Each region, depicted by a particular color and bounded by black lines, has a set of CPA parameters $A_{\omega}, B_{\omega}$ described in \ref{eq:cpa}. These parameters depend on the per-layer affine parameters and the state of the nonlinearities in the region $\omega$.
  • Figure 2: DNN approximation & induced number of input space regions. The ground truth and the approximation of a sine function by an MLP (Top), the number of associated regions the MLP induces in its input space (Middle), and the approximation error (Bottom). We present results for a $1$-hidden-layer MLP with $50$ neurons (Left) and $500$ neurons (Right). The model departs from its linear behavior, with the DNN introducing a new region, whenever the MLP mapping changes direction. Each new region comes with its own affine mapping as per \ref{eq:cpa}, and the approximation is finer where the number of regions is higher, as seen in the wider MLP with $500$ neurons. The crucial advantage of DNNs is their ability to adapt the positioning of these regions and learn data-driven partitions.
  • Figure 3: Number of regions as a function of input dimension. Upper bound on the number of regions spanned by a $1$-hidden-layer MLP ($50$, $100$, and $500$ neurons) as a function of the intrinsic dimension of the input space. We observe that increasing the intrinsic dimension increases the number of regions exponentially. As such, for a given number of neurons, one can artificially increase the number of regions by increasing the intrinsic dimension of the input space. This is a crucial component in understanding why increasing the size of the prompt via many-shot or CoT prompting induces better reasoning capabilities in LLMs, and it is the central point of \ref{sec:LLM} as well as \ref{sec:EXP}.
  • Figure 4: LLM approximation & induced number of input space regions. Visualization of the approximation of $\sin(t)$ ($1000$ time bins) by a $1$-block LLM, i.e., embedding $\rightarrow$ attention block (as in \ref{eq:multihead}) $\rightarrow$ $1$-hidden-layer MLP. We display the approximation of the sine function together with the number of regions induced by the MLP in the input space for different numbers of heads and context lengths: (Top Left) context length $10$ and $1$ head, (Top Right) context length $10$ and $10$ heads, (Bottom Left) context length $100$ and $1$ head, (Bottom Right) context length $100$ and $10$ heads. We observe that both the context length and the number of heads increase the number of regions spanned by the MLP in the input space, which improves the approximation capabilities of the LLM. This result agrees with our geometrical description.
  • Figure 5: LLM input space regions. (Left) Number of regions induced by the MLP block in the input space of the LLM as a function of the number of attention heads and the context length. (Right) Zoom in on two rows of the left figure, for $5$ and $10$ attention heads. We observe that increasing either the number of attention heads or the context length increases the number of regions, which, as mentioned, leads to better approximation properties. It is important to note that, while changing the number of attention heads can be tedious and requires pre-training or fine-tuning, one can seamlessly vary the context length. There is therefore a way to improve LLM approximation capability without modifying the weights of the model.
  • ...and 3 more figures
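
The growth described in the Figure 3 caption can be reproduced with a short sketch. It assumes the upper bound is the classical hyperplane-arrangement count, $\sum_{k=0}^{d}\binom{N}{k}$ for $N$ hidden neurons and input intrinsic dimension $d$. The summary above does not state which formula the paper uses, so this choice and the function name `max_regions` are illustrative assumptions.

```python
from math import comb


def max_regions(num_neurons: int, input_dim: int) -> int:
    """Maximum number of regions that `num_neurons` hyperplanes can cut
    out of a space of dimension `input_dim` (hyperplane-arrangement bound).

    Assumption: each hidden ReLU neuron of a 1-hidden-layer MLP contributes
    one hyperplane, and the input lives in a space of dimension `input_dim`.
    """
    # math.comb returns 0 when k > num_neurons, so the sum saturates at
    # 2 ** num_neurons once input_dim >= num_neurons.
    return sum(comb(num_neurons, k) for k in range(input_dim + 1))


if __name__ == "__main__":
    # Same widths as in the caption; the dimensions are arbitrary examples.
    for width in (50, 100, 500):
        bounds = [max_regions(width, dim) for dim in (1, 2, 5, 10)]
        print(f"{width} neurons:", bounds)
```

For a fixed width, the bound grows rapidly with the dimension, which is the mechanism the caption appeals to: raising the intrinsic dimension of the MLP input raises the ceiling on the number of regions without adding neurons.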

Theorems & Definitions (1)

  • Theorem 2.1: causal multi-head Minkowski sum (balestriero2023characterizing)
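
A minimal numerical sketch of the idea behind Theorem 2.1 follows. Because the theorem statement above is truncated, the sketch makes its own assumptions: each head's value vectors already include the output projection, the softmax for row $i$ runs over positions $0$ to $i$ (causal masking), and the multi-head output row is the sum of the per-head rows. The helpers `softmax`, `causal_head_row`, and `mha_row` are hypothetical names, not the paper's code.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_head_row(scores_row, values, i):
    """Row i of one causal attention head.

    The weights are nonnegative and sum to 1, so the output is a convex
    combination of values[0..i], i.e. a point of their convex hull.
    """
    weights = softmax(scores_row[: i + 1])
    dim = len(values[0])
    out = [sum(w * values[j][d] for j, w in enumerate(weights)) for d in range(dim)]
    return weights, out


def mha_row(per_head_scores, per_head_values, i):
    """Sum of per-head rows: a point of the Minkowski sum of the head hulls
    (assuming the output projection is folded into each head's values)."""
    dim = len(per_head_values[0][0])
    total = [0.0] * dim
    for scores_row, values in zip(per_head_scores, per_head_values):
        _, out = causal_head_row(scores_row, values, i)
        total = [t + o for t, o in zip(total, out)]
    return total


if __name__ == "__main__":
    random.seed(0)
    tokens, dim, heads, i = 6, 3, 2, 3
    values = [[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(tokens)]
              for _ in range(heads)]
    scores = [[random.gauss(0, 1) for _ in range(tokens)] for _ in range(heads)]
    weights, _ = causal_head_row(scores[0], values[0], i)
    print("head 0 weights:", weights, "sum:", sum(weights))
    print("MHA row:", mha_row(scores, values, i))
```

Under these assumptions, an attention row that concentrates its weight on a few tokens uses only a few hull vertices, which gives an intuition for why the density of the attention graph relates to the dimension of the MLP input.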