Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models

O. V. Usatenko, S. S. Melnyk, G. M. Pritula

TL;DR

The paper explores a theoretically feasible approximation of LLM dynamics by N-order additive Markov chains, in which the conditional probability of the next token decomposes into a superposition of contributions from multiple historical depths, reducing the combinatorial explosion typically associated with high-order Markov processes.

Abstract

Large-scale language models (LLMs) operate in extremely high-dimensional state spaces, where both token embeddings and their hidden representations create complex dependencies that are not easily reduced to classical Markov structures. In this paper, we explore a theoretically feasible approximation of LLM dynamics using N-order additive Markov chains. Such models allow the conditional probability of the next token to be decomposed into a superposition of contributions from multiple historical depths, reducing the combinatorial explosion typically associated with high-order Markov processes. The main result of the work is a correspondence between an additive multi-step chain and a chain with a step-wise memory function. This equivalence makes it possible to introduce the concept of information temperature not only for step-wise but also for additive N-order Markov chains.

Paper Structure (17 sections, 40 equations, 3 figures)

Figures (3)

  • Figure 1: The correlation function $K(r)$ of the additive Markov chain constructed using the memory function $F(r)$, Eq. \ref{eqmf} (shown in the inset), with memory length $r=N=10$ and parameters $\overline{a}=1/2$ and $F_0=0.15$. The solid line represents the numerical solution of equation \ref{KorrBin}. The dots show the values computed from definition \ref{KorrDef} on a numerical sequence generated with CPDF \ref{CondPr_power}.
  • Figure 2: The dependence of the inverse temperature $\tau^{-1}$, defined by Eqs. \ref{All_tau} and \ref{Mu2}, for the additive Markov chains with CPDF Eq. \ref{CondPr_power1} and memory function \ref{eqmf} for $N=5,\,8,\,20$ (the corresponding lines are marked in the legend). The values of the parameter $F_0$ at which the inverse temperature tends asymptotically to infinity are determined by condition \ref{def_ergod}, i.e., $|F_0| \sum_{r=1}^N \left(1 - \dfrac{r}{N}\right)=1.$
  • Figure 3: The lower curve shows the dependence of the conditional entropies defined by Eqs. \ref{entro_block} and \ref{ShennEntr} for the additive Markov chain with CPDF Eq. \ref{CondPr_power1} and memory function \ref{eqmf} with $N=10$ and $F_0 = 0.15$. The parameter $\mu = 0.345$, calculated from equation \ref{Mu2}, gives the entropy of the step-wise chain, represented by the upper curve.
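
The setup described in the captions of Figures 1 and 2 can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation. It assumes a binary chain with the linear additive conditional probability $P(a_i=1\mid\cdot)=\overline{a}+\sum_{r=1}^{N}F(r)\,(a_{i-r}-\overline{a})$, and a memory function $F(r)=F_0\,(1-r/N)$ inferred from the condition quoted in the Figure 2 caption. The paper's own equations (the CPDF and memory function referenced in the captions) are not reproduced in this excerpt and may differ. All function names are hypothetical.

```python
import numpy as np


def memory_function(N, F0):
    """Assumed memory function F(r) = F0 * (1 - r/N) for r = 1..N (inferred, not from the paper)."""
    r = np.arange(1, N + 1)
    return F0 * (1.0 - r / N)


def critical_F0(N):
    """|F0| solving |F0| * sum_{r=1}^N (1 - r/N) = 1, the condition quoted in the Figure 2 caption."""
    r = np.arange(1, N + 1)
    return 1.0 / np.sum(1.0 - r / N)


def generate_chain(length, F, a_bar=0.5, seed=0):
    """Generate a binary sequence from the assumed additive conditional probability."""
    F = np.asarray(F, dtype=float)
    N = len(F)
    # The conditional probability must stay in [0, 1] for every possible history.
    spread = np.sum(np.abs(F)) * max(a_bar, 1.0 - a_bar)
    if a_bar - spread < 0.0 or a_bar + spread > 1.0:
        raise ValueError("memory function too strong: probability leaves [0, 1]")
    rng = np.random.default_rng(seed)
    a = np.empty(length + N, dtype=np.int8)
    a[:N] = rng.random(N) < a_bar          # uncorrelated initial window
    u = rng.random(length)
    for i in range(N, length + N):
        past = a[i - N:i][::-1]            # a[i-1], ..., a[i-N], aligned with F(1..N)
        p_one = a_bar + np.dot(F, past - a_bar)
        a[i] = u[i - N] < p_one
    return a[N:]                           # drop the initial window


def correlation(a, max_r):
    """Empirical K(r) = <(a_i - mean)(a_{i+r} - mean)> for r = 0..max_r."""
    x = a - a.mean()
    n = len(x)
    return np.array([np.dot(x[:n - r], x[r:]) / (n - r) for r in range(max_r + 1)])


if __name__ == "__main__":
    F = memory_function(N=10, F0=0.15)     # parameters quoted in the Figure 1 caption
    seq = generate_chain(200_000, F)
    print(correlation(seq, 15))
    for N in (5, 8, 20):                   # values of N quoted in the Figure 2 caption
        print(N, critical_F0(N))
```

Because $\sum_{r=1}^{N}(1-r/N)=(N-1)/2$, the condition in the Figure 2 caption reduces to $|F_0|=2/(N-1)$, which is what `critical_F0` evaluates. The simulation loop is deliberately simple, so long sequences take a few seconds to generate.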