Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models

Bahman Torkamandi

TL;DR

This work investigates whether trainability boundaries in decoder-only transformers exhibit fractal, self-similar structure in the hyperparameter space. It introduces a consistent convergence measure and analyzes the learning-rate landscape across attention and feed-forward layers at multiple scales, reporting self-similar, chaotic boundaries with fractal dimensions around $1.9772$ across several granularities. The findings highlight scale-invariant sensitivity in training dynamics and suggest that such fractal trainability features may persist in larger autoregressive models, motivating broader future studies. The approach combines a rigorous convergence criterion with visualized hyperparameter landscapes to illuminate the complex structure of trainability in modern transformer architectures.

Abstract

In the realm of fractal geometry, intricate structures emerge from simple iterative processes that partition parameter spaces into regions of stability and instability. Likewise, training large language models involves iteratively applying update functions, such as Adam, where even slight hyperparameter adjustments can shift the training process from convergence to divergence. Recent evidence from miniature neural networks suggests that the boundary separating these outcomes displays fractal characteristics. Building on these insights, this study extends them to medium-sized, decoder-only transformer architectures by employing a more consistent convergence measure and examining the learning rate hyperparameter landscape for attention and fully connected layers. The results show that the trainability frontier is not a simple threshold; rather, it forms a self-similar yet seemingly random structure at multiple scales, with statistically consistent and repeating patterns. Within this landscape, a region of stable convergence is surrounded by a complex chaotic border, illustrating the sensitive nature of the underlying training dynamics.
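
As a rough illustration of the kind of experiment the abstract describes, the sketch below sweeps two learning rates over a grid and records a binary converged/diverged label for each pair. Everything in it is an assumption made for illustration: the toy loss $0.5(ab-1)^2$ stands in for the transformer, plain gradient descent stands in for Adam, and the names `train_toy` and `sweep`, the step budget, and the thresholds are hypothetical and are not the paper's convergence measure.

```python
import numpy as np

def train_toy(lr_a: float, lr_b: float, steps: int = 200) -> bool:
    """Return True if gradient descent on 0.5*(a*b - 1)^2 converges.

    Hypothetical stand-in: `a` and `b` play the role of two parameter groups
    (for example attention and fully connected layers), each with its own
    learning rate.
    """
    a, b = 0.5, 0.5
    for _ in range(steps):
        err = a * b - 1.0
        loss = 0.5 * err * err
        # Assumed divergence rule: the loss is non-finite or exceeds a large cap.
        if not np.isfinite(loss) or loss > 1e12:
            return False
        # Simultaneous gradient step with a separate learning rate per parameter.
        a, b = a - lr_a * err * b, b - lr_b * err * a
    err = a * b - 1.0
    final_loss = 0.5 * err * err
    # Assumed convergence rule: the final loss is finite and below a tolerance.
    # Slow runs that miss the tolerance within the step budget are also
    # labelled as not converged.
    return bool(np.isfinite(final_loss) and final_loss < 1e-6)

def sweep(lrs_a, lrs_b) -> np.ndarray:
    """Binary heatmap: entry [i, j] is True when (lrs_a[i], lrs_b[j]) converges."""
    return np.array([[train_toy(la, lb) for lb in lrs_b] for la in lrs_a])

# Log-spaced grid of learning rates for both parameter groups.
grid = sweep(np.logspace(-2, 2, 9), np.logspace(-2, 2, 9))
print(grid.astype(int))
```

Refining the grid around the cells where the label flips is the step that would reveal whether the boundary keeps showing structure at smaller scales, which is the question the study poses for real transformer training runs.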

Paper Structure (10 sections, 5 equations, 25 figures)

Figures (25)

  • Figure 1: Fractal-like Dust Clusters From Monte Carlo Random Placements [kaye1994random].
  • Figure 2: Architecture of The Language Model.
  • Figure 3: Loss Functions Illustrating Various Convergence Behaviors.
  • Figure 4: Convergence Measure Binary Heatmap, Granularity $10^{-5}$.
  • Figure 5: Boundaries Between Convergence and Divergence Regions, Granularity $10^{-5}$, Box-Count Dimension: $1.9772$.
  • ...and 20 more figures
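
The figure captions report a box-count dimension for the boundary between convergence and divergence regions. A minimal sketch of how such an estimate can be computed from a binary convergence heatmap follows. The function names `boundary_mask` and `box_count_dimension`, the neighbour-based boundary definition, and the choice of box sizes are assumptions for illustration; the paper's exact procedure is not given in this summary.

```python
import numpy as np

def boundary_mask(converged: np.ndarray) -> np.ndarray:
    """Mark cells that have a horizontal or vertical neighbour with a different label."""
    edge = np.zeros(converged.shape, dtype=bool)
    horiz = converged[:, :-1] != converged[:, 1:]
    vert = converged[:-1, :] != converged[1:, :]
    # Flag both cells on either side of each label change.
    edge[:, :-1] |= horiz
    edge[:, 1:] |= horiz
    edge[:-1, :] |= vert
    edge[1:, :] |= vert
    return edge

def box_count_dimension(mask: np.ndarray, box_sizes=(1, 2, 4, 8, 16)) -> float:
    """Estimate the box-counting dimension of the True cells in `mask`.

    Assumes the mask is non-empty, so every box count is positive.
    """
    n_rows, n_cols = mask.shape
    counts = []
    for s in box_sizes:
        # Crop so that the array tiles exactly into s-by-s boxes.
        r, c = (n_rows // s) * s, (n_cols // s) * s
        blocks = mask[:r, :c].reshape(r // s, s, c // s, s)
        # A box is occupied if any cell inside it is True.
        counts.append(blocks.any(axis=(1, 3)).sum())
    # The dimension is the slope of log N(s) against log(1/s).
    slope, _ = np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)
    return float(slope)

# Sanity check on a heatmap split into two halves: the boundary is a straight line.
converged = np.zeros((64, 64), dtype=bool)
converged[:, :32] = True
line_dim = box_count_dimension(boundary_mask(converged))
print(line_dim)
```

Under this estimator a smooth boundary gives a value near 1 and a fully filled region gives a value near 2, so a reported value such as $1.9772$ would indicate a boundary that occupies nearly the whole sampled area at the measured scales.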