
HoloByte: Continuous Hyperspherical Distillation for Tokenizer-Free Modeling

Vladimer Khasia

Abstract

Sequence modeling universally relies on discrete subword tokenization to circumvent the $\mathcal{O}(N^2)$ computational intractability of native byte-level attention. However, this heuristic quantization imposes artificial morphological boundaries, enforces vocabulary dependence, and fractures the continuity of the optimization landscape. To resolve this dichotomy, we introduce \textbf{HoloByte}: a strictly tokenizer-free framework utilizing Continuous Hyperspherical Distillation. HoloByte partitions discrete byte sequences into fixed-capacity chunks of $W$ bytes and projects them into a continuous, strictly bounded hyperspherical manifold via an invertible, dimension-preserving orthogonal rotation operator. This spatial superposition allows a macroscopic transformer to operate exclusively on compressed continuous representations, formally reducing the exact attention time complexity from $\mathcal{O}(N^2 D)$ to $\mathcal{O}\left( \frac{N^2}{W^2}D + ND^2 \right)$. A localized causal micro-decoder subsequently unbinds these representations to compute exact byte-level distributions. To govern this continuous trajectory, we propose a dual-objective formulation incorporating a mathematically precise Holographic Latent Mean Squared Error, which strictly bounds the gradient and guarantees asymptotic stability. Theoretically, we derive the minimal embedding dimension $D = \Omega(W \ln |\mathcal{V}|)$ required to ensure error-free discrete recovery from the continuous manifold. Empirically, under strictly matched parameter constraints, HoloByte systematically outperforms a comparable discrete Byte-Pair Encoding (BPE) baseline. These results establish Continuous Hyperspherical Distillation as a mathematically rigorous and computationally tractable foundation for vocabulary-invariant sequence modeling. The code is available at https://github.com/VladimerKhasia/HoloByte
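
To make the two scaling claims in the abstract concrete, the following is a minimal arithmetic sketch rather than anything taken from the HoloByte code base. It evaluates the leading-order terms $N^2 D$ and $\frac{N^2}{W^2}D + ND^2$ with all hidden constants dropped, and the quantity $W \ln |\mathcal{V}|$ that the dimension bound grows with. The function names and the example values of $N$, $W$, and $D$ are assumptions chosen for illustration; $|\mathcal{V}| = 256$ assumes a raw byte vocabulary.

```python
import math


def byte_attention_cost(n: int, d: int) -> int:
    """Leading-order term N^2 * D for attention over N raw bytes (constants dropped)."""
    return n * n * d


def chunked_cost(n: int, w: int, d: int) -> float:
    """Leading-order term (N^2 / W^2) * D + N * D^2 quoted in the abstract (constants dropped)."""
    return (n * n / (w * w)) * d + n * d * d


def recovery_dim_scale(w: int, vocab_size: int = 256) -> float:
    """W * ln|V|: the growth rate in D = Omega(W ln|V|); the hidden constant is not known here."""
    return w * math.log(vocab_size)


if __name__ == "__main__":
    # Illustrative values only; they are not the paper's experimental settings.
    n, w, d = 65536, 8, 512
    full = byte_attention_cost(n, d)
    chunked = chunked_cost(n, w, d)
    print(f"byte-level term : {full}")
    print(f"chunked term    : {chunked:.0f}")
    print(f"ratio           : {full / chunked:.2f}")
    print(f"W ln|V| for W={w}: {recovery_dim_scale(w):.2f}")
```

For these illustrative values the ratio of the two terms is $128/3 \approx 42.7$. The $ND^2$ term dominates the chunked cost whenever $N < D W^2$, so the advantage over byte-level attention grows with sequence length.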

Paper Structure

This paper contains 18 sections, 4 theorems, 7 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Assume the vectors comprising $\mathbf{M}$ are distributed isotropically on $\mathbb{S}^{D-1}$. Then the expected squared $L_2$-norm of the interference term is bounded: $\mathbb{E}[\|\boldsymbol{\epsilon}_{t,i}\|_2^2] = \mathcal{O}(1)$.
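
The flavor of this bound can be checked numerically with a small Monte Carlo sketch. This excerpt does not define $\boldsymbol{\epsilon}_{t,i}$ or the rotation operator, so everything below is an assumption: a cyclic shift stands in for the orthogonal positional rotation (a permutation is orthogonal), $\mathbf{M}$ is taken to be the plain sum of $W$ rotated unit vectors, and the interference is whatever remains after unbinding one position and subtracting its own vector. Under these assumptions the cross terms vanish in expectation, so the expected squared norm is $W - 1$ regardless of $D$. The paper's normalization may differ.

```python
import math
import random


def random_unit_vector(rng: random.Random, dim: int) -> list:
    """Draw a vector uniformly from the unit sphere by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def shift(vec: list, k: int) -> list:
    """Cyclic shift by k slots: a permutation, hence an orthogonal, invertible map.

    Hypothetical stand-in for the paper's rotation operator.
    """
    k %= len(vec)
    return vec[-k:] + vec[:-k] if k else list(vec)


def mean_interference_sq_norm(dim: int, chunk: int, target: int = 0,
                              trials: int = 300, seed: int = 0) -> float:
    """Estimate E[||eps||^2] when unbinding position `target` from a sum of `chunk` vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [random_unit_vector(rng, dim) for _ in range(chunk)]
        # Bind: superpose each vector after rotating it by its position index.
        memory = [0.0] * dim
        for j, x in enumerate(xs):
            for idx, c in enumerate(shift(x, j)):
                memory[idx] += c
        # Unbind: apply the inverse rotation, then remove the target's own vector.
        recovered = shift(memory, -target)
        eps = [r - c for r, c in zip(recovered, xs[target])]
        total += sum(e * e for e in eps)
    return total / trials


if __name__ == "__main__":
    for dim in (32, 64, 128):
        print(dim, round(mean_interference_sq_norm(dim, chunk=8), 3))
```

With a chunk size of 8, the estimates should stay close to 7 for every dimension tried, which is the sense in which the interference norm does not grow with $D$ in this toy setting.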

Figures (2)

  • Figure 1: Empirical evaluation of Absolute Information Compression over continuous training steps. The $y$-axis denotes the normalized average nats per byte. The continuous native representation (HoloByte, red) achieves a strictly lower theoretical entropy bound compared to the discrete BPE transformation (Baseline, blue), demonstrating superior manifold modeling capacity free from the quantization artifacts of tokenization.
  • Figure 2:

Theorems & Definitions (9)

  • Definition 1: Orthogonal Positional Rotation
  • Lemma 1: Interference Norm Bound
  • Proof of Lemma 1
  • Theorem 1: Dimensionality Lower Bound for Error-Free Recovery
  • Proof of Theorem 1
  • Lemma 2: Time Complexity
  • Proof of Lemma 2
  • Lemma 3: Space Complexity
  • Proof of Lemma 3