HoloByte: Continuous Hyperspherical Distillation for Tokenizer-Free Modeling

Vladimer Khasia

Abstract

Sequence modeling universally relies on discrete subword tokenization to circumvent the $\mathcal{O}(N^2)$ computational intractability of native byte-level attention. However, this heuristic quantization imposes artificial morphological boundaries, enforces vocabulary dependence, and fractures the continuity of the optimization landscape. To resolve this dichotomy, we introduce \textbf{HoloByte}: a strictly tokenizer-free framework utilizing Continuous Hyperspherical Distillation. HoloByte partitions discrete byte sequences into fixed-capacity chunks of $W$ bytes and projects them into a continuous, strictly bounded hyperspherical manifold via an invertible, dimension-preserving orthogonal rotation operator. This spatial superposition allows a macroscopic transformer to operate exclusively on compressed continuous representations, formally reducing the exact attention time complexity from $\mathcal{O}(N^2D)$ to $\mathcal{O}\left( \frac{N^2}{W^2}D + ND^2 \right)$. A localized causal micro-decoder subsequently unbinds these representations to compute exact byte-level distributions. To govern this continuous trajectory, we propose a dual-objective formulation incorporating a mathematically precise Holographic Latent Mean Squared Error, which strictly bounds the gradient and guarantees asymptotic stability. Theoretically, we derive the minimal embedding dimension $D = \Omega(W \ln |\mathcal{V}|)$ required to ensure error-free discrete recovery from the continuous manifold. Empirically, under strictly matched parameter constraints, HoloByte systematically outperforms a comparable discrete Byte-Pair Encoding (BPE) baseline. These results establish Continuous Hyperspherical Distillation as a mathematically rigorous and computationally tractable foundation for vocabulary-invariant sequence modeling. The code is available at https://github.com/VladimerKhasia/HoloByte
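To make the two scaling expressions in the abstract concrete, the following minimal Python sketch evaluates them at leading order. It is an illustration only: constant factors are dropped, the function names are hypothetical, and the values of $N$, $W$, and $D$ are assumed for the example rather than taken from the paper's experiments. The constant hidden inside $\Omega(\cdot)$ is not given in the abstract, so `dimension_scale` reports only the growth term $W \ln |\mathcal{V}|$.

```python
import math


def byte_attention_cost(n_bytes: int, dim: int) -> int:
    """Leading-order cost N^2 * D of full attention over raw bytes."""
    return n_bytes ** 2 * dim


def chunked_cost(n_bytes: int, chunk: int, dim: int) -> int:
    """Leading-order cost (N / W)^2 * D + N * D^2 quoted in the abstract.

    Assumes the chunk size W divides the sequence length N exactly.
    """
    if n_bytes % chunk != 0:
        raise ValueError("chunk size must divide the sequence length")
    n_chunks = n_bytes // chunk
    return n_chunks ** 2 * dim + n_bytes * dim ** 2


def dimension_scale(chunk: int, vocab_size: int = 256) -> float:
    """Growth term W * ln|V| from D = Omega(W ln|V|); the constant is unknown."""
    return chunk * math.log(vocab_size)


# Hypothetical example values (not from the paper).
N, W, D = 4096, 8, 512
speedup = byte_attention_cost(N, D) / chunked_cost(N, W, D)
print(f"leading-order cost ratio: {speedup:.2f}x")  # 7.11x
print(f"W ln|V| for byte vocabulary: {dimension_scale(W):.2f}")  # 44.36
```

With these assumed values the $ND^2$ term dominates the chunked cost, so the ratio is $64/9$ rather than the $W^2 = 64$ that the quadratic term alone would suggest. The quadratic saving becomes visible only when $N / W^2$ is large relative to $D$.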

Paper Structure

This paper contains 18 sections, 4 theorems, 7 equations, 2 figures, 2 tables, and 1 algorithm.

Introduction
Methodology
Problem Formulation and Mathematical Foundation
Continuous Hyperspherical Distillation (HoloByte)
Holographic Encoding
Hyperspherical Unbinding and Micro-Decoding
Dual-Objective Loss Formulation
Inference
Algorithm Specification
Theoretical Bounds on Superposition Capacity and Scaling Constraints
Asymptotic Scaling Laws for Extradimensional Manifolds
Complexity Analysis
Experimental Setup and Evaluation
Corpus Definition and Optimization Configuration
Architectural Configurations and Parameter Parity
...and 3 more sections

Key Result

Lemma 1

Assume the vectors comprising $\mathbf{M}$ are distributed isotropically on $\mathbb{S}^{D-1}$. The expected squared $L_2$-norm of the interference term is strictly bounded by $\mathbb{E}[\|\boldsymbol{\epsilon}_{t,i}\|_2^2] = \mathcal{O}(1)$.
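The lemma can be checked numerically under a simplifying assumption. The excerpt does not define $\boldsymbol{\epsilon}_{t,i}$ explicitly, so the sketch below assumes it is the sum of the other $W - 1$ unit vectors in a chunk after unbinding, each independent and isotropic on $\mathbb{S}^{D-1}$ (an orthogonal rotation preserves both norm and isotropy). Under that assumption the cross terms vanish in expectation, so the expected squared norm is exactly $W - 1$ regardless of $D$. All function names and parameter values are hypothetical.

```python
import math
import random


def random_unit_vector(dim: int, rng: random.Random) -> list:
    """Draw a point uniformly from the unit sphere by normalizing a Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_interference_sq_norm(n_terms: int, dim: int, trials: int,
                              rng: random.Random) -> float:
    """Monte Carlo estimate of E[||sum of n_terms isotropic unit vectors||^2]."""
    total = 0.0
    for _ in range(trials):
        acc = [0.0] * dim
        for _ in range(n_terms):
            u = random_unit_vector(dim, rng)
            for k in range(dim):
                acc[k] += u[k]
        total += sum(x * x for x in acc)
    return total / trials


rng = random.Random(0)  # fixed seed for reproducibility
W = 8                   # assumed chunk size, so W - 1 interfering vectors
estimates = {
    dim: mean_interference_sq_norm(W - 1, dim, 400, rng)
    for dim in (16, 64, 256)
}
for dim, est in estimates.items():
    print(f"D = {dim:3d}: estimated E||eps||^2 = {est:.2f} (theory: {W - 1})")
```

The estimates should stay close to $W - 1$ as $D$ grows, which is consistent with a bound that is $\mathcal{O}(1)$ in the embedding dimension for fixed $W$. If the paper normalizes the superposition (for example by $1/\sqrt{W}$), the constant changes but the independence from $D$ does not.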

Figures (2)

Figure 1: Empirical evaluation of Absolute Information Compression over continuous training steps. The $y$-axis denotes the normalized average nats per byte. The continuous native representation (HoloByte, red) achieves a strictly lower theoretical entropy bound compared to the discrete BPE transformation (Baseline, blue), demonstrating superior manifold modeling capacity free from the quantization artifacts of tokenization.

Theorems & Definitions (9)

Definition 1: Orthogonal Positional Rotation
Lemma 1: Interference Norm Bound
proof
Theorem 1: Dimensionality Lower Bound for Error-Free Recovery
proof
Lemma 2: Time Complexity
proof
Lemma 3: Space Complexity
proof
