Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

A. Bochkov

TL;DR

The paper questions whether trainable input embeddings are necessary for semantic understanding in Transformer LMs. It evaluates a fully frozen embedding regime in which token vectors are derived from the visual structure of Unicode glyphs and paired with a Unicode-centric tokenizer. It introduces a universal embedding pipeline controlled by fixed PCA projections and a bijective tokenization scheme (bvv241), enabling universal text coverage while separating token form from semantic learning. Despite this drastic constraint, the frozen-embedding model converges and, on the MMLU reasoning benchmark under limited data, even outperforms the architecturally identical trainable-embedding baseline. This supports the claim that high-level semantics emerge from the Transformer's compositional architecture rather than from the input embeddings. The authors attribute the baseline's deficit to "representational interference," in which a trainable embedding layer is burdened with learning both structural and semantic features. The work reframes embeddings as structural primitives, suggests efficiency and cross-lingual benefits, and provides open-source code and models to catalyze further research into architecture-driven semantic emergence.

Abstract

Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
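
As a rough illustration of the frozen-embedding idea described above, the following minimal Python sketch builds a fixed lookup table from glyph-like bitmaps with a PCA projection and then treats that table as read-only. It is not the authors' pipeline: the random bitmaps stand in for rendered Unicode glyphs, and the names `pca_project`, `build_frozen_table`, and `embed`, as well as the vocabulary size, bitmap size, and embedding dimension, are hypothetical choices made for this example.

```python
import numpy as np


def pca_project(bitmaps: np.ndarray, dim: int) -> np.ndarray:
    """Project flattened bitmaps onto their top `dim` principal components."""
    flat = bitmaps.reshape(bitmaps.shape[0], -1).astype(np.float64)
    centered = flat - flat.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T


def build_frozen_table(bitmaps: np.ndarray, dim: int) -> np.ndarray:
    """Precompute the embedding table once and mark it read-only."""
    table = pca_project(bitmaps, dim)
    # "Frozen" here means the table is never updated after construction.
    table.setflags(write=False)
    return table


def embed(table: np.ndarray, token_ids) -> np.ndarray:
    """Plain lookup: each token id selects one fixed row of the table."""
    return table[np.asarray(token_ids)]


# Assumption: random binary images stand in for rendered Unicode glyphs.
rng = np.random.default_rng(0)
VOCAB, SIDE, DIM = 64, 8, 16  # hypothetical sizes for illustration only
glyph_bitmaps = (rng.random((VOCAB, SIDE, SIDE)) > 0.5).astype(np.uint8)

frozen_table = build_frozen_table(glyph_bitmaps, DIM)
vectors = embed(frozen_table, [3, 10, 3])
print(vectors.shape)
```

In a training setup, the equivalent step would be to exclude this table from the optimizer so that only the Transformer layers receive gradient updates; the sketch shows only the precomputation and lookup.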

Paper Structure

This paper contains 26 sections, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Average characters per token. Lower values indicate lower compression efficiency. Our bvv241 variants prioritize character-level granularity.
  • Figure 2: Learning curves for frozen vs. trainable embedding models.
  • Figure 3: MMLU and ARC-e curves for frozen vs. trainable embedding models.
  • Figure 4: Performance Comparison: MMLU subtasks with scores greater than 25%.
  • Figure 5: Performance Comparison: Frozen vs. Trainable Embedding and SOTA SmolLM Models.
  • ...and 9 more figures