T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
TL;DR
The paper tackles tokenizer-induced overhead and reference-corpus bias in LLMs by introducing T-FREE, a tokenizer-free approach that encodes each word as a sparse activation pattern over hashed character trigrams. A word of length $n$ produces $n \cdot m$ activations, where $m$ is the number of active descriptors per trigram, and these activations select rows of an embedding matrix of size $v \times h$, where $v$ is the vocabulary size and $h$ is the hidden dimension. Because $v$ can be kept small, the embeddings are memory-efficient. The model is trained with a multi-label BCE loss over the same $n \cdot m$ activations. Key results: embedding and LM-head parameter counts can be reduced by up to roughly 87.5%, downstream performance stays competitive at small vocabularies (e.g., $v=8k$), and cross-lingual transfer improves without reliance on a reference corpus. The approach yields memory-efficient LLM backbones, speeds up language adaptation, and supports flexible decoding via an exchangeable dictionary, with potential for hybrid tokenization setups in practical deployments.
Abstract
Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
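The sparse trigram encoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`trigrams`, `active_rows`, `embed`, `multi_hot`), the whitespace padding, the SHA-256-based hashing, the choice $m=4$, and the summation of the selected rows are all assumptions made for illustration.

```python
import hashlib
import numpy as np

def trigrams(word):
    # Pad with one space on each side so a word of n characters
    # yields exactly n trigrams (padding scheme is an assumption).
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def active_rows(word, v=8000, m=4):
    # Map every trigram to m row indices in [0, v) using m salted hashes.
    # SHA-256 is a stand-in; the source does not name a hash function.
    rows = set()
    for tri in trigrams(word):
        for k in range(m):
            digest = hashlib.sha256(f"{k}:{tri}".encode("utf-8")).digest()
            rows.add(int.from_bytes(digest[:8], "big") % v)
    return sorted(rows)  # at most n * m distinct rows

def embed(word, E, m=4):
    # Assumed aggregation: sum the selected rows of the v x h matrix E.
    return E[active_rows(word, E.shape[0], m)].sum(axis=0)

def multi_hot(word, v=8000, m=4):
    # Multi-label target of length v, usable with a BCE loss.
    target = np.zeros(v, dtype=np.float32)
    target[active_rows(word, v, m)] = 1.0
    return target

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(8000, 16)).astype(np.float32)  # v = 8000, h = 16
    print(trigrams("word"))
    print(len(active_rows("word")), embed("word", E).shape)
```

Under these assumptions, words that share trigrams (such as "word" and "words") share active rows, which is one way to read the abstract's claim that morphological similarity is exploited without a reference corpus.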
