T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach

TL;DR

The paper tackles tokenizer-induced overhead and corpus bias in LLMs by introducing T-Free, a tokenizer-free approach that encodes each word as a sparse activation pattern over hashed character trigrams. A word of length $n$ activates $n \cdot m$ rows of an embedding matrix of size $v \times h$, where $v$ is the vocabulary size, $h$ is the hidden size, and $m$ is the number of activations per trigram; the model is trained with a multi-label BCE loss over these activations. Key results: embedding and LM-head parameter counts can be reduced by up to approximately 87.5%, downstream performance remains competitive at small vocabularies (e.g., $v=8k$), and cross-lingual transfer improves without reliance on a reference corpus. The approach yields memory-efficient LLM backbones, facilitates faster language adaptation, and supports flexible decoding via an exchangeable dictionary, with potential for hybrid tokenization setups in practical deployments.
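
The encoding idea summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the padding with one space per side, the salted SHA-256 hashing, and the function names `trigrams`, `trigram_activations`, and `word_pattern` are all hypothetical choices made here; only the use of hashed character trigrams with $m$ activations each in a vocabulary of size $v$ comes from the summary.

```python
import hashlib


def trigrams(word: str) -> list[str]:
    """Split a word into character trigrams after padding it with one space per side.

    The padding scheme is an assumption of this sketch.
    """
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_activations(trigram: str, v: int, m: int) -> list[int]:
    """Map one trigram to m row indices in [0, v) using m salted hashes.

    SHA-256 with an integer salt is a stand-in; the paper's hash is not specified here.
    """
    indices = []
    for salt in range(m):
        digest = hashlib.sha256(f"{salt}:{trigram}".encode("utf-8")).digest()
        indices.append(int.from_bytes(digest[:8], "big") % v)
    return indices


def word_pattern(word: str, v: int = 8000, m: int = 10) -> set[int]:
    """Sparse activation pattern of a word: the union of its trigram activations."""
    active: set[int] = set()
    for tri in trigrams(word):
        active.update(trigram_activations(tri, v, m))
    return active


hello = word_pattern("Hello")
hell = word_pattern("Hell")
# Words sharing trigrams share activations, so their patterns overlap.
shared = hello & hell
```

A word of $n$ characters yields $n$ trigrams and therefore at most $n \cdot m$ active indices (fewer if hashes collide). Because "Hello" and "Hell" share the trigrams " He", "Hel", and "ell", their patterns overlap, which is the morphological similarity the summary refers to.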

Abstract

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
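
The size of the reduction on the embedding and head layers can be checked with simple arithmetic. The sketch below assumes untied embedding and LM-head matrices of size $v \times h$ each and compares the $64k$ baseline vocabulary with the $8k$ T-Free vocabulary mentioned in the Figure 3 caption; the hidden size of 2048 and the function names are hypothetical and only serve the illustration, since the ratio does not depend on $h$.

```python
def embedding_and_head_params(v: int, h: int) -> int:
    """Parameters in an untied embedding matrix (v x h) plus an LM head (h x v)."""
    return 2 * v * h


def relative_reduction(v_baseline: int, v_small: int, h: int) -> float:
    """Fraction of embedding and head parameters saved by shrinking the vocabulary."""
    base = embedding_and_head_params(v_baseline, h)
    small = embedding_and_head_params(v_small, h)
    return 1.0 - small / base


# h = 2048 is a hypothetical hidden size; it cancels out of the ratio.
print(relative_reduction(64_000, 8_000, h=2048))  # 0.875
```

Under these assumptions, moving from $64k$ to $8k$ entries removes 87.5% of the parameters in these two layers, which is consistent with the "more than 85%" stated in the abstract.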

Paper Structure

This paper contains 31 sections, 1 equation, 16 figures, 10 tables, 5 algorithms.

Figures (16)

  • Figure 1: Method comparison of classic Tokenization (left) and T-Free (right) for text encoding (top) and decoding (bottom). Classic subword tokenizers learn a single-label vocabulary, i.e. a token is bijectively mapped into a single entry of the vocabulary. Instead, T-Free uses a bijective multi-label mapping over multiple activations of hashed character trigrams. As T-Free explicitly models morphological similarities, it enables compression of the embedding layer.
  • Figure 2: Example of next-word prediction with T-Free. For the list of $d$ predictable words, we generate the corresponding activation patterns once within the available vocabulary size $v$, as described in encoding step $2$ of Sec. \ref{sec:encoding}. Note how morphologically close words generate overlapping patterns. The element-wise sigmoid values of the output of the last hidden layer, $\sigma(h)$, are multiplied with this pattern matrix using a standard dot product. Finally, we use $h'$, the average sigmoid value of each word, for the sampling process. Cf. App. \ref{app:mesa_algo} for further details.
  • Figure 3: Hyperparameter search for the vocabulary size of T-Free on a series of 1B ablations. We fix the number of activations at $m=10$ and do not apply lowercase overlap ($k=0$). The boxplots show the differences between the trained models and a $64k$ unigram baseline on 18 downstream benchmarks (0-shot). In median, T-Free outperforms the classic tokenizer architecture with a vocabulary reduced to $8k$ entries ($12.5\%$ of the baseline).
  • Figure 4: Continual pre-training performance. $3B$ models are trained on English slimpajama data for $90k$ steps ("baseline") and then continued on German occiglot data for $20k$ steps. Plotted are the average scores of two benchmarks available in both German and English: Hellaswag and Arc-Challenge. Notably, T-Free already outperforms in German at the baseline stage. Within $20k$ continued steps, T-Free improves by $5\%$ on average in the 0-shot and 2-shot settings, while the classic tokenizer approach barely improves. Both models lose some performance in English, the tokenizer version more drastically. Full evaluations are found in Appendix Tab. \ref{tab:evalende}, \ref{tab:evalen1}, and \ref{tab:evalen2}.
  • Figure 5: Exemplary comparison of the training loss curves of a classic tokenizer ($v=64k$, top) and T-Free ($v=16k$, bottom). Overall, we observed less spiky training behavior when using T-Free. Both 3B models were trained on the same slimpajama data with the same token batch size and a learning rate of 4.5e-4.
  • ...and 11 more figures
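
The decoding step described in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: each word's pattern is stored as a set of active indices (equivalent to one row of a binary pattern matrix of shape $d \times v$), the score $h'$ is the mean sigmoid value over that word's active indices, and the three-word dictionary, the logits, and the name `word_scores` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-x))


def word_scores(logits: list[float], patterns: dict[str, set[int]]) -> dict[str, float]:
    """Average sigmoid activation over each word's active indices.

    This equals the dot product of sigmoid(logits) with the word's binary
    pattern row, divided by the number of active entries in that row.
    """
    probs = [sigmoid(x) for x in logits]
    return {
        word: sum(probs[i] for i in active) / len(active)
        for word, active in patterns.items()
    }


# Toy setup with v = 6 and hand-written (hypothetical) patterns.
patterns = {"house": {0, 1, 2}, "houses": {0, 1, 2, 3}, "tree": {4, 5}}
logits = [4.0, 4.0, 4.0, -4.0, -4.0, -4.0]

scores = word_scores(logits, patterns)
best = max(scores, key=scores.get)  # greedy choice; sampling would use the scores instead
```

In this toy setup the overlapping words "house" and "houses" both score well above "tree", and "house" scores highest because all of its active indices are strongly activated. Since the patterns are computed from the word list alone, the dictionary can be swapped without retraining, which is the "exchangeable dictionary" mentioned in the TL;DR.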