Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language
Sagar Tamang, Dibya Jyoti Bora
TL;DR
This study evaluates five state-of-the-art LLM tokenizers on the Assamese language using vocabulary size, average Normalized Sequence Length (NSL), and token counts. NSL is a compression-focused metric defined as $c_{\lambda/\beta} = \frac{\sum_{i=1}^N \mathrm{length}(T_{\lambda}(D_i))}{\sum_{i=1}^N \mathrm{length}(T_{\beta}(D_i))}$, where $T_{\lambda}$ is the tokenizer under evaluation, $T_{\beta}$ is a baseline tokenizer, and $D_i$ is the $i$-th of $N$ text samples; a lower value means fewer tokens relative to the baseline, which makes cross-model comparison possible. The main finding is that the SUTRA tokenizer from Two AI achieves the best average NSL (0.45), followed by GPT-4o (0.54), Gemma 2 (0.82), Llama 3.1 (1.4), and Mistral Large Instruct 2407 (1.48), reflecting differences in vocabulary coverage and script handling for Assamese. The work provides a public Hugging Face Space for tokenizer comparisons and discusses implications for multilingual, low-resource language processing in NMT and related NLP tasks.
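The NSL ratio defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name `normalized_sequence_length` and the two toy tokenizers are hypothetical stand-ins, and a real comparison would pass in the actual LLM tokenizers and Assamese text samples.

```python
from typing import Callable, Iterable, List

# A tokenizer is modeled as any callable mapping text to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def normalized_sequence_length(
    tok_lambda: Tokenizer, tok_beta: Tokenizer, docs: Iterable[str]
) -> float:
    """Ratio of total token counts: evaluated tokenizer over baseline tokenizer."""
    docs = list(docs)  # materialize so both tokenizers see the same samples
    evaluated_total = sum(len(tok_lambda(d)) for d in docs)
    baseline_total = sum(len(tok_beta(d)) for d in docs)
    if baseline_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return evaluated_total / baseline_total


# Toy stand-ins (assumptions for illustration only).
def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def character_tokenizer(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


docs = ["ab cd", "efg"]
nsl = normalized_sequence_length(whitespace_tokenizer, character_tokenizer, docs)
print(f"{nsl:.3f}")  # 3 tokens vs. 7 baseline tokens -> 0.429
```

A value below 1 means the evaluated tokenizer emits fewer tokens than the baseline on the same text; a value above 1 means it emits more.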
Abstract
How a tokenizer is trained plays an important role in the performance of deep learning models. This research examines the performance of the tokenizers of five state-of-the-art (SOTA) large language models (LLMs) on Assamese, a language of India. The study matters for understanding how well these models support a low-resource language such as Assamese. Our results show that the tokenizer of SUTRA from Two AI performs best, with an average Normalized Sequence Length (NSL) value of 0.45, closely followed by the tokenizer of GPT-4o from OpenAI with an average NSL value of 0.54, and then by Gemma 2, Meta Llama 3.1, and Mistral Large Instruct 2407 with average NSL values of 0.82, 1.4, and 1.48, respectively.
