Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language

Sagar Tamang; Dibya Jyoti Bora

TL;DR

This study evaluates five state-of-the-art LLM tokenizers on the Assamese language using vocabulary size, average Normalized Sequence Length (NSL), and token counts. NSL is a compression-focused metric defined as $c_{\lambda/\beta} = \frac{\sum_{i=1}^N \mathrm{length}(T_{\lambda}(D_i))}{\sum_{i=1}^N \mathrm{length}(T_{\beta}(D_i))}$, where $T_{\lambda}$ is the tokenizer under evaluation, $T_{\beta}$ is a baseline tokenizer, and $D_i$ is the $i$-th of $N$ text samples; lower values mean fewer tokens for the same text, which enables cross-model comparisons. The main finding is that SUTRA's tokenizer from Two AI achieves the best NSL (0.45), followed by GPT-4o (0.54), Gemma 2 (0.82), Llama 3.1 (1.4), and Mistral Large Instruct 2407 (1.48), reflecting varying coverage and script handling for Assamese. The work provides a public Hugging Face Space for tokenizer comparisons and discusses implications for multilingual, low-resource language processing in NMT and related NLP tasks.
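To make the NSL ratio concrete, the following is a minimal sketch of how it could be computed, not the authors' implementation. The function name `normalized_sequence_length` and the two toy tokenizers are assumptions introduced for illustration; they stand in for real tokenizers and are not any of the five evaluated in the paper. The sample sentences are likewise illustrative and not taken from the paper's dataset.

```python
from typing import Callable, Sequence

# A tokenizer is modelled as any callable mapping text to a sequence of tokens.
Tokenizer = Callable[[str], Sequence[str]]


def normalized_sequence_length(
    docs: Sequence[str], tokenizer: Tokenizer, baseline: Tokenizer
) -> float:
    """Total tokens produced by `tokenizer` divided by total tokens from `baseline`."""
    evaluated_total = sum(len(tokenizer(doc)) for doc in docs)
    baseline_total = sum(len(baseline(doc)) for doc in docs)
    if baseline_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return evaluated_total / baseline_total


# Hypothetical stand-in tokenizers (assumptions, not real LLM tokenizers).
def whitespace_tokenizer(text: str) -> list[str]:
    """Split on whitespace: a coarse tokenizer that yields few tokens."""
    return text.split()


def character_tokenizer(text: str) -> list[str]:
    """One token per non-space code point: a fine-grained baseline."""
    return [ch for ch in text if not ch.isspace()]


# Illustrative Assamese samples (not from the paper's evaluation data).
sample_docs = ["অসমীয়া ভাষা", "মই ভাত খাওঁ"]

nsl = normalized_sequence_length(sample_docs, whitespace_tokenizer, character_tokenizer)
print(f"NSL of whitespace vs. character tokenizer: {nsl:.2f}")
```

A value below 1 means the evaluated tokenizer emits fewer tokens than the baseline on the same text, and a value above 1 means it emits more. With real models, the two callables would wrap each model's own tokenizer, and the choice of baseline fixes the scale of every reported value.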

Abstract

Tokenizer training plays an important role in the performance of deep learning models. This research evaluates the tokenizers of five state-of-the-art (SOTA) large language models (LLMs) on Assamese, a language of India, in order to gauge their multilingual support for a low-resource language. Our results show that the SUTRA tokenizer from Two AI performs best, with an average Normalized Sequence Length (NSL) of 0.45, closely followed by the GPT-4o tokenizer from OpenAI at 0.54, and then by Gemma 2, Meta Llama 3.1, and Mistral Large Instruct 2407 at 0.82, 1.4, and 1.48, respectively.

Paper Structure (13 sections, 8 equations, 3 figures, 3 tables)

Introduction
Background
Challenges in Assamese
Literature Review
Tokenization in Low-Resource Languages
Assamese Language
Evaluation Metrics for Tokenizers
Gaps in the Literature
Methodology
Experiments and Results
Discussion
Conclusion
Acknowledgments

Figures (3)

Figure 1: Transformer architecture (Vaswani et al., 2017).
Figure 2: Detailed breakdown of the example text by the SUTRA tokenizer.
Figure 3: Performance of various LLM tokenizers on Assamese text (lower is better).
