Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago

Andhika Bernard Lumbantobing, Hokky Situngkir

TL;DR

The paper addresses the mismatch between English-centric subword tokenization and the phonological and morphological structure of Austronesian languages. It proposes a syllable-based tokenization framework grounded in aksara (traditional Indonesian scripts), formalizing a segmentation $P$ that maps input text to sequences of tokens of the form $U=(\omega,\nu,\kappa)$ drawn from a vocabulary $\Sigma$ of 2,843 tokens. Processing runs as a three-phase pipeline $P = \phi_{\text{clus}} \circ \phi_{\text{vir}} \circ \phi_{\text{scan}}$, alongside a character-based fallback. Evaluation on the NusaX-MT corpus uses the Token per Character (TPC) ratio and Smith-Waterman sequence alignment to assess how well cross-linguistic patterns are preserved; the syllable-based tokenizer yields higher token sequence similarity than GPT-2, by about $21\%$ on average. The approach produces uniform tokenization behavior across regional Austronesian languages and offers a linguistically principled foundation for multilingual Austronesian LLM development, with future work targeting broader language families and integration with historical manuscripts.
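To make the idea of vocabulary-driven segmentation with a character-based fallback concrete, here is a minimal Python sketch. It is not the paper's three-phase pipeline: the phases $\phi_{\text{scan}}$, $\phi_{\text{vir}}$, and $\phi_{\text{clus}}$ are not specified in this summary, so the sketch substitutes a plain greedy longest-match scan. The vocabulary `TOY_VOCAB`, the function names, and the `max_len` parameter are all hypothetical, and TPC is taken to mean token count divided by character count, as the name suggests.

```python
# Hypothetical toy vocabulary; NOT the paper's 2,843-token KBBI vocabulary.
TOY_VOCAB = {"sa", "ya", "ma", "kan", "na", "si", "ba", "ca", "bu", "ku"}


def tokenize(text, vocab=TOY_VOCAB, max_len=4):
    """Greedy longest-match segmentation with a character-level fallback.

    This is an illustrative stand-in, not the authors' procedure.
    """
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                match = piece
                break
        if match is None:
            # Fallback: emit the single character as its own token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


def tokens_per_character(text, tokens):
    """TPC, assumed here to be (number of tokens) / (number of characters)."""
    return len(tokens) / len(text) if text else 0.0


sentence = "saya makan nasi"
toks = tokenize(sentence)
print(toks)
print(round(tokens_per_character(sentence, toks), 3))
```

In this toy run the spaces are not in the vocabulary, so they come out through the character fallback; a real vocabulary would decide how whitespace and punctuation are handled.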

Abstract

Tokenization constitutes a fundamental stage in Large Language Model (LLM) processing; however, subword-based tokenization methods optimized on English-dominant corpora may produce token fragmentation misaligned with the linguistic structures of Austronesian languages. This study aimed to develop a syllable-based tokenization framework adopting principles from traditional Indonesian scripts (aksara) for regional languages of Indonesia. A syllabic segmentation procedure was constructed based on the logic of abugida writing systems and implemented with a vocabulary of 2,843 tokens extracted from the Indonesian dictionary (KBBI). Evaluation was conducted on the NusaX dataset comprising 1,000 parallel translation samples across 10 regional languages, Indonesian, and English. Analysis employed Token per Character (TPC) ratio and sequence alignment using the Smith-Waterman algorithm. Results demonstrated that syllable-based tokenization yielded consistent TPC values across all regional languages, whereas GPT-2 exhibited an inverse pattern with the lowest TPC for English. Syllable-based tokenization consistently produced higher token sequence similarity scores, with an average increase of approximately 21% compared to GPT-2. These findings confirm that the syllable-based approach more effectively preserves phonological and morphological patterns across related Austronesian languages, offering a linguistically principled foundation for multilingual LLM development.
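The sequence-alignment step can be illustrated with a short Python sketch of the Smith-Waterman local alignment score applied to token sequences rather than characters. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the normalization by the best possible score of the shorter sequence are assumptions made for this example; the summary does not state which parameters or normalization the authors used.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between token sequences a and b.

    Scoring parameters are illustrative assumptions, not the paper's values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best


def normalized_similarity(a, b, match=2, mismatch=-1, gap=-1):
    """Scale the score to [0, 1] by the shorter sequence's perfect score.

    This normalization is a hypothetical choice for the sketch.
    """
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return smith_waterman(a, b, match, mismatch, gap) / (match * shortest)


# Two made-up token sequences sharing the local run ["ma", "kan"].
seq_a = ["sa", "ya", "ma", "kan"]
seq_b = ["ku", "la", "ma", "kan"]
print(smith_waterman(seq_a, seq_b))
print(normalized_similarity(seq_a, seq_b))
```

Because the alignment is local, shared syllable runs between two translations contribute to the score even when the surrounding tokens differ, which is why the choice of tokenizer affects the measured similarity.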

Paper Structure (14 sections, 11 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: The frequency distribution of syllable occurrences in the KBBI corpus follows a power law $p(x) \sim x^{-\alpha}$ with coefficient $\alpha \approx 1.87$. Inset: Syllable rank distribution with $\beta = \frac{1}{\alpha-1} \approx 1.14$.
  • Figure 2: Comparison of tokens per character distributions between the syllable-based tokenization method (left) and GPT-2 tokenization (right) across various languages. The red and blue vertical lines represent the mean and median values of the distributions, respectively.
  • Figure 3: Comparative analysis of token sequence similarity values between the syllable-based method and GPT-2. (a) Scatter plot comparing similarity across language pairs. (b) Heatmap of the difference in similarity between the two tokenization schemes (syllable-based minus GPT-2) for each language pair. Red shading indicates higher similarity values obtained by the syllable-based method.
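
The relation between the two exponents in the Figure 1 caption can be recovered with a standard argument, sketched here on the assumption that the inset plots frequency $x$ against rank $r$. For $\alpha > 1$, the rank of a syllable is proportional to the share of syllables occurring at least as often, so:

$$
P(X \ge x) = \int_x^{\infty} p(t)\,dt \sim x^{-(\alpha-1)}, \qquad r(x) \propto P(X \ge x) \;\Longrightarrow\; x(r) \sim r^{-\frac{1}{\alpha-1}} = r^{-\beta}.
$$

Plugging in the rounded value $\alpha = 1.87$ gives $\beta = 1/0.87 \approx 1.15$, close to the caption's $\beta \approx 1.14$; the small gap presumably comes from rounding $\alpha$ before display.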