Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago
Andhika Bernard Lumbantobing, Hokky Situngkir
TL;DR
The paper tackles the mismatch between English-centric subword tokenization and the phonological and morphological structure of Austronesian languages. It proposes a syllable-based tokenization framework grounded in aksara (traditional Indonesian scripts), formalizing a segmentation procedure $P$ that maps input text to token sequences of the form $U=(\omega,\nu,\kappa)$ over a vocabulary $\Sigma$ of 2,843 tokens. The procedure is a three-phase pipeline, $P = (\phi_{\text{clus}} \circ \phi_{\text{vir}} \circ \phi_{\text{scan}})$, complemented by a character-based fallback. Evaluation on the NusaX-MT corpus uses the Token per Character (TPC) ratio and Smith-Waterman sequence alignment to assess how well cross-linguistic patterns are preserved. Syllable-based tokenization consistently preserves these patterns better than the GPT-2 tokenizer, with token sequence similarity scores about $21\%$ higher on average. The approach yields uniform tokenization behavior across regional Austronesian languages and offers a linguistically principled foundation for multilingual Austronesian LLM development. Future work targets broader language families and integration with historical manuscripts.
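To make the TPC metric concrete, here is a minimal Python sketch. It assumes TPC is the number of tokens divided by the number of characters in the source text, with whitespace counted as characters; the paper may count characters differently. The function name `tokens_per_character` and the hand-segmented syllable tokens are illustrative only and are not produced by the paper's tokenizer.

```python
def tokens_per_character(tokens, text):
    """Token per Character (TPC): token count divided by character count.

    Assumption: every character of `text`, including spaces, is counted.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokens) / len(text)


# Illustrative input: an Indonesian phrase segmented into syllables by hand.
text = "saya makan nasi"
syllable_tokens = ["sa", "ya", "ma", "kan", "na", "si"]

# Character-level tokenization as a baseline (one token per character).
char_tokens = list(text)

tpc_syllable = tokens_per_character(syllable_tokens, text)
tpc_char = tokens_per_character(char_tokens, text)
```

Under these assumptions a lower TPC means each token covers more characters on average; character-level tokenization gives a TPC of exactly 1.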
Abstract
Tokenization constitutes a fundamental stage in Large Language Model (LLM) processing; however, subword-based tokenization methods optimized on English-dominant corpora may produce token fragmentation misaligned with the linguistic structures of Austronesian languages. This study aimed to develop a syllable-based tokenization framework adopting principles from traditional Indonesian scripts (aksara) for regional languages of Indonesia. A syllabic segmentation procedure was constructed based on the logic of abugida writing systems and implemented with a vocabulary of 2,843 tokens extracted from the Indonesian dictionary (KBBI). Evaluation was conducted on the NusaX dataset comprising 1,000 parallel translation samples across 10 regional languages, Indonesian, and English. Analysis employed Token per Character (TPC) ratio and sequence alignment using the Smith-Waterman algorithm. Results demonstrated that syllable-based tokenization yielded consistent TPC values across all regional languages, whereas GPT-2 exhibited an inverse pattern with the lowest TPC for English. Syllable-based tokenization consistently produced higher token sequence similarity scores, with an average increase of approximately 21% compared to GPT-2. These findings confirm that the syllable-based approach more effectively preserves phonological and morphological patterns across related Austronesian languages, offering a linguistically principled foundation for multilingual LLM development.
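The sequence alignment step can be illustrated with a short Python sketch of Smith-Waterman local alignment applied to token sequences instead of characters. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the linear gap penalty are assumptions chosen for illustration; the abstract does not state the parameters or any normalization used in the study, and the two token sequences are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between token sequences a and b.

    Standard Smith-Waterman recurrence with a linear gap penalty.
    Only two rows of the score matrix are kept at a time.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical syllable token sequences for two related languages.
seq_a = ["ma", "kan", "na", "si"]
seq_b = ["ma", "ngan", "na", "si"]

score = smith_waterman(seq_a, seq_b)
self_score = smith_waterman(seq_a, seq_a)
```

With these parameters, two sequences that differ in one token still score highly because the shared tokens on either side of the mismatch stay aligned. That is the kind of cross-linguistic overlap a similarity score built on this alignment is meant to capture.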
