Komodo: A Linguistic Expedition into Indonesia's Regional Languages

Louis Owen; Vishesh Tripathi; Abhay Kumar; Biddwan Ahmed

TL;DR

Komodo-7B-Instruct introduces a 7B-parameter LLM tailored to Indonesian and 11 regional languages, leveraging a bilingual alternate parallel training approach and vocabulary expansion to improve cross-language understanding and tokenization. Built on a LoRA-enabled extension of Llama-2-7B-Base, it combines diverse Indonesian textbooks, colloquial data, and carefully curated data with a robust pretraining (8.79B tokens, 3 epochs) and supervised fine-tuning (5 epochs) regime, plus an efficient training setup on 8 A100 GPUs. Evaluations across a broad suite of discriminative, generative, and translation benchmarks show Komodo-7B-Instruct achieving state-of-the-art or near-state-of-the-art results in Indonesian and regional-language tasks, including cross-language translation directly between English and 11 regional languages (surpassing Google Translate in breadth of coverage). The model also demonstrates strong qualitative performance, empathetic instruction-following, and favorable perplexity in Indonesian, underscoring its potential to bridge educational and linguistic gaps in Indonesia. Future work targets larger models (e.g., 13B) and further language coverage while maintaining efficiency and cross-language capabilities.

Abstract

Recent breakthroughs in Large Language Models (LLMs) have mostly focused on languages with easily available and sufficient resources, such as English. However, there remains a significant gap for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, a family of 7-billion-parameter Large Language Models designed to address this gap by operating seamlessly across Indonesian, English, and 11 regional languages in Indonesia. The family consists of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct stands out by achieving state-of-the-art performance in various tasks and languages, outperforming the benchmarks set by OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it, and many more. This model not only demonstrates superior performance in both language-specific and overall assessments but also highlights its capability to excel in linguistic diversity. Our commitment to advancing language models extends beyond well-resourced languages, aiming to bridge the gap for those with limited linguistic assets. Additionally, Komodo-7B-Instruct's better cross-language understanding contributes to addressing educational disparities in Indonesia, offering direct translations from English to 11 regional languages, a significant improvement compared to existing language translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in language models, catering to the linguistic needs of diverse communities.

Paper Structure (47 sections, 8 figures, 4 tables)

Introduction
Dataset
Pretraining & Supervised-Fine-Tuning Data
Benchmarking Datasets
Training and Experimental Setup
Expanding the Vocabulary
Optimizing for Efficiency
Training & Finetuning
Evaluation & Results
Tokenizer Fertility Analysis
Embedding Position Analysis
Downstream Tasks
Baselines
Discriminative Tasks
Generative Tasks
...and 32 more sections

Figures (8)

Figure 1: The Evolution of Komodo-7B-Instruct Language Model. The diagram illustrates the transformation from the Komodo-7B-Base model, initially trained on diverse datasets encompassing various languages, to the refined Komodo-7B-Instruct model through targeted Supervised Fine-Tuning (SFT) on specific tasks and domains. The journey involves strategic pretraining on comprehensive datasets, followed by fine-tuning for enhanced performance and adaptability across a spectrum of language-related challenges.
Figure 2: The left plot represents the initial embedding position of words when they are first randomly initialized, while the right plot shows their updated positions after 3 epochs of pre-training. The noticeable grouping of words from the same class in the right plot indicates effective learning and organization of word relationships during pre-training. These plots are created by utilizing PCA with 2 principal components.
Figure 3: Performance breakdown of all models on NusaX-Senti dataset.
Figure 4: A plot illustrating Komodo-7B-Instruct's adeptness in balancing generative and discriminative tasks, showcasing strong performance across diverse language challenges.
Figure 5: A comparison between Google Translate and Komodo-7B-Instruct.
...and 3 more figures
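
The Figure 2 caption describes projecting word embeddings onto 2 principal components before and after pre-training. A minimal sketch of that kind of projection is given here, assuming plain NumPy and synthetic stand-in vectors; the function name `pca_2d`, the array sizes, and the two artificial word classes are illustrative assumptions, not the paper's actual embeddings or plotting code.

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project row vectors onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered data: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)

# Hypothetical "randomly initialized" embeddings: 20 words, 64 dimensions.
random_init = rng.normal(size=(20, 64))

# Hypothetical "trained" embeddings: two word classes of 10 words each,
# every class shifted along its own axis so that same-class words sit together.
class_offsets = np.repeat(np.eye(64)[:2] * 5.0, 10, axis=0)
trained = rng.normal(scale=0.3, size=(20, 64)) + class_offsets

before_2d = pca_2d(random_init)  # no class structure expected
after_2d = pca_2d(trained)       # the two classes separate along the first component
```

Scatter-plotting `before_2d` and `after_2d` side by side would give a left/right comparison in the spirit of the figure, with same-class words grouping only in the second plot.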
