Komodo: A Linguistic Expedition into Indonesia's Regional Languages
Louis Owen, Vishesh Tripathi, Abhay Kumar, Biddwan Ahmed
TL;DR
Komodo-7B-Instruct is a 7B-parameter LLM tailored to Indonesian and 11 regional languages, leveraging a bilingual alternate parallel training approach and vocabulary expansion to improve cross-language understanding and tokenization. Built on a LoRA-enabled extension of Llama-2-7B-Base, it combines diverse Indonesian textbooks, colloquial data, and other carefully curated data with a pretraining regime (8.79B tokens, 3 epochs) and supervised fine-tuning (5 epochs), trained efficiently on 8 A100 GPUs. Evaluations across a broad suite of discriminative, generative, and translation benchmarks show Komodo-7B-Instruct achieving state-of-the-art or near-state-of-the-art results on Indonesian and regional-language tasks, including direct translation between English and 11 regional languages (surpassing Google Translate in breadth of coverage). The model also demonstrates strong qualitative performance, empathetic instruction-following, and favorable perplexity in Indonesian, underscoring its potential to bridge educational and linguistic gaps in Indonesia. Future work targets larger models (e.g., 13B) and broader language coverage while maintaining efficiency and cross-language capabilities.
Abstract
Recent breakthroughs in Large Language Models (LLMs) have mostly focused on languages with abundant, easily available resources, such as English. However, a significant gap remains for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, a family of 7-billion-parameter Large Language Models designed to address this gap by operating seamlessly across Indonesian, English, and 11 regional languages of Indonesia. The family consists of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct achieves state-of-the-art performance across various tasks and languages, outperforming OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it, and many more. The model demonstrates superior performance in both language-specific and overall assessments, highlighting its ability to handle linguistic diversity. Our commitment to advancing language models extends beyond well-resourced languages, aiming to bridge the gap for those with limited linguistic assets. Additionally, Komodo-7B-Instruct's stronger cross-language understanding helps address educational disparities in Indonesia by offering direct translation from English to 11 regional languages, a significant improvement over existing translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in language models, catering to the linguistic needs of diverse communities.
