Krutrim LLM: Multilingual Foundational Model for over a Billion People
Aditya Kallappa, Palash Kamble, Abhinav Ravi, Akshat Patidar, Vinayak Dhruv, Deepak Kumar, Raghav Awasthi, Arveti Manjunath, Himanshu Gupta, Shubham Agarwal, Kumar Ashish, Gautam Bhargava, Chandra Khatri
TL;DR
Krutrim LLM is a 7B decoder-only Transformer trained on a 2-trillion-token corpus with a specialized Indic tokenizer to address India's linguistic diversity. The model uses ALiBi for longer contexts and Grouped Query Attention for efficiency. Training combines continual pre-training, instruction tuning, and Direct Preference Optimization to align with human preferences, and a WebRAG pipeline improves factual accuracy. Krutrim achieves competitive English performance while outperforming state-of-the-art Indic models on multiple benchmarks and matching or exceeding Llama-2 on 10 of 16 tasks, demonstrating robust multilingual fluency across dialects and scripts. Integrated with a conversational interface, Krutrim aims to serve over 1 billion users, emphasizing ethical, context-aware, and culturally aware AI that scales across India's diverse linguistic landscape.
Abstract
India is a diverse society that poses unique challenges for developing AI systems, including linguistic diversity, oral traditions, data accessibility, and scalability. Existing foundation models are trained primarily on English, which limits their effectiveness for India's population. Indic languages make up only 1 percent of Common Crawl corpora even though India represents 18 percent of the global population, and this imbalance leads to linguistic bias. Thousands of regional languages, dialects, and code-mixing add further representation challenges because training data for them is sparse. We introduce Krutrim LLM, a multilingual model trained on 2 trillion tokens and designed for India's linguistic landscape. It incorporates the largest known Indic dataset, mitigating data scarcity and supporting balanced performance across dialects. Krutrim outperforms or matches state-of-the-art models on Indic benchmarks while maintaining competitive English performance. Despite using significantly fewer training FLOPs, Krutrim LLM matches or exceeds models like Llama-2 on 10 out of 16 tasks, with an average score of 0.57 versus 0.55. This demonstrates Krutrim's flexible multilingual fluency across diverse linguistic contexts. Krutrim is integrated with real-time search to improve factual accuracy in conversational AI applications, which enhances accessibility for over 1 billion users worldwide. Through intentional design choices that address data imbalances, Krutrim LLM marks meaningful progress toward ethical, globally representative AI models.
