Krutrim LLM: Multilingual Foundational Model for over a Billion People

Aditya Kallappa, Palash Kamble, Abhinav Ravi, Akshat Patidar, Vinayak Dhruv, Deepak Kumar, Raghav Awasthi, Arveti Manjunath, Himanshu Gupta, Shubham Agarwal, Kumar Ashish, Gautam Bhargava, Chandra Khatri

TL;DR

Krutrim LLM is a 7B decoder-only Transformer trained on a 2-trillion-token corpus with a specialized Indic tokenizer to address India's linguistic diversity. The model uses ALiBi for longer contexts and Grouped Query Attention for efficiency. Training combines continual pre-training, instruction tuning, and Direct Preference Optimization to align with human preferences, and a WebRAG pipeline supports factual accuracy. The model achieves competitive English performance, outperforms state-of-the-art Indic models on multiple benchmarks, and nearly matches or exceeds Llama-2 on several tasks, demonstrating robust multilingual fluency across dialects and scripts. Integrated with a conversational interface, Krutrim aims to serve over 1 billion users, emphasizing ethical, context-aware, and culturally aware AI that scales across India's diverse linguistic landscape.

Abstract

India is a diverse society with unique challenges in developing AI systems, including linguistic diversity, oral traditions, data accessibility, and scalability. Existing foundation models are primarily trained on English, limiting their effectiveness for India's population. Indic languages comprise only 1 percent of Common Crawl corpora despite India representing 18 percent of the global population, leading to linguistic biases. Thousands of regional languages, dialects, and code mixing create additional representation challenges due to sparse training data. We introduce Krutrim LLM, a multilingual model trained on 2 trillion tokens and designed for India's linguistic landscape. It incorporates the largest known Indic dataset, mitigating data scarcity and ensuring balanced performance across dialects. Krutrim outperforms or matches state-of-the-art models on Indic benchmarks while maintaining competitive English performance. Despite using significantly fewer training FLOPs, Krutrim LLM matches or exceeds models like Llama-2 on 10 out of 16 tasks, with an average score of 0.57 versus 0.55. This demonstrates Krutrim's flexible multilingual fluency across diverse linguistic contexts. Krutrim is integrated with real-time search to improve factual accuracy in conversational AI applications, which enhances accessibility for over 1 billion users worldwide. Through intentional design choices addressing data imbalances, Krutrim LLM signifies meaningful progress in building ethical, globally representative AI models.

Paper Structure

This paper contains 35 sections, 16 figures, and 9 tables.

Figures (16)

  • Figure 1: Pre-training data sources.
  • Figure 2: Pre-training loss.
  • Figure 3: SFT loss.
  • Figure 4: UMAP projection of embeddings for randomly sampled tasks across 4 categories. (a) After the pre-training stage, the embedding projections of the data points are intermixed and the tasks are not separable. (b) The Krutrim SFT model has a better understanding of tasks and separates them across all task categories.
  • Figure 5: UMAP plots for (a) the Llama2 7B SFT Chat model and (b) the Krutrim model. Comparing the embedding projections shows that Krutrim separates tasks better than the Llama2 7B SFT model across all tasks, and substantially better in categories like "creative writing".
  • ...and 11 more figures