IndicEval-XL: Bridging Linguistic Diversity in Code Generation Across Indic Languages

Ujjwal Singh, Aditi Sharma, Nikhil Gupta, Deepakshi, Vivek Kumar Jha

TL;DR

IndicEval-XL addresses the gap in multilingual code-generation benchmarks by introducing a cross-lingual, open benchmark covering 6 Indic languages and 12 programming languages. It uses CodeBERTScore and BERTScore for semantic alignment, with rigorous translation and quality-control pipelines to ensure high-quality parallel NL-PL pairs. The study analyzes three models (Gemini 1.5, Gemini 2.0, LLaMA 7B) and reveals language-specific strengths and weaknesses, notably lower performance for Sanskrit due to data scarcity, and criticizes pass@k as an insufficient metric for small models. Overall, IndicEval-XL advances inclusive evaluation of code generation across diverse languages and supports future improvements in metrics, data resources, and model adaptation.
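To make the remark about pass@k concrete, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them pass all tests. It is an illustration only, not the benchmark's evaluation code: the function names and the per-problem pass counts are hypothetical, and reading the criticism as "a binary execution metric collapses toward zero when a model rarely produces a fully correct program" is an assumption about what the summary means.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples passing all tests, k: budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(pass_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given per-problem pass counts."""
    return sum(pass_at_k(n, c, k) for c in pass_counts) / len(pass_counts)


# Hypothetical pass counts out of n = 10 samples on four problems.
# A model that almost never produces a fully passing program scores
# near zero, however close its failing outputs were to correct.
small_model_counts = [0, 0, 0, 1]
print(mean_pass_at_k(small_model_counts, n=10, k=1))
```

Under this reading, a similarity-based score such as CodeBERTScore can still separate outputs that all fail the tests, which is consistent with the summary's emphasis on semantic-alignment metrics.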

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation from natural language prompts, revolutionizing software development workflows. As we advance towards agent-based development paradigms, these models form the cornerstone of next-generation software development lifecycles. However, current benchmarks for evaluating multilingual code generation capabilities are predominantly English-centric, limiting their applicability across the global developer community. To address this limitation, we present IndicEval-XL, a comprehensive benchmark for code generation that incorporates 6 major Indic languages, collectively spoken by approximately 14% of the world's population. Our benchmark bridges these languages with 12 programming languages, creating a robust evaluation framework. This work is particularly significant given that India accounts for one-eighth of the global population and that Indic languages play a crucial role in Indian society. IndicEval-XL represents a significant step toward expanding linguistic diversity in code generation systems and evaluation frameworks. By developing resources that support multiple languages, we aim to make AI-powered development tools more inclusive and accessible to developers of various linguistic backgrounds. To facilitate further research and development in this direction, we make our dataset and evaluation benchmark publicly available at https://github.com/telekom/IndicEval-XL

Paper Structure

This paper contains 25 sections, 3 figures, 26 tables.

Figures (3)

  • Figure 1: IndicEval-XL Data Creation Illustration. This diagram shows the end-to-end process of creating and validating the IndicEval-XL dataset across all 6 Indic languages and English. In quality check 1, BERTScore is the primary check; minor checks (BLEU > 25 and METEOR > 0.5) are also applied to the back-translated text (using GPT-4) for quality verification.
  • Figure 2: CodeBERTScore for each of the 6 Indic languages across the 12 programming languages. Scores are shown for 3 LLMs: Gemini 1.5, Gemini 2.0 Flash Thinking, and Llama 7B. Languages are ordered randomly in the graph.
  • Figure 3: Translation examples across different natural languages for TypeScript.
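
The quality check described in the Figure 1 caption can be sketched as a simple threshold filter over precomputed back-translation scores. This is a minimal sketch under stated assumptions, not the authors' pipeline: the BLEU and METEOR thresholds come from the caption, while the BERTScore cutoff (`BERTSCORE_MIN`), the class and function names, and the choice to require all three checks at once are hypothetical.

```python
from dataclasses import dataclass

BLEU_MIN = 25.0       # from the caption: BLEU > 25 (0-100 scale)
METEOR_MIN = 0.5      # from the caption: METEOR > 0.5 (0-1 scale)
BERTSCORE_MIN = 0.85  # hypothetical: the caption gives no BERTScore cutoff


@dataclass(frozen=True)
class BackTranslationScores:
    """Scores comparing a back-translated prompt with the English original."""
    bertscore_f1: float
    bleu: float
    meteor: float


def passes_quality_check(scores: BackTranslationScores,
                         bertscore_min: float = BERTSCORE_MIN) -> bool:
    """Accept only if the major and both minor checks pass (assumed AND)."""
    return (scores.bertscore_f1 >= bertscore_min
            and scores.bleu > BLEU_MIN
            and scores.meteor > METEOR_MIN)


def filter_translations(scored: dict[str, BackTranslationScores]) -> list[str]:
    """Return the ids of translated prompts that pass the quality check."""
    return [pid for pid, s in scored.items() if passes_quality_check(s)]


# Hypothetical scores for three translated prompts.
scored = {
    "task_001": BackTranslationScores(0.93, 41.2, 0.71),
    "task_002": BackTranslationScores(0.91, 19.8, 0.64),  # fails BLEU
    "task_003": BackTranslationScores(0.80, 33.0, 0.58),  # fails BERTScore
}
print(filter_translations(scored))
```

Keeping the thresholds as named constants makes it easy to swap in the values actually used once they are known; prompts that fail the gate would presumably be re-translated or reviewed rather than silently dropped, though the caption does not say.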