ML2B: Multi-Lingual ML Benchmark For AutoML

Ekaterina Trofimova, Zosia Shamina, Maria Selifanova, Artem Zaitsev, Remi Savchuk, Maxim Minets, Daria Ozerova, Emil Sataev, Denis Zuenko, Andrey E. Ustyuzhanin

TL;DR

ML²B addresses the lack of multilingual benchmarks for ML code generation by translating 30 Kaggle competitions into 13 natural languages, spanning tabular, text, and image tasks with structured metadata and human-validated translations. It evaluates models end to end with the AIDE framework and a secure, AST-based code grading pipeline, and finds a substantial 15–45% performance degradation for non-English prompts across models and domains. English is the strongest anchor language; hybrid model configurations improve cross-lingual robustness, but gaps persist in low-resource languages, especially on tabular tasks. By releasing a public benchmark, a translation-validation pipeline, and leakage-mitigation strategies, the work argues for translation-aware planning and language-invariant task abstractions to make multilingual end-to-end ML pipelines more reliable.

Abstract

Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and human-validated translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal a substantial 15–45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are available in our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.

Paper Structure

This paper contains 38 sections, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: Structure of the ML²B benchmark
  • Figure 2: Structure of the code grader
  • Figure 3: Code flow diagram of benchmark submission formats
  • Figure 4: Overall comparison by metrics
  • Figure 5: Overall comparison by domains
  • ...and 9 more figures