MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, Adamenko Pavel, Rodion Levichev, Dmitrii Vorobev, Dmitrii Salikhov, Aidar Valeev, Alena Pestova, Maria Dziuba, Ilseyar Alimova, Artem Zavgorodnev, Aleksandr Medvedev, Stanislav Moiseev, Elena Bruches, Daniil Grebenkin, Roman Derunets, Vikulov Vladimir, Anton Emelyanov, Dmitrii Babaev, Vladimir V. Ivanov, Valentin Malykh, Alena Fenogenova

TL;DR

MERA Code introduces a Russian-language benchmark framework for evaluating code generation across 11 tasks and 8 languages, addressing gaps left by benchmarks centered on natural language or on code alone. It defines a taxonomy of model skills, employs diverse prompts, and uses a standardized, instruction-based evaluation pipeline with public scoring and leaderboard support. The paper describes the task suite (CodeLinterEval, CodeCorrectness, RealCode/RealCodeJava, JavaTestGen, StRuCom, YABLoCo, RuCodeReviewer, UnitTests, ruHumanEval, ruCodeEval) and reports cross-model results, highlighting language robustness and model specialization. While acknowledging limitations in representativeness and evaluation granularity, MERA Code offers a reproducible, extensible platform to advance practical, multilingual software-engineering AI assessment and development.

Abstract

Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.

Paper Structure

This paper contains 48 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Taxonomy of MERA Code, encompassing the four foundational skills (Perception, Knowledge, Reasoning, Generation) that a model uses to address specific tasks. A detailed explanation of each skill can be found in the taxonomy appendix.
  • Figure 2: The user path for the submission process on the MERA Code evaluation platform.