Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Nikita Martynov, Anastasia Mordasheva, Dmitriy Gorbetskiy, Danil Astafurov, Ulyana Isaeva, Elina Basyrova, Sergey Skachkov, Victoria Berestova, Nikolay Ivanov, Valeriia Zanina, Alena Fenogenova

TL;DR

POLLUX delivers an open-source framework for evaluating Russian-language LLMs through a granular, criteria-driven approach. It combines a 35-task Generative Taxonomy with 66 criteria (13 core, plus domain- and task-specific extensions) and expert-annotated prompts to enable interpretable assessment, complemented by LLM-as-a-Judge evaluators (7B and 32B) trained to mimic expert judgments. Across Zero-Shot and Human Dev settings, POLLUX-32B shows strong alignment with human evaluators (up to ~0.73 Spearman), supporting scalable side-by-side evaluation and reducing reliance on costly human comparisons. The work also discusses data generation, cultural specificity for Russian, and environmental considerations, and it provides open access to benchmarks, prompts, and judge models to advance research in language-generation evaluation.

Abstract

We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.
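
As an illustration of how agreement between a judge model and human experts can be quantified, the following is a minimal sketch of Spearman rank correlation in plain Python. It is not the authors' evaluation code: the function names `average_ranks` and `spearman` and the example scores are hypothetical, and the 0 to 2 score scale is an assumption made only for the example.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-response scores on one criterion (assumed 0-2 scale).
human_scores = [0, 1, 2, 2, 1]
judge_scores = [0, 1, 2, 1, 1]
print(f"Spearman correlation: {spearman(human_scores, judge_scores):.3f}")
```

Ranking with averaged ties matters here because criterion scores take only a few discrete values, so ties are the common case rather than the exception.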

Paper Structure

This paper contains 22 sections, 1 equation, 12 figures, 11 tables.

Figures (12)

  • Figure 1: POLLUX overview and rounded statistics before filtering: benchmark characteristics (tasks and criteria), information about the experts involved in creating the data, an overview of the LLM-as-a-Judge models, and the synthetic data used to train them.
  • Figure 2: The POLLUX generative taxonomy of tasks. The labeled figures highlighted in bright colors are the 35 major task groups. Each task group is annotated with its corresponding expert panels. The sections within task types schematically illustrate the depth of decomposition within each taxon.
  • Figure 3: Names and numbers of language aspects studied in the POLLUX benchmark.
  • Figure 4: Survey participant gender distribution. The gender distribution among the benchmark's creators suggests a positive trend towards gender diversity and inclusivity in the field.
  • Figure 5: Survey participant age distribution. The substantial representation of the 25–34 age group highlights the active involvement of professionals who are likely combining fresh academic knowledge with practical experience. The diversity across age groups also shows a collaborative environment with varying levels of experience.
  • ...and 7 more figures