Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Nikita Martynov; Anastasia Mordasheva; Dmitriy Gorbetskiy; Danil Astafurov; Ulyana Isaeva; Elina Basyrova; Sergey Skachkov; Victoria Berestova; Nikolay Ivanov; Valeriia Zanina; Alena Fenogenova

TL;DR

POLLUX is an open-source framework for evaluating Russian-language LLMs through a granular, criteria-driven approach. It combines a 35-task Generative Taxonomy with 66 criteria (13 core, plus domain- and task-specific extensions) and expert-annotated prompts to enable interpretable assessment. LLM-as-a-Judge evaluators (7B and 32B), trained to mimic expert judgments, complement the benchmark. Across Zero-Shot and Human Dev settings, POLLUX-32B aligns closely with human evaluators (Spearman correlation of up to ~0.73), which supports scalable evaluation and reduces reliance on costly side-by-side human comparisons. The work also discusses data generation, cultural specificity for Russian, and environmental considerations, and it provides open access to benchmarks, prompts, and judge models to advance research in language-generation evaluation.

Abstract

We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.
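The agreement between judge models and human experts is summarized above as a Spearman correlation. As a minimal sketch of how such a figure can be computed, the snippet below ranks two score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names, the 0-2 score scale, and the example scores are assumptions made for illustration; they are not taken from the POLLUX codebase or data.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for six responses on one criterion (made-up numbers).
expert_scores = [0, 1, 2, 2, 1, 0]
judge_scores = [0, 1, 2, 1, 1, 0]

rho = spearman(expert_scores, judge_scores)
print(f"Spearman rho between judge and expert scores: {rho:.3f}")
```

Rank correlation suits this comparison because criterion scores are ordinal: it rewards a judge for ordering responses the same way experts do, without requiring the two to use the scale identically.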
