DOoM: Difficult Olympiads of Math

Ilya Kuleshov, Ilin Pavel, Nikolay Kompanets, Ksenia Sycheva, Aleksandr Nikolich

TL;DR

DOoM is an open-source, Russian-language benchmark for evaluating multi-step mathematical and physical reasoning in language models, with tasks ranging from school level to Olympiad level. It comprises two datasets, RussianMath and RussianPhysics, sourced from textbooks and olympiad archives, with verifiable solutions and a binary scoring scheme. Initial results show that math tasks are generally easier than physics tasks. The best overall result was achieved by Gemini 2.5 Pro, which scored $0.874$ on math and $0.582$ on physics. The number of reasoning tokens generated correlates strongly and positively with performance ($r = 0.707$, $p < 0.001$), while processing speed has a weaker association. The benchmark aims to catalyze research on reasoning in Russian and to expand in scope through community collaboration.

Abstract

This paper introduces DOoM, a new open-source benchmark designed to assess the capabilities of language models in solving mathematics and physics problems in Russian. The benchmark includes problems of varying difficulty, ranging from school-level tasks to university Olympiad and entrance exam questions. We discuss the motivation behind its creation, describe the dataset's structure and evaluation methodology, and present initial results from testing various models. Analysis of the results shows a correlation between model performance and the number of tokens used, and highlights differences in performance between mathematics and physics tasks.

Paper Structure

This paper contains 14 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Examples of mathematics problems from the DOoM dataset, arranged in order of increasing difficulty from left to right: left — Grade 8 school problem; middle — HSE Olympiad, Grade 10; right — MSU Olympiad, Grade 11. All examples are translated from Russian.
  • Figure 2: Correlation between number of generated tokens and benchmark performance. The strong positive relationship (Pearson's $r = 0.707$, $p < 0.001$) indicates that more extensive reasoning is a critical factor for success.
  • Figure 3: Correlation between inference speed (tokens/second) and performance. The moderate positive relationship ($r = 0.531$) suggests a trade-off between reasoning thoroughness and computational efficiency.
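
The captions for Figures 2 and 3 report Pearson correlation coefficients between per-model statistics (generated tokens, inference speed) and benchmark score. As a rough illustration of how such a coefficient is computed, here is a minimal sketch in plain Python. The function name `pearson_r` and the toy numbers are hypothetical and are not taken from the paper; the authors' actual analysis code is not shown in this section.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy data, NOT results from the paper:
    # average generated tokens per model and the matching benchmark scores.
    tokens = [1000, 2500, 4000, 8000]
    scores = [0.20, 0.35, 0.50, 0.60]
    print(pearson_r(tokens, scores))
```

This sketch returns only the coefficient. A significance level such as the reported $p < 0.001$ requires an additional test; `scipy.stats.pearsonr`, for example, returns both the statistic and a p-value.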