DOoM: Difficult Olympiads of Math
Ilya Kuleshov, Ilin Pavel, Nikolay Kompanets, Ksenia Sycheva, Aleksandr Nikolich
TL;DR
DOoM is an open-source, Russian-language benchmark for evaluating multi-step mathematical and physical reasoning in language models, with tasks ranging from school level to Olympiad level. It comprises two datasets, RussianMath and RussianPhysics, sourced from textbooks and olympiad archives, with verifiable solutions and a binary scoring scheme. In initial results, models generally score higher on math than on physics; Gemini 2.5 Pro achieved the best overall scores, with $0.874$ on Math and $0.582$ on Physics. Performance correlates strongly and positively with the number of reasoning tokens generated ($r = 0.707$, $p < 0.001$), while processing speed shows a weaker association. The benchmark aims to catalyze research on reasoning in Russian and to expand in scope through community collaboration.
Abstract
This paper introduces DOoM, a new open-source benchmark designed to assess the capabilities of language models in solving mathematics and physics problems in Russian. The benchmark includes problems of varying difficulty, ranging from school-level tasks to university Olympiad and entrance exam questions. We discuss the motivation behind its creation, describe the dataset's structure and evaluation methodology, and present initial results from testing various models. Analysis of the results shows a correlation between model performance and the number of tokens used, and highlights differences in performance between mathematics and physics tasks.
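As a rough illustration of the two quantities mentioned above, the following minimal sketch shows how a binary per-task score and a Pearson correlation between token usage and accuracy could be computed. It is not the benchmark's actual evaluation code: the function names `binary_score`, `accuracy`, and `pearson_r` are hypothetical, and the exact-match comparison is an assumption, since the text states only that scoring is binary.

```python
from math import sqrt
from typing import Sequence


def binary_score(predicted: str, reference: str) -> int:
    """Return 1 if the answers match after trimming and lowercasing, else 0.

    Assumption: exact string match stands in for whatever answer
    verification the benchmark really uses.
    """
    return int(predicted.strip().lower() == reference.strip().lower())


def accuracy(scores: Sequence[int]) -> float:
    """Mean of binary scores, i.e. the fraction of tasks solved."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)
```

In this sketch, `pearson_r` would be called with one entry per model: the mean number of tokens a model generated and its `accuracy` on the benchmark. The significance level ($p$-value) is not computed here.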
