Long Input Benchmark for Russian Analysis
Igor Churin, Murat Apishev, Maria Tikhonova, Denis Shevelev, Aydar Bulatov, Yuri Kuratov, Sergej Averkiev, Alena Fenogenova
TL;DR
LIBRA addresses the lack of a Russian long-context evaluation framework by introducing 21 tasks that span four complexity levels and context lengths from 4k to 128k tokens. It combines translations of English datasets, Russian adaptations, and new open data, with human-annotated samples, an open-source pipeline, and a public leaderboard. The study evaluates 12 long-context-capable LLMs and finds that longer contexts can either help or hinder depending on the task type, while supervised fine-tuning often yields advantages over pretraining alone. By providing standardized benchmarks, data, and tooling, LIBRA aims to enable robust, reproducible evaluation of Russian LLMs' long-context understanding and to guide future model development.
Abstract
Recent advancements in Natural Language Processing (NLP) have fostered the development of Large Language Models (LLMs) that can solve an immense variety of tasks. A key aspect of their application is the ability to work with long text documents and to process long sequences of tokens. This has created a demand for proper evaluation of long-context understanding. To address this need for the Russian language, we propose LIBRA (Long Input Benchmark for Russian Analysis), which comprises 21 adapted datasets for thoroughly studying LLMs' ability to understand long texts. The tests are divided into four complexity groups and allow models to be evaluated across context lengths ranging from 4k to 128k tokens. We provide the open-source datasets, codebase, and public leaderboard for LIBRA to guide forthcoming research.
