Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers
Alena Tsanda, Elena Bruches
TL;DR
The paper presents a Russian-language multimodal dataset for automatic summarization of scientific papers that integrates text, tables, and figures across seven domains to capture richer scientific content. It benchmarks two Russian LLMs, GigaChat and YandexGPT, revealing censorship-related and length-related limitations, and shows that YandexGPT often yields summaries of higher semantic quality across multiple metrics. The dataset comprises 420 papers with comprehensive metadata and multimodal content and is publicly available for benchmarking and further research. The work underscores the value of multimodal data for Russian scientific text and sets the stage for broader domain coverage and improvements in multimodal summarization.
Abstract
The paper discusses the creation of a multimodal dataset of Russian-language scientific papers and the evaluation of existing language models on the task of automatic text summarization. A distinguishing feature of the dataset is that it is multimodal: it includes texts, tables, and figures. The paper presents the results of experiments with two language models: GigaChat from Sber and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available at https://github.com/iis-research-team/summarization-dataset.
