Multimodal Evaluation of Russian-language Architectures

Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova

TL;DR

MERA Multi addresses the lack of Russian multimodal benchmarks by introducing an open, instruction-based framework that evaluates 18 tasks across text, image, audio, and video. It provides a unified skill taxonomy, a standardized evaluation pipeline, and data-protection mechanisms (watermarking, MSMIA leakage detection, licensing) to support fair, culturally grounded assessment. The paper reports baseline results for open- and closed-source models and establishes a submission platform with automated scoring and leaderboards, highlighting strengths of omni-models and revealing gaps in OCR, temporal reasoning, and ethical/video understanding. By focusing on Russian cultural and linguistic nuances, MERA Multi offers a replicable blueprint for culturally aware multimodal benchmarks in Slavic languages and beyond, with practical implications for evaluating and advancing multilingual multimodal AI systems.

Abstract

Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce MERA Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses the default text modality together with image, audio, and video, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.

Paper Structure

This paper contains 131 sections, 7 equations, 2 figures, 22 tables.

Figures (2)

  • Figure 1: Overview of MERA Multi. The benchmark unites multimodal evaluation, taxonomy-based skill assessment, and data leakage protection across 18 tasks covering (default) text, image, audio, and video modalities. It employs standardized block-prompting and compound scoring, and integrates methods for multimodal content protection, forming a transparent and robust methodology for culturally grounded multimodal evaluation in Russian.
  • Figure 2: Effects of different prompt formulations for each dataset, relative to the baseline prompt (0). Each dataset has ten prompt formulations, one of which serves as the baseline category, hence nine bars per dataset. Red bars mark effects that are statistically significant at the 95% confidence level.