MAPS: A Multilingual Benchmark for Agent Performance and Security

Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Trisha Starostina, Kelly Marchisio, Seraphina Goldfarb-Tarrant, Roman Vainshtein

TL;DR

MAPS introduces the first standardized multilingual benchmark suite for agentic AI. It extends four established English benchmarks to $12$ languages (English plus eleven translations), covering $805$ tasks and $9{,}660$ language-specific instances. It demonstrates that multilingual degradation occurs in both performance and safety, particularly in language-heavy tasks, and shows that a hybrid translation pipeline with human verification can mitigate but not erase these gaps. The work highlights the necessity of language-aware evaluation and provides actionable guidelines for equitable, secure deployment of multilingual agentic systems. By publicly releasing MAPS, the authors enable broader benchmarking and drive progress toward robust, accessible agentic AI across languages.

Abstract

Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI and recent initial efforts toward multilingual interaction, existing benchmarks do not yet provide a comprehensive, multi-domain, security-aware evaluation of multilingual agentic systems. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks: GAIA (real-world tasks), SWE-Bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into eleven diverse languages, resulting in 805 unique tasks and 9,660 total language-specific instances, which enables a systematic analysis of the Multilingual Effect on AI agents' performance and robustness. Empirically, we observe a degradation in both performance and security when transitioning from English to other languages, with severity varying by task and correlating with the amount of translated input. This work establishes the first standardized evaluation framework for multilingual agentic AI, encouraging future research towards equitable, reliable, and accessible agentic AI. The MAPS benchmark suite is publicly available at https://huggingface.co/datasets/Fujitsu-FRE/MAPS
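
The reported instance count is consistent with the task count if the English originals are counted alongside the eleven translated languages. That reading is an assumption here, though it matches the $12$ languages reported in the TL;DR:

$$805 \times (11 + 1) = 9{,}660$$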

Paper Structure

This paper contains 31 sections, 3 equations, 10 figures, 23 tables.

Figures (10)

  • Figure 1: MAPS benchmark suite evaluates LLM-based agents across $12$ languages and $4$ agentic benchmarks covering performance and security.
  • Figure 2: Overview of our multi-stage translation pipeline for agentic benchmark construction. We start with machine translation for structural alignment, followed by LLM-based verification and enhancement. This approach is adapted from [ki2024guiding] but extended with task-specific prompting and fallback mechanisms tailored to the requirements of agentic AI evaluation.
  • Figure 3: Performance of open-source agents across languages on four agentic benchmarks: GAIA, SWE-Bench, MATH, and ASB. Each bar represents the agent's accuracy (or attack success rate in ASB) for a given language, with English shown as the baseline. Error bars indicate the standard deviation across three runs. Performance differences reflect each agent's degradation or resilience in multilingual settings.
  • Figure 4: a) Multilingual Effect as a function of the proportion of translated language tokens in input prompts. Each point represents a benchmark-agent pair, with the Multilingual Effect computed as the average relative degradation in performance or security across non-English languages. The trend suggests a correlation between input translation extent and multilingual vulnerability. b) Two snippets exemplify a low‑translation prompt (MATH) and a high‑translation prompt (GAIA), clarifying the x‑axis percentages in panel (a) and showing how the proportion of natural‑language tokens, rather than task difficulty alone, drives the observed Multilingual Effect.
  • Figure 5: Cross-lingual planning consistency for OpenDeepResearch (ODR) and SWE-Agent. For each agent, we locate tasks solved in English but failed in all other languages, then measure the semantic similarity between the English instruction and each language’s initial planning step using multilingual embeddings. ODR exhibits a strong cross-lingual gap: in 85.7% of cases, its English planning step is more faithful to the instruction than in other languages. In contrast, SWE-Agent is more robust, with English leading in only 25% of cases.
  • ...and 5 more figures
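
The Multilingual Effect described in the Figure 4 caption (average relative degradation across non-English languages) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `multilingual_effect`, the language codes, the sign convention for security metrics (a higher attack success rate counts as degradation), and the example scores are all hypothetical.

```python
from statistics import mean


def multilingual_effect(scores, baseline="en", higher_is_better=True):
    """Average relative degradation of non-baseline languages vs. the baseline.

    scores: dict mapping a language code to a metric value
            (e.g. accuracy, or attack success rate for a security benchmark).
    higher_is_better: True for accuracy-like metrics; False for metrics such
            as attack success rate, where a higher value is worse (assumed
            convention, not taken from the paper).
    A positive result means the other languages are worse than the baseline.
    """
    base = scores[baseline]
    if base == 0:
        raise ValueError("baseline score must be non-zero for a relative change")

    # Flip the sign for lower-is-better metrics so that "worse" is always positive.
    sign = 1.0 if higher_is_better else -1.0
    degradations = [
        sign * (base - value) / base
        for lang, value in scores.items()
        if lang != baseline
    ]
    if not degradations:
        raise ValueError("need at least one non-baseline language")
    return mean(degradations)


# Hypothetical numbers, for illustration only.
accuracy = {"en": 0.50, "de": 0.40, "ja": 0.30}
attack_success_rate = {"en": 0.20, "hi": 0.30}

print(multilingual_effect(accuracy))
print(multilingual_effect(attack_success_rate, higher_is_better=False))
```

Normalizing by the English score makes the effect comparable across benchmarks with different baseline difficulty, which is what allows benchmark-agent pairs to share one axis in a plot like panel (a).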