MAPS: A Multilingual Benchmark for Agent Performance and Security

Omer Hofman; Jonathan Brokman; Oren Rachmil; Shamik Bose; Vikas Pahuja; Toshiya Shimizu; Trisha Starostina; Kelly Marchisio; Seraphina Goldfarb-Tarrant; Roman Vainshtein

TL;DR

MAPS introduces the first standardized multilingual benchmark suite for agentic AI. It extends four established English benchmarks into eleven additional languages (12 including English), compiling 805 tasks into 9,660 language-specific instances. The results show that both performance and safety degrade outside English, particularly on language-heavy tasks, and that a hybrid translation pipeline with human verification can mitigate but not erase these gaps. The work argues for language-aware evaluation and provides actionable guidelines for equitable, secure deployment of multilingual agentic systems. MAPS is publicly released to enable broader benchmarking of agentic AI across languages.

Abstract

Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI and recent initial efforts toward multilingual interaction, existing benchmarks do not yet provide a comprehensive, multi-domain, security-aware evaluation of multilingual agentic systems. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks: GAIA (real-world tasks), SWE-Bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into eleven diverse languages. Together with the English originals, this yields 805 unique tasks and 9,660 total language-specific instances, enabling a systematic analysis of the Multilingual Effect on AI agents' performance and robustness. Empirically, we observe a degradation in both performance and security when transitioning from English to other languages, with severity varying by task and correlating with the amount of translated input. This work establishes the first standardized evaluation framework for multilingual agentic AI, encouraging future research towards equitable, reliable, and accessible agentic AI. The MAPS benchmark suite is publicly available at https://huggingface.co/datasets/Fujitsu-FRE/MAPS
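
The instance count quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming each of the 805 tasks appears once in English and once in each of the eleven target languages. The function and constant names are illustrative and are not part of the MAPS release.

```python
# Counts taken from the abstract.
UNIQUE_TASKS = 805
TRANSLATED_LANGUAGES = 11  # target languages the datasets are translated into


def total_instances(tasks: int, translated: int, include_source: bool = True) -> int:
    """Return the number of language-specific instances.

    Assumption (not stated explicitly in the abstract): the English
    original of every task is counted as one instance alongside its
    translations.
    """
    languages = translated + (1 if include_source else 0)
    return tasks * languages


if __name__ == "__main__":
    print(total_instances(UNIQUE_TASKS, TRANSLATED_LANGUAGES))  # prints 9660
```

Under this assumption the product matches the reported 9,660 instances; counting translations alone would give 8,855, which suggests the English originals are included in the total.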
