X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System

Peng Wang, Ruihan Tao, Qiguang Chen, Mengkang Hu, Libo Qin

TL;DR

This work introduces X-WebAgentBench, a multilingual interactive web benchmark with 14 languages, 2,800 multilingual instructions, and 589,946 products to evaluate planning and interaction of language agents across languages. It benchmarks five LLMs and four baselines, revealing that cross-lingual alignment benefits only advanced models and that translating environments to English helps smaller models, while overall multilingual performance lags English. The study identifies token-cost unfairness, imbalanced action usage, and poor performance on long multilingual interactions, attributing these gaps mainly to language alignment rather than reasoning. The benchmark, data, and findings offer practical guidance for developing truly multilingual agentic systems and highlight key directions for future research, including improved multilingual alignment and effective use of translation tools.

Abstract

Recently, large language model (LLM)-based agents have achieved notable success in interactive environments, attracting significant academic and industrial attention. Despite these advancements, current research predominantly focuses on English scenarios. In reality, there are over 7,000 languages worldwide, all of which demand access to comparable agentic services. Nevertheless, the development of language agents remains inadequate for meeting the diverse requirements of multilingual agentic applications. To fill this gap, we introduce X-WebAgentBench, a novel multilingual agent benchmark in an interactive web environment, which evaluates the planning and interaction performance of language agents across multiple languages, thereby contributing to the advancement of global agent intelligence. Additionally, we assess the performance of various LLMs and cross-lingual alignment methods, examining their effectiveness in enhancing agents. Our findings reveal that even advanced models like GPT-4o, when combined with cross-lingual techniques, fail to achieve satisfactory results. We hope that X-WebAgentBench can serve as a valuable benchmark for multilingual agent scenarios in real-world applications.

Paper Structure

This paper contains 35 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Comparison of GPT-4o performance in English and multilingual settings. The English task scores are derived from yang2023auto, based on the English WebShop benchmark yao2022webshop, while the multilingual task scores are obtained through evaluation on our own benchmark.
  • Figure 2: The construction of X-WebAgentBench includes four stages: (a) Data Preparation, (b) Multilingual Instruction Construction, (c) Multilingual Environment Construction, and (d) Quality Check. This workflow figure refers to M$^3$CoT chen2024m.
  • Figure 3: The distribution of languages and product categories in X-WebAgentBench. Cyan represents the English area, and green represents the multilingual area.
  • Figure 4: Statistics of average input and output tokens for the BaseAgent method with GPT-3.5-turbo.
  • Figure 5: Statistics of action rewards for the BaseAgent method with GPT-3.5-turbo.
  • ...and 6 more figures