Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

Alhasan Mahmood, Samir Abdaljalil, Hasan Kurban

Abstract

Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge's language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, p < 0.001 vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss' κ ≤ 0.231). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.
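
The agreement figure above is a standard Fleiss' κ computed over per-requirement verdicts from the six backbones. For reference, a minimal sketch of that computation is shown below; the function and the toy verdict matrix are illustrative and are not taken from the paper's released code.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) count matrix.

    counts[i, j] = number of raters (here: judge backbones) that assigned
    item i (here: one task requirement) to category j (here: unsatisfied /
    satisfied). Every row must sum to the same number of raters.
    """
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 5 requirements rated by 6 backbones, binary verdicts
# (columns: [unsatisfied, satisfied]). Values are illustrative only.
verdicts = np.array([
    [2, 4],
    [1, 5],
    [3, 3],
    [6, 0],
    [4, 2],
])
print(f"Fleiss' kappa = {fleiss_kappa(verdicts):.3f}")  # ~0.196 for this toy data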

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Backbone-by-language heatmaps averaged over three developer-agent frameworks. Left: Mean task-level requirement satisfaction (%), using a sequential blue colormap where darker shading indicates higher satisfaction. Right: Task Solve Rate (%), using a sequential orange colormap where darker shading indicates higher completion. Each cell displays the numeric value. The strongest backbone varies across languages: GPT-4o leads in English (44.7%), while Gemini leads in Arabic (51.7%) and Hindi (53.2%), showing that evaluation outcomes depend jointly on language and backbone.
  • Figure 2: Overview of the multilingual Agent-as-a-Judge pipeline. DevAI tasks are executed by three developer-agent frameworks whose workspaces, code, and trajectories are passed to a multilingual judge agent. The judge runs a modular pipeline (Plan → Locate → Retrieve → Ask → Judge) in each of five languages under six LLM backbones, producing per-requirement satisfaction verdicts. A skeletal code sketch of this loop follows the figure list.
  • Figure 3: Percentile success rates (%) by language and backbone, averaged over three developer-agent frameworks. Each panel uses a sequential blue colormap (darker = higher fraction of tasks clearing the threshold), except the 100% panel, which uses an orange colormap for Task Solve Rate. The four panels show the fraction of tasks reaching 20%, 50%, 70%, and 100% requirement satisfaction. GPT-4o and Gemini are similar at the 20% threshold, but Gemini separates at harder thresholds, especially in Arabic and Hindi. The full numerical table is provided in the appendix; a minimal sketch computing these threshold metrics (and those in Figure 1) also follows the list.
  • Figure 4: Requirement-type sensitivity (%) across languages and backbones. Each of the six panels shows one backbone as a heatmap, with rows corresponding to six requirement categories and columns to five evaluation languages. A sequential blue colormap encodes satisfaction rate: darker cells indicate higher satisfaction within that requirement type and language. Training and Data Loading are consistently the easiest categories, while Evaluation Metrics and Model Construction show the largest cross-language variation under strong backbones. Weaker backbones (DeepSeek, Qwen) remain near floor across all categories. The full numerical table is provided in the appendix.
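
The modular judge loop in Figure 2 can be read as a simple composition of stages run once per (language, backbone) pair. The skeleton below is an illustrative sketch only: the `Task`/`Requirement` types, the `read_evidence` callback, the prompt-template keys, and the stub backbone are all hypothetical stand-ins, not the paper's released pipeline.

```python
from dataclasses import dataclass

LANGUAGES = ["en", "ar", "tr", "zh", "hi"]  # English, Arabic, Turkish, Chinese, Hindi

@dataclass
class Requirement:
    rid: str
    text: str

@dataclass
class Task:
    description: str
    requirements: list

class StubBackbone:
    """Stand-in for an LLM backbone; a real judge would call a model API here."""
    def complete(self, prompt: str) -> str:
        return f"[stub response to: {prompt[:30]}...]"

def judge_task(task, read_evidence, backbone, prompts):
    """One Plan -> Locate -> Retrieve -> Ask -> Judge pass over a task.

    `prompts` holds the localized judge-side instruction templates for a
    single language; `read_evidence` maps a locate answer to workspace
    contents. Both interfaces are assumptions made for illustration.
    """
    verdicts = {}
    plan = backbone.complete(prompts["plan"].format(task=task.description))            # Plan
    for req in task.requirements:
        where = backbone.complete(prompts["locate"].format(req=req.text, plan=plan))   # Locate
        evidence = read_evidence(where)                                                # Retrieve
        answers = backbone.complete(
            prompts["ask"].format(req=req.text, evidence=evidence))                    # Ask
        raw = backbone.complete(
            prompts["judge"].format(req=req.text, answers=answers))                    # Judge
        verdicts[req.rid] = raw.lower().startswith("satisfied")
    return verdicts

# Usage with placeholder templates (the real templates are per-language):
prompts = {k: "..." for k in ("plan", "locate", "ask", "judge")}
task = Task("Train an image classifier", [Requirement("R1", "Accuracy is logged per epoch")])
print(judge_task(task, lambda loc: "stub workspace contents", StubBackbone(), prompts))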
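
The quantities plotted in Figures 1 and 3 can all be derived from one per-requirement verdict table: task-level satisfaction is the mean fraction of satisfied requirements per task, Task Solve Rate is the share of tasks with every requirement satisfied, and the percentile panels threshold the per-task fraction. A minimal pandas sketch follows; the column names and the tiny example table are assumptions for illustration, not the released data schema.

```python
import pandas as pd

# Illustrative schema: one row per (language, backbone, task, requirement) verdict.
df = pd.DataFrame({
    "language":  ["en", "en", "en", "en"],
    "backbone":  ["gpt-4o"] * 4,
    "task":      ["t1", "t1", "t2", "t2"],
    "satisfied": [True, False, True, True],
})

# Per-task fraction of satisfied requirements.
per_task = (
    df.groupby(["language", "backbone", "task"])["satisfied"]
      .mean()
      .rename("satisfaction")
      .reset_index()
)
cell = per_task.groupby(["language", "backbone"])["satisfaction"]

# Figure 1 (left): mean task-level requirement satisfaction, in %.
mean_satisfaction = cell.mean() * 100

# Figure 1 (right): Task Solve Rate = share of tasks with all requirements satisfied.
solve_rate = cell.apply(lambda s: (s == 1.0).mean()) * 100

# Figure 3: fraction of tasks clearing each satisfaction threshold. The 100%
# panel is read as "all requirements satisfied" (>=), which equals Task Solve Rate.
thresholds = {f"{t}%": cell.apply(lambda s, t=t: (s * 100 >= t).mean()) * 100
              for t in (20, 50, 70, 100)}

print(mean_satisfaction, solve_rate, thresholds["70%"], sep="\n")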