Searching for Difficult-to-Translate Test Examples at Scale

Wenda Xu; Vilém Zouhar; Parker Riley; Mara Finkelstein; Markus Freitag; Daniel Deutsch

TL;DR

This work addresses the need for scalable, difficult evaluation data in NLP by framing the search for challenging topics as a Best Arm Identification problem in a multi-armed bandit setting, where each topic $t$ is an arm and each pull costs one unit of budget per sample. The difficulty of an example $x$ is estimated as $d_x = 100 - \mathrm{QE}$, averaged across models, which gives the topic's expected difficulty $\mathbb{E}[d_x \mid x \sim t]$; the objective is to identify the top-$k$ topics under a budget $B$. The authors construct a hierarchical, Internet-grounded data pipeline to generate samples from ~3.2k topics and evaluate multiple bandit strategies, finding that $\epsilon$-greedy substantially outperforms brute-force search and approaches oracle performance with only a small fraction of the maximum per-topic sampling. For translation from English into Czech, Chinese, German, and Ukrainian, using MT models such as Gemma 3 and Gemini 2.5 Pro plus Google Translate, the discovered topics are often more difficult than standard benchmarks such as WMT and FLORES, illustrating the method's scalability and practical value for robust MT evaluation. The framework generalizes to other NLP tasks and online-learning settings, enabling dynamic, scalable creation of tail-test data aligned with model weaknesses.
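The objective summarized above can be written compactly as follows. This is only one possible formalization: the symbols $T$ (the set of topics), $\mu_t$ (a topic's expected difficulty), $n_t$ (the number of samples drawn from topic $t$), and $S$ (a candidate set of topics) are notation assumed here for illustration and may differ from the paper's own.

```latex
% Assumed notation: T = set of topics, n_t = number of pulls of topic t.
\mu_t = \mathbb{E}\left[ d_x \mid x \sim t \right],
\qquad d_x = 100 - \mathrm{QE}

% Ideal target: the k topics with the highest expected difficulty.
S^{*} = \operatorname*{arg\,max}_{S \subseteq T,\; |S| = k} \; \sum_{t \in S} \mu_t

% Budget constraint: each pull costs 1, so total pulls cannot exceed B.
\sum_{t \in T} n_t \le B
```

Because each $\mu_t$ is unknown and can only be estimated from noisy samples, a search strategy has to decide how to split the $B$ pulls across topics so that its estimated top-$k$ set is as close as possible to $S^{*}$.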

Abstract

NLP models require test data that are sufficiently challenging. The difficulty of an example is linked to the topic it originates from ("seed topic"). The relationship between the topic and the difficulty of its instances is stochastic in nature: an example about a difficult topic can happen to be easy, and vice versa. At the scale of the Internet, there are tens of thousands of potential topics, and finding the most difficult one by drawing and evaluating a large number of examples across all topics is computationally infeasible. We formalize this task and treat it as a multi-armed bandit problem. In this framework, each topic is an "arm," and pulling an arm (at a cost) involves drawing a single example, evaluating it, and measuring its difficulty. The goal is to efficiently identify the most difficult topics within a fixed computational budget. We illustrate the bandit problem setup of finding difficult examples for the task of machine translation. We find that various bandit strategies vastly outperform baseline methods such as brute-force search for the most challenging topics.
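The bandit setup described in the abstract can be illustrated with a minimal $\epsilon$-greedy sketch. This is a generic textbook variant adapted to top-$k$ selection, not the authors' implementation: the function name `epsilon_greedy_top_k`, the `pull` callback, the choice to exploit by sampling uniformly from the current estimated top-$k$, and the simulated topic difficulties are all assumptions made for illustration. In a real pipeline, `pull` would draw one example from a topic, translate it, and return its difficulty.

```python
import random


def epsilon_greedy_top_k(pull, num_topics, k, budget, epsilon=0.1, seed=0):
    """Estimate the k most difficult topics using at most `budget` pulls.

    `pull(topic)` is a hypothetical callback that draws one example from
    the topic and returns its difficulty; each call costs 1 unit of budget.
    """
    rng = random.Random(seed)
    counts = [0] * num_topics    # number of pulls per topic
    means = [0.0] * num_topics   # running mean difficulty per topic

    def current_top_k():
        return sorted(range(num_topics), key=lambda t: means[t], reverse=True)[:k]

    for step in range(budget):
        if step < num_topics:
            topic = step                         # first pass: try each topic once
        elif rng.random() < epsilon:
            topic = rng.randrange(num_topics)    # explore a random topic
        else:
            topic = rng.choice(current_top_k())  # exploit the estimated top-k
        difficulty = pull(topic)
        counts[topic] += 1
        # Incremental update of the running mean.
        means[topic] += (difficulty - means[topic]) / counts[topic]

    return current_top_k(), means, counts


# Simulated stand-in for the real pipeline (made-up difficulty values).
TRUE_MEANS = [10.0, 20.0, 30.0, 80.0, 90.0]
sim_rng = random.Random(42)


def simulated_pull(topic):
    # Noisy difficulty, clipped to the [0, 100] range of 100 - QE.
    return min(100.0, max(0.0, sim_rng.gauss(TRUE_MEANS[topic], 5.0)))


top_topics, est_means, pull_counts = epsilon_greedy_top_k(
    simulated_pull, num_topics=len(TRUE_MEANS), k=2, budget=500
)
print(sorted(top_topics), pull_counts)
```

After the initial pass, most of the budget goes to topics that currently look difficult, while the $\epsilon$ fraction of random pulls guards against a difficult topic being dismissed because of one unluckily easy sample. A brute-force baseline would instead spread the same budget evenly over all topics.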
