Searching for Difficult-to-Translate Test Examples at Scale

Wenda Xu, Vilém Zouhar, Parker Riley, Mara Finkelstein, Markus Freitag, Daniel Deutsch

TL;DR

This work addresses the need for scalable, difficult evaluation data in NLP by framing the search for challenging topics as a Best Arm Identification problem in a multi-armed bandit setting, where each topic $t$ is an arm and each pull costs 1 per sample. Difficulty is estimated via $d_x = 100 - \mathrm{QE}$, where QE is a quality-estimation score, with the average across models giving the topic's expected difficulty $\mathbb{E}[d_x \mid x \sim t]$, and the objective is to identify the top-$k$ topics under budget $B$. The authors construct a hierarchical, Internet-grounded data pipeline to generate samples from ~3.2k topics and evaluate multiple bandit strategies, finding that $\epsilon$-greedy substantially outperforms brute-force search and approaches oracle performance with only a small fraction of the maximum per-topic sampling. Across English-to-Czech, -Chinese, -German, and -Ukrainian translation, using MT models like Gemma 3 and Gemini 2.5 Pro plus Google Translate, the discovered topics often exceed the difficulty of standard benchmarks such as WMT and FLORES, illustrating the method's scalability and practical impact for robust MT evaluation. The framework is generalizable to other NLP tasks and online-learning settings, enabling dynamic, scalable creation of tail-test data aligned with model weaknesses.

Abstract

NLP models require test data that are sufficiently challenging. The difficulty of an example is linked to the topic it originates from ("seed topic"). The relationship between the topic and the difficulty of its instances is stochastic in nature: an example about a difficult topic can happen to be easy, and vice versa. At the scale of the Internet, there are tens of thousands of potential topics, and finding the most difficult one by drawing and evaluating a large number of examples across all topics is computationally infeasible. We formalize this task and treat it as a multi-armed bandit problem. In this framework, each topic is an "arm," and pulling an arm (at a cost) involves drawing a single example, evaluating it, and measuring its difficulty. The goal is to efficiently identify the most difficult topics within a fixed computational budget. We illustrate the bandit problem setup of finding difficult examples for the task of machine translation. We find that various bandit strategies vastly outperform baseline methods like brute-force searching for the most challenging topics.
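To make the bandit framing concrete, the following is a minimal sketch of an $\epsilon$-greedy search for the top-$k$ most difficult topics under a fixed budget. It is an illustration under stated assumptions, not the authors' implementation: the function names (`difficulty`, `epsilon_greedy_top_k`, `make_simulated_pull`), the choice to exploit by pulling a random arm among the current top-$k$ estimates, and the Gaussian-noise simulator standing in for the real sample-and-translate pipeline are all hypothetical.

```python
import random
from statistics import mean


def difficulty(qe_scores):
    """Difficulty of one example: mean of (100 - QE) over the models' QE scores."""
    return mean(100 - q for q in qe_scores)


def epsilon_greedy_top_k(topics, pull, budget, k=1, epsilon=0.1, seed=0):
    """Return up to k topics with the highest estimated mean difficulty.

    topics: list of topic identifiers (the arms).
    pull:   callable, pull(topic) -> difficulty of one sampled example.
            Each call costs 1 unit of budget.
    The exploitation rule (random arm among the current top-k) is an
    assumption of this sketch, not taken from the paper.
    """
    rng = random.Random(seed)
    totals = {t: 0.0 for t in topics}
    counts = {t: 0 for t in topics}

    def estimate(t):
        # Empirical mean difficulty; only called for topics pulled at least once.
        return totals[t] / counts[t]

    for _ in range(budget):
        pulled = [t for t in topics if counts[t] > 0]
        if not pulled or rng.random() < epsilon:
            # Explore: draw from a uniformly random topic.
            topic = rng.choice(topics)
        else:
            # Exploit: draw again from one of the currently hardest topics.
            topic = rng.choice(sorted(pulled, key=estimate, reverse=True)[:k])
        totals[topic] += pull(topic)
        counts[topic] += 1

    pulled = [t for t in topics if counts[t] > 0]
    return sorted(pulled, key=estimate, reverse=True)[:k]


def make_simulated_pull(true_means, noise=5.0, seed=0):
    """Hypothetical stand-in for the real pipeline: noisy draws around a true mean."""
    rng = random.Random(seed)

    def pull(topic):
        # Clip to the [0, 100] range implied by d_x = 100 - QE.
        return min(100.0, max(0.0, rng.gauss(true_means[topic], noise)))

    return pull
```

In a real setting, `pull` would sample a source text from the topic, translate it with each MT system, score the outputs with a QE metric, and return `difficulty(qe_scores)`. The simulator only serves to exercise the search loop, for example by calling `epsilon_greedy_top_k(list(true_means), make_simulated_pull(true_means), budget=2000, k=10)` on a dictionary of synthetic topic means.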

Paper Structure

This paper contains 25 sections, 8 figures, 11 tables, 5 algorithms.

Figures (8)

  • Figure 1: Illustration of our pipeline. Given a large set of all topics, the sampler can draw an example from a topic and estimate its difficulty. The goal of the search algorithm is to find the most difficult topic with as few samplings as possible.
  • Figure 2: Results for algorithms measured with top-1 and top-10 difficulty. All algorithms have the same budget and the cost of a single sampling is 1.
  • Figure 3: Results for algorithms measured with top-10 difficulty on synthetically large $T$. All algorithms have the same budget and the cost of a single difficulty estimation is 1.
  • Figure 4: Distribution of topic difficulty (gray), best fit for Gaussian mixture model distribution with $3$ components (red) used for synthetic scaling, and top-10 empirical oracle (orange).
  • Figure 5: Estimated oracle difficulty for topic sizes using a sample generative process as in \ref{fig:06-scores_across_nodes}. The synthetic generation is more conservative than the real data (at $|T|=3.2$k the top-1 is 36 and top-10 is 25 from \ref{fig:03-algorithms_main}).
  • ...and 3 more figures