TransBench: Benchmarking Machine Translation for Industrial-Scale Applications

Haijun Li, Tianqi Shi, Zifu Shang, Yuxuan Han, Xueyu Zhao, Hao Wang, Yu Qian, Zhiqiang Qian, Linlong Xu, Minghao Wu, Chenyang Lyu, Longyue Wang, Gongbo Tang, Weihua Luo, Zhao Xu, Kaifu Zhang

TL;DR

TransBench addresses the gap between academic MT benchmarks and real-world industrial translation needs by introducing a three-level framework (Basic Linguistic Competence, Domain-Specific Proficiency, and Cultural Adaptation) and a dedicated benchmark for industrial scenarios, starting with international e-commerce. It combines traditional metrics (BLEU, TER, chrF, METEOR) with domain-aware evaluation (Marco-MOS) and culturally sensitive assessments (honorifics, taboo words). The dataset comprises 17,000 professionally translated e-commerce samples across 33 language pairs, plus general-capability robustness data and cultural-fidelity data, with an explicit annotation and data-processing pipeline to support quality and reproducibility. The approach offers actionable paths for diagnosing and improving MT systems in industry, supported by open-source evaluation tools and a framework that can be extended to additional domains such as finance and law.

Abstract

Machine translation (MT) has become indispensable for cross-border communication in globalized industries like e-commerce, finance, and legal services, with recent advancements in large language models (LLMs) significantly enhancing translation quality. However, applying general-purpose MT models to industrial scenarios reveals critical limitations due to domain-specific terminology, cultural nuances, and stylistic conventions absent in generic benchmarks. Existing evaluation frameworks inadequately assess performance in specialized contexts, creating a gap between academic benchmarks and real-world efficacy. To address this, we propose a three-level translation capability framework: (1) Basic Linguistic Competence, (2) Domain-Specific Proficiency, and (3) Cultural Adaptation, emphasizing the need for holistic evaluation across these dimensions. We introduce TransBench, a benchmark tailored for industrial MT, initially targeting international e-commerce with 17,000 professionally translated sentences spanning 4 main scenarios and 33 language pairs. TransBench integrates traditional metrics (BLEU, TER) with Marco-MOS, a domain-specific evaluation model, and provides guidelines for reproducible benchmark construction. Our contributions include: (1) a structured framework for industrial MT evaluation, (2) the first publicly available benchmark for e-commerce translation, (3) novel metrics probing multi-level translation quality, and (4) open-sourced evaluation tools. This work bridges the evaluation gap, enabling researchers and practitioners to systematically assess and enhance MT systems for industry-specific needs.

Paper Structure

This paper contains 38 sections, 4 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: The overall framework of the TransBench benchmark and its evaluation metrics. The benchmark covers multilingual sentences, and the evaluation metrics are fair and comprehensive.
  • Figure 2: Data sample from the e-commerce domain, translated from Chinese to English. The text is sampled from the product listing (product description) category. The source text is the original input text, and the target text is the translation produced by the language expert.
  • Figure 3: Data sample from the e-commerce domain, translated from Spanish to English. The text is sampled from the product customer reviews category. The source text is the original input text, and the target text is the translation produced by the language expert.
  • Figure 4: Sample data for general-capability translation. This sample illustrates the robustness data built with the sentence-level disordered-word pattern: the hacked source text scrambles some words from the source text, and the translation model is expected to maintain the accuracy of its translation.
  • Figure 5: Sample data for general-capability translation. This sample illustrates the word-level terminology-mixture interference method for robustness: the hacked source text rewrites some terms from the source text at the semantic level.
  • ...and 3 more figures
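
The sentence-level disordered-word pattern mentioned in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' pipeline, which is not described in this section. The function name `scramble_words`, the `fraction` and `seed` parameters, and the choice to rotate the selected words by one position are all assumptions made for illustration.

```python
import random


def scramble_words(sentence: str, fraction: float = 0.3, seed: int = 0) -> str:
    """Return a copy of `sentence` with a subset of its words moved around.

    Hypothetical helper: the selection rule and the rotation step are
    illustrative assumptions, not the procedure used to build TransBench.
    """
    words = sentence.split()
    if len(words) < 2:
        # Nothing to reorder in an empty or single-word sentence.
        return sentence

    rng = random.Random(seed)  # local RNG so results are reproducible

    # Choose how many word positions to disturb (at least two, at most all).
    k = min(len(words), max(2, round(len(words) * fraction)))
    positions = sorted(rng.sample(range(len(words)), k))

    # Rotate the selected words by one slot. If the selected words are all
    # distinct, every chosen position ends up holding a different word.
    picked = [words[i] for i in positions]
    rotated = picked[1:] + picked[:1]
    for pos, word in zip(positions, rotated):
        words[pos] = word

    return " ".join(words)


if __name__ == "__main__":
    source = "wireless earbuds with noise cancelling and long battery life"
    print("source:", source)
    print("hacked:", scramble_words(source, fraction=0.4, seed=1))
```

The perturbed sentence keeps exactly the same words as the source, so a reference translation of the original remains a valid target when checking whether an MT system is robust to the reordering.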