IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation

Jiahao Lyu, Pei Fu, Zhenhang Li, Weichao Zeng, Shaojie Zhan, Jiahui Yang, Can Ma, Yu Zhou, Zhenbo Luo, Jian Luan

TL;DR

IMTBench is a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. It supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image.

Abstract

End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present In-image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image. We benchmark strong commercial cascade systems, and both closed- and open-source unified multi-modal models, and observe large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for end-to-end image text translation. We hope IMTBench establishes a standardized benchmark to accelerate progress in this emerging task.
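
To make the idea of cross-modal alignment concrete, the following is a minimal sketch, not the metric defined in the paper. It assumes the text rendered in the translated image has already been read back by some OCR step (not shown), and it scores consistency with the model's textual translation as one minus the normalized character-level edit distance. The function names `levenshtein` and `alignment_score`, and the choice of edit distance itself, are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def alignment_score(model_text: str, rendered_text: str) -> float:
    """Hypothetical consistency score in [0, 1]; 1.0 means identical text.

    model_text:    translation emitted by the model as text
    rendered_text: text read back from the translated image (assumed OCR output)
    """
    a = " ".join(model_text.split())     # collapse whitespace differences
    b = " ".join(rendered_text.split())
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # One wrongly rendered character out of four.
    print(alignment_score("Exit", "Exlt"))
```

A score of this kind is low when a model outputs a correct translation as text but renders something different in the image, which is the failure mode that single-modality metrics cannot detect.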

Paper Structure (20 sections, 6 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Comparison of existing IIMT benchmarks (Segpixels tian2023image, Translatotron-V lan2024translatotron, IIMT30k tian2025exploring, PRIM tian2025prim) and our proposed IMTBench.
  • Figure 2: Overview of the IMTBench dataset construction pipeline. The curation process consists of three main branches. The top branch, Document & Web, focuses on multilingual document translation with structured layouts. The middle branch, Scene, emphasizes instruction-driven editing of scene text in natural images. The bottom branch, PowerPoint, focuses on translation in presentation slide scenarios. All branches converge to form the final IMTBench dataset, which supports comprehensive evaluation of in-image machine translation across diverse scenarios.
  • Figure 3: Data samples of IMTBench, which includes 4 main scenarios, 9 languages, and 2,500 pairs with detailed annotations.
  • Figure 4: Dataset statistics of IMTBench. The balanced coverage of diverse scenarios and languages enables comprehensive evaluation of in-image machine translation systems under varied visual and linguistic conditions.
  • Figure 5: Example illustrating the automatic evaluation metrics in IMTBench. The left column shows the source image and the ground-truth translated image. The three predicted results (from left to right) are generated by GPT-Image, Qwen-Image, and Tencent. For each prediction, we report metric scores and reasoning.
  • ...and 3 more figures