MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning

Zhaopeng Feng, Yupu Liang, Shaosheng Cao, Jiayuan Su, Jiahan Ren, Zhe Xu, Yao Hu, Wenxuan Huang, Jian Wu, Zuozhu Liu

TL;DR

This paper targets Text Image Machine Translation (TIMT) by proposing MT$^{3}$, a multi-task reinforcement learning framework that enables end-to-end TIMT with Multimodal Large Language Models (MLLMs). MT$^{3}$ explicitly decomposes TIMT into text recognition, context-aware reasoning, and translation, guided by a multi-mixed reward and trained via Group Relative Policy Optimization (GRPO). The authors report state-of-the-art results on MIT-10M for English-Chinese and Chinese-English and strong out-of-distribution generalization, including on XHSPost, a new real-world social media TIMT benchmark that they introduce. They also analyze initialization strategies, curriculum learning, and reward design, offering practical guidance for RL-driven TIMT. Overall, the work advances end-to-end TIMT with MLLMs and RL, with the aim of more accurate cross-cultural information access in real-world images and posts.

Abstract

Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.
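
As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of a group of responses sampled for the same prompt by the group's mean and standard deviation. The function name, the example reward values, the `eps` constant, and the choice of the population standard deviation are all assumptions made for illustration; they are not details taken from the paper. In MT$^{3}$ the scalar rewards would come from the multi-mixed reward, which is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one sampled group.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - group_mean) / (group_std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical non-binary rewards for three responses to one prompt.
    example_rewards = [0.2, 0.5, 0.8]
    for reward, advantage in zip(example_rewards,
                                 group_relative_advantages(example_rewards)):
        print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Responses that score above the group average receive a positive advantage and those below receive a negative one, so a fine-grained, non-binary reward still yields a useful learning signal even when no response is perfect.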

Paper Structure

This paper contains 24 sections, 3 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustrative TIMT examples from the XHSPost benchmark. MT$^{3}$ demonstrates superior contextual understanding by correctly translating 'Friends' to '老友记' based on visual context (Top), and accurately interpreting a property advertisement, including price and layout details (Bottom).
  • Figure 2: Impact of multi-task ablation on reward optimization and performance progression. Top row: Progression of Final Reward and individual task rewards (Format, Recognition, Translation). Bottom row: Progression of translation quality metrics (BLEU, chrF++, METEOR).
  • Figure 3: Training dynamics comparing Zero-start RL (MT$^{3}$-7B-Zero) vs. SFT initialization (MT$^{3}$-7B-QVQ-Distill). Left and Center: Average metric score progression on MIT-10M ZH-EN and EN-ZH test sets. Right: Average response length during RL training.
  • Figure 4: Influence of curriculum learning strategies on training dynamics and performance. Left: Average response length dynamics for Shuffle, Ascending (easy-to-hard), and Descending (hard-to-easy) difficulty curricula during training. Right: Final response length and BLEU scores on MIT-10M difficulty splits (Easy, Medium, Hard) for EN-ZH and ZH-EN.
  • Figure 5: Analysis of translation rewards. Left: Kendall and Spearman correlation matrices between different individual metric rewards (chrF++, METEOR, BLEU) and the Mixed Reward, based on final reward. Right: Progression of Translation Reward and Final Reward when optimizing for individual metric reward versus the Mixed Reward.
  • ...and 3 more figures