Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation

Junxin Lu, Tengfei Song, Zhanglin Wu, Pengfei Li, Xiaowei Liang, Hui Yang, Kun Chen, Ning Xie, Yunfei Lu, Jing Zhao, Shiliang Sun, Daimeng Wei

TL;DR

This paper proposes GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. It substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.

Abstract

Text Image Machine Translation (TIMT) aims to translate source-language text embedded in images into a target language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.
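To make the global-local idea concrete, the following is a minimal sketch of how detected text boxes might be merged into local slice regions and how a low-resolution global view might be sized. It is not the authors' implementation: the `Box` type, the proximity rule in `close`, the `gap` threshold, and the `max_side` limit are all hypothetical choices made for illustration, and the text region detector itself is assumed to exist upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned text box in pixel coordinates (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


def close(a: Box, b: Box, gap: int) -> bool:
    """True if the boxes overlap or lie within `gap` pixels of each other."""
    return (a.x0 - gap <= b.x1 and b.x0 - gap <= a.x1
            and a.y0 - gap <= b.y1 and b.y0 - gap <= a.y1)


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both inputs."""
    return Box(min(a.x0, b.x0), min(a.y0, b.y0),
               max(a.x1, b.x1), max(a.y1, b.y1))


def merge_boxes(boxes: list[Box], gap: int = 10) -> list[Box]:
    """Greedily merge nearby boxes until no pair is close (assumed rule),
    then sort the result top-to-bottom, left-to-right."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if close(merged[i], merged[j], gap):
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return sorted(merged, key=lambda b: (b.y0, b.x0))


def global_size(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Size of the downscaled global view, preserving aspect ratio.
    `max_side` is an assumed limit, not a value from the paper."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


# Two boxes on the same text line are merged; the distant one stays separate.
detected = [Box(0, 500, 50, 520), Box(0, 0, 100, 20), Box(105, 0, 200, 20)]
slices = merge_boxes(detected)
```

Each merged box would then be cropped from the full-resolution image and passed to the model alongside the downscaled global view, so that the slices carry fine-grained text while the global view supplies scene-level context.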

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Comparison of (a) cascade methods, (b) traditional end-to-end models, (c) MLLM-based models, and (d) our proposed GLoTran for TIMT. Through a dual visual perception strategy integrating global contextual understanding and local textual focus, we enable more complete and accurate TIMT.
  • Figure 2: Overview of the proposed GLoTran framework. (a) The high-resolution input image is processed by a text region detector to identify candidate textual regions, which are subsequently sorted, merged, and cropped into localized slices. (b) The global image and the local slices are fed into GLoTran with a structured prompt, enabling global contextual understanding and local textual focus for TIMT.
  • Figure 3: An illustration of the structured prompt design in GLoTran.
  • Figure 4: Overview of the GLoD curation pipeline. The construction pipeline systematically generates high-quality global and local translation pairs through five core stages: conceptualization, data collection and pre-filtering, text region detection and grouping, global-local translation, and quality control.
  • Figure 5: Performance comparison with open-source MLLMs on the multilingual translation task of MTIT6 (Qian et al., 2024).
  • ...and 2 more figures