AnyTrans: Translate AnyText in the Image with Large Scale Models

Zhipeng Qian, Pei Zhang, Baosong Yang, Kai Fan, Yiwei Ma, Derek F. Wong, Xiaoshuai Sun, Rongrong Ji

TL;DR

AnyTrans addresses Translate AnyText in the Image (TATI) by integrating OCR-based text localization, cross-modal translation through vision-language models, and diffusion-based text fusion to preserve visual coherence. It employs a three-step pipeline: detect/recognize text with PPOCR, translate using few-shot LLM prompts that preserve textual ordering with <boxidx> tags, and fuse translated text back into the image via a modified AnyText editor with an Anticipated Box Resize strategy. The approach is training-free and open-source, and it is evaluated on MTIT6, a dataset of six language pairs, showing competitive translation quality and superior visual authenticity compared with commercial tools, as evidenced by both human and GPT-4o assessments. The work also introduces MTIT6 to benchmark TATI and discusses future directions to further unify OCR, translation, and text editing in a cohesive pipeline.
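
The translation step above relies on box-index tags to keep each translated fragment tied to its OCR box. A minimal sketch of that bookkeeping is shown below. The tag format `<box{i}>...</box{i}>`, the prompt wording, and the helper names `tag_boxes`, `build_prompt`, and `parse_tagged` are all assumptions made for illustration; the source only states that `<boxidx>` tags are used, so this is not the authors' implementation.

```python
import re
from typing import Dict, List, Tuple

# Assumed tag format (hypothetical): the i-th OCR box is wrapped as
# <box{i}>text</box{i}>. The source only mentions "<boxidx> tags".
_TAG = re.compile(r"<box(\d+)>(.*?)</box\1>", re.DOTALL)


def tag_boxes(texts: List[str]) -> str:
    """Wrap each recognized text fragment in its box-index tag."""
    return "".join(f"<box{i}>{t}</box{i}>" for i, t in enumerate(texts))


def build_prompt(
    examples: List[Tuple[str, str]],
    source_texts: List[str],
    src_lang: str,
    tgt_lang: str,
) -> str:
    """Assemble a few-shot prompt: instruction, tagged examples, tagged query."""
    parts = [
        f"Translate the text in each box from {src_lang} to {tgt_lang}. "
        "Keep every box tag in the output."
    ]
    for tagged_src, tagged_tgt in examples:
        parts.append(f"Input: {tagged_src}\nOutput: {tagged_tgt}")
    parts.append(f"Input: {tag_boxes(source_texts)}\nOutput:")
    return "\n\n".join(parts)


def parse_tagged(output: str, n_boxes: int) -> Dict[int, str]:
    """Map box index -> translated text, ignoring indices outside the OCR result."""
    result: Dict[int, str] = {}
    for match in _TAG.finditer(output):
        idx = int(match.group(1))
        if 0 <= idx < n_boxes:
            result[idx] = match.group(2).strip()
    return result
```

Because `parse_tagged` keys on the index inside each tag, the mapping back to boxes does not depend on the order in which the model emits the fragments. Boxes missing from the result can be detected by comparing its keys against `range(n_boxes)` before the fusion step.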

Abstract

This paper introduces AnyTrans, an all-encompassing framework for the task of Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.

Paper Structure (35 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Comparison between the traditional image translation pipeline and AnyTrans. AnyTrans combines image information and context for more accurate translation and generates more realistic text.
  • Figure 2: An overview of AnyTrans. Our translation framework is built around three key components: firstly, Text Detection and Recognition utilizing an offline OCR model; secondly, Text Image Translation using (vision) LLMs; and finally, Text Fusion using the modified AnyText.
  • Figure 3: A prompt example from Korean to Chinese. In Chinese, the order of the two words should be switched.
  • Figure 4: Preprocessing for AnyText is crucial for producing accurate and authentic text, especially in scenarios where there is a significant disparity in text length before and after translation.
  • Figure 5: An example of our MTIT6 dataset, which contains position information of the text in the image, corresponding translation information, and corrected translation order.
  • ...and 3 more figures