AnyTrans: Translate AnyText in the Image with Large Scale Models
Zhipeng Qian, Pei Zhang, Baosong Yang, Kai Fan, Yiwei Ma, Derek F. Wong, Xiaoshuai Sun, Rongrong Ji
TL;DR
AnyTrans addresses Translate AnyText in the Image (TATI) by integrating OCR-based text localization, cross-modal translation through vision-language models, and diffusion-based text fusion to preserve visual coherence. It employs a three-step pipeline: detect/recognize text with PPOCR, translate using few-shot LLM prompts that preserve textual ordering with <boxidx> tags, and fuse translated text back into the image via a modified AnyText editor with an Anticipated Box Resize strategy. The approach is training-free and can be built entirely from open-source models. It is evaluated on MTIT6, a dataset of six language pairs, showing competitive translation quality and superior visual authenticity compared with commercial tools, as evidenced by both human and GPT-4o assessments. The work also introduces MTIT6 to benchmark TATI and discusses future directions to further unify OCR, translation, and text editing in a cohesive pipeline.
Abstract
This paper introduces AnyTrans, an all-encompassing framework for the task Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
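The translation step summarized in the TL;DR (few-shot prompting with index tags that keep fragmented texts in order) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the tag spelling `<box0>...</box0>`, the prompt wording, and the helper names `tag_fragments`, `build_prompt`, and `parse_response` are all hypothetical. The sketch only builds the prompt and parses a reply; the actual LLM call is left out.

```python
import re

def tag_fragments(fragments):
    """Wrap each OCR fragment in an index tag (assumed format) so that
    the box each fragment came from can be recovered after translation."""
    return "".join(f"<box{i}>{text}</box{i}>" for i, text in enumerate(fragments))

def build_prompt(fragments, src_lang, tgt_lang, examples=()):
    """Assemble a few-shot prompt. `examples` is a sequence of
    (source_fragments, target_fragments) pairs used as demonstrations."""
    lines = [
        f"Translate the text inside each tag from {src_lang} to {tgt_lang}. "
        "Keep every tag unchanged."
    ]
    for src, tgt in examples:
        lines.append(f"Input: {tag_fragments(src)}")
        lines.append(f"Output: {tag_fragments(tgt)}")
    lines.append(f"Input: {tag_fragments(fragments)}")
    lines.append("Output:")
    return "\n".join(lines)

# Matches <boxN>...</boxN> with the same N on both sides.
_TAG = re.compile(r"<box(\d+)>(.*?)</box\1>", re.DOTALL)

def parse_response(response, n_boxes):
    """Map the model reply back to box order. Boxes the model dropped
    stay None so the caller can decide how to handle them."""
    out = [None] * n_boxes
    for idx, text in _TAG.findall(response):
        i = int(idx)
        if i < n_boxes:
            out[i] = text.strip()
    return out

if __name__ == "__main__":
    prompt = build_prompt(
        ["Hello", "World"], "English", "German",
        examples=[(["Exit"], ["Ausgang"])],
    )
    print(prompt)
    # A reply with reordered tags is still mapped to the right boxes.
    print(parse_response("<box1>Welt</box1> <box0>Hallo</box0>", 2))
```

Sending all fragments in one tagged prompt lets the model see the whole context at once, while the tags let each translated fragment be routed back to its original text box for the fusion step.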
