Ensuring Consistency for In-Image Translation

Chengpeng Fu, Xiaocheng Feng, Yichong Huang, Wenshuai Huo, Baohang Li, Zhirui Zhang, Yunfei Lu, Dandan Tu, Duyu Tang, Hui Wang, Bing Qin, Ting Liu

TL;DR

This work addresses consistency in in-image translation with HCIIT, a two-stage framework. The first stage translates the text in an image using a multimodal multilingual LLM with chain-of-thought prompts; the second stage backfills the image with a diffusion model that keeps the typography style-consistent and preserves the background. A Style Latent Module and a Glyph Latent Module condition the diffusion model, which is trained on 400,000 style-consistent pseudo text-image pairs. Experiments on synthetic and real datasets show better translation quality and higher style and background coherence than online systems and prior methods. The approach is useful for applications such as film poster translation and scene-text translation, where results must be both linguistically accurate and visually harmonious.

Abstract

The in-image machine translation task involves translating text embedded within images and presenting the translated result as an image. Although the task has many applications, such as film poster translation and everyday scene image translation, existing methods frequently neglect consistency throughout the process. We argue that two types of consistency must be upheld: translation consistency and image generation consistency. The former means incorporating image information during translation; the latter means keeping the style of the generated text-image consistent with the original image while preserving background integrity. To meet these requirements, we introduce a novel two-stage framework named HCIIT (High-Consistency In-Image Translation), which performs text-image translation with a multimodal multilingual large language model in the first stage and image backfilling with a diffusion model in the second stage. Chain-of-thought learning is used in the first stage to improve the model's ability to leverage image information during translation. In the second stage, a diffusion model trained for style-consistent text-image generation keeps the text style uniform within the image and preserves background details. A dataset of 400,000 style-consistent pseudo text-image pairs is curated for model training. Results on both curated test sets and authentic image test sets validate the effectiveness of our framework in ensuring consistency and producing high-quality translated images.
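The two-stage flow described in the abstract can be sketched as a small orchestration function. This is an illustrative outline only, not the authors' implementation: `TextRegion`, `build_cot_prompt`, `translate_in_image`, and the `translate_fn` and `backfill_fn` callables are all hypothetical names, the wording of the chain-of-thought steps is an assumption, and the sketch assumes text regions have already been detected by some upstream step the abstract does not describe.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class TextRegion:
    """A detected piece of text in the source image (hypothetical structure)."""
    box: Box
    source_text: str
    translated_text: str = ""


def build_cot_prompt(source_text: str, target_lang: str) -> str:
    """Build a chain-of-thought style prompt.

    The three steps are an assumption made for illustration; the source only
    states that chain-of-thought is used so the model draws on image content.
    """
    return (
        "Step 1: Describe the scene shown in the image.\n"
        f"Step 2: Explain what the text '{source_text}' means in this scene.\n"
        f"Step 3: Translate the text into {target_lang}, "
        "consistent with steps 1 and 2."
    )


def translate_in_image(
    image: Any,
    regions: List[TextRegion],
    target_lang: str,
    translate_fn: Callable[[Any, str], str],
    backfill_fn: Callable[[Any, Box, str], Any],
) -> Any:
    """Run stage 1 (image-aware translation), then stage 2 (backfilling)."""
    # Stage 1: a multimodal LLM (placeholder: translate_fn) sees the image
    # together with the prompt, so the translation can use visual context.
    for region in regions:
        prompt = build_cot_prompt(region.source_text, target_lang)
        region.translated_text = translate_fn(image, prompt)

    # Stage 2: a style-consistent generator (placeholder: backfill_fn) renders
    # each translation into its region while leaving the background intact.
    result = image
    for region in regions:
        result = backfill_fn(result, region.box, region.translated_text)
    return result
```

Passing the two models in as callables keeps the sketch independent of any particular model or library; in practice `translate_fn` would wrap the multimodal LLM and `backfill_fn` the diffusion model. The "Transfer" example from Figure 1 shows why stage 1 receives the image: without scene context, the word could be rendered as "convert" rather than "transit".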

Paper Structure

This paper contains 26 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Comparison of our method, AnyTrans, and online systems in terms of translation consistency and image generation consistency. "Transfer" in the image should be translated as "中转" (transfer station/transit), not "转移/转换" (convert/shift). The style of the target image should be consistent with the source image. More examples can be found in the figure labeled "fig:5" in the source.
  • Figure 2: The process of two-stage in-image translation with consistent style. Our framework consists of two stages, comprising an MMLLM-based TIT and a diffusion-model-based image backfilling.
  • Figure 3: An overview of stage 2. We incorporated a style latent module as a constraint for the Text ControlNet.
  • Figure 4: Case study on En-Zh in-image translation on (a) constructed images and (b) real images with our method, AnyTrans, GoogleTrans and AliyunTrans.
  • Figure 5: Human evaluation and large language model evaluation.
  • ...and 2 more figures