Towards Visual Text Design Transfer Across Languages

Yejin Choi, Jiwan Chung, Sumin Shim, Giyeong Oh, Youngjae Yu

TL;DR

SIGIL is a framework for multimodal style translation that eliminates the need for style descriptions. It achieves superior style consistency and legibility while maintaining visual fidelity, setting itself apart from traditional description-based approaches.

Abstract

Visual text design plays a critical role in conveying themes, emotions, and atmospheres in multimodal formats such as film posters and album covers. Translating these visual and textual elements across languages extends the concept of translation beyond mere text, requiring the adaptation of aesthetic and stylistic features. To address this, we introduce the novel task of multimodal style translation and MuST-Bench, a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems while preserving design intent. Our initial experiments on MuST-Bench reveal that existing visual text generation models struggle with the proposed task because textual descriptions are inadequate for conveying visual design. In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions. SIGIL enhances image generation models through three innovations: a glyph latent for multilingual settings, pretrained VAEs for stable style guidance, and an OCR model with reinforcement learning feedback for optimizing readable character generation. SIGIL outperforms existing baselines by achieving superior style consistency and legibility while maintaining visual fidelity, setting itself apart from traditional description-based approaches. We release MuST-Bench publicly for broader use and exploration at https://huggingface.co/datasets/yejinc/MuST-Bench.

Paper Structure

This paper contains 41 sections, 7 equations, 19 figures, and 7 tables.

Figures (19)

  • Figure 1: SIGIL generates multilingual visual text that follows a prompt, with typography in the style of an input image, for a multilingual film poster. The film poster is from the movie "Red Shoes and the 7 Dwarfs" by Locus Corporation.
  • Figure 2: Overview of our data curation process: (1) For each film poster, we collect multilingual pairs. (2) We manually filter these pairs to retain those with similar typographic styles. (3) Character-level bounding boxes are manually annotated for each image. (4) Finally, we extract the character set pairs.
  • Figure 3: Examples of different language pairs in MuST-Bench. Style translation in MuST-Bench encompasses both (a) instance-level design transfer, illustrated by a snow-covered mountain, and (b) set-level style transfer, featuring a palette of blues, bold and angular fonts, and gradient textures. The film poster is from the movie "Abominable" by DreamWorks Animation, Pearl Studio.
  • Figure 4: SIGIL comprises two main components: the generator and the corrector. (a) The generator combines the style prior and the glyph guide in the VAE representation space to construct the target character, and (b) the corrector uses an off-the-shelf OCR model to optimize the readability of the generated character.
  • Figure 5: Visual comparison of state-of-the-art models and APIs in multilingual visual text generation. All input style images are selected from MuST-Bench. All generated images were created using unseen text content.
  • ...and 14 more figures
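
The corrector in Figure 4 uses an off-the-shelf OCR model to optimize the readability of generated characters. The text here does not say how that feedback is scored, so the following is only a minimal sketch of one possible readability signal: the normalized edit distance between the string an OCR model recognizes and the intended target text. The function names `edit_distance` and `readability_reward` are hypothetical, no OCR model is called (the recognized string is passed in directly), and this scoring rule is an assumption rather than the reward used by SIGIL.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def readability_reward(recognized: str, target: str) -> float:
    """Hypothetical reward in [0, 1]; 1.0 means the OCR output equals the target.

    Assumption: `recognized` is whatever string an OCR model returned for the
    generated image. The OCR call itself is outside this sketch.
    """
    if not recognized and not target:
        return 1.0
    longest = max(len(recognized), len(target))
    return 1.0 - edit_distance(recognized, target) / longest


if __name__ == "__main__":
    # Works on any Unicode script, since Python strings are code-point sequences.
    print(readability_reward("한글", "한글"))
    print(readability_reward("kitten", "sitting"))
```

A score of this kind is bounded and language-agnostic, which is convenient for a multilingual setting, but it treats every character error equally and ignores visual style entirely; it would only cover the legibility side of the trade-off the abstract describes.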