
LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control

Mingyu Kang, Hyein Seo, Yuna Jeong, Junhyeong Park, Yong Suk Choi

TL;DR

LogoDiffuser is a training-free method that synthesizes multilingual logo designs with a multimodal diffusion transformer (MM-DiT). It integrates character structure and visual design by injecting the most informative attention maps.

Abstract

Recent advances in text-to-image generation have been remarkable, but generating multilingual design logos that harmoniously integrate visual and textual elements remains a challenging task. Existing methods often distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. To address these challenges, we propose LogoDiffuser, a training-free method that synthesizes multilingual logo designs using the multimodal diffusion transformer. Instead of using textual prompts, we input the target characters as images, enabling robust character structure control regardless of language. We first analyze the joint attention mechanism to identify core tokens, which are tokens that strongly respond to textual structures. With this observation, our method integrates character structure and visual design by injecting the most informative attention maps. Furthermore, we perform layer-wise aggregation of attention maps to mitigate attention shifts across layers and obtain consistent core tokens. Extensive experiments and user studies demonstrate that our method achieves state-of-the-art performance in multilingual logo generation.
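
The core-token idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes an attention map is given as a plain query-by-key matrix, that a token's "response" is the mean attention it receives over all queries, and that core tokens are simply the top-k tokens by that score. The function names and the parameter `k` are hypothetical.

```python
from typing import List


def token_intensity(attn: List[List[float]]) -> List[float]:
    """Mean attention each key token receives across all query tokens.

    Assumption: attn[i][j] is the attention weight from query i to key j.
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[j] for row in attn) / n_queries for j in range(n_keys)]


def select_core_tokens(attn: List[List[float]], k: int) -> List[int]:
    """Return the indices of the k most strongly attended tokens, in index order.

    Assumption: 'core tokens' are approximated as the top-k tokens by intensity.
    """
    scores = token_intensity(attn)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy 3x4 attention map: tokens 1 and 3 receive the most attention.
toy_attn = [
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.5, 0.1, 0.2],
    [0.1, 0.4, 0.1, 0.4],
]
core = select_core_tokens(toy_attn, k=2)
```

In a real pipeline, only the attention rows or columns belonging to the selected indices would be injected during generation. How the paper scores and thresholds tokens may differ from this top-k assumption.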

Paper Structure (25 sections, 13 figures, 3 tables)


Figures (13)

  • Figure 1: Our logo generation results on the MM-DiT architecture, showing high-quality outputs across diverse style prompts. The corresponding style information used for image generation is provided within each prompt, which is displayed below each image.
  • Figure 2: Overview of the proposed LogoDiffuser pipeline. Given an input glyph image $I_s$ and a design prompt $p$, LogoDiffuser selects core tokens from I2I attention within MM-DiT blocks through Core Token selection, and integrates them into the generation process via I2I attention map Injection to ensure that only structure-relevant signals guide the model. Layer-wise Attention Averaging is additionally applied during the injection stage to stabilize structural consistency across layers. These components preserve character shapes faithfully while producing coherent multilingual logo designs.
  • Figure 3: Identifying core tokens through token-wise attention analysis. During glyph image reconstruction, tokens with stronger attention activations concentrate around stroke contours and structural boundaries of the characters. The bottom plot depicts the attention intensity for all tokens, where the highlighted peaks correspond to the most responsive tokens denoted as core token candidates.
  • Figure 4: Visualization results comparing the full attention maps in the upper row and the core token attention maps in the lower row across three languages. The core token attention highlights character strokes and boundaries, effectively preserving textual structure while enabling prompt-driven stylization.
  • Figure 5: Comparison between per-layer attention and cumulative averaged attention. At step 10, (a) individual layer attention maps attend to different visual regions, while (b) the cumulative average maintains consistent focus on the character structure.
  • ...and 8 more figures
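
The layer-wise averaging described in the Figure 2 and Figure 5 captions can be sketched as a running mean over per-layer attention maps. This is a minimal illustration under stated assumptions, not the paper's code: each layer's map is assumed to be flattened into a list of floats of equal length, and the "cumulative average" at layer l is taken to be the plain arithmetic mean of layers 1 through l. The function name `cumulative_average` is hypothetical.

```python
from typing import List


def cumulative_average(layer_maps: List[List[float]]) -> List[List[float]]:
    """Return, for each layer l, the mean of attention maps from layers 1..l.

    Assumption: all maps are flattened to the same length and weighted equally.
    """
    averaged: List[List[float]] = []
    running: List[float] = []
    for idx, attn_map in enumerate(layer_maps, start=1):
        if idx == 1:
            running = list(attn_map)
        else:
            # Incremental mean update: avg_l = avg_{l-1} + (x_l - avg_{l-1}) / l
            running = [r + (x - r) / idx for r, x in zip(running, attn_map)]
        averaged.append(list(running))
    return averaged


# Toy example: individual layers focus on different positions,
# while the cumulative average settles on a stable distribution.
toy_layers = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
smoothed = cumulative_average(toy_layers)
```

The incremental update avoids storing every earlier layer's map, so only one running map is kept in memory per step.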