FonTS: Text Rendering with Typography and Style Controls

Wenda Shi, Yiren Song, Dengming Zhang, Jiaming Liu, Xingxing Zou

TL;DR

The paper tackles the challenge of achieving precise word-level typography and style control in diffusion-based text rendering. It introduces a two-stage DiT framework that combines Typography Control Fine-tuning, which uses enclosing typography control (ETC) tokens, with a Style Control Adapter that decouples content and style, aided by an HTML-rendered TC-Dataset for word-level supervision. The approach yields superior word-level controllability, font consistency, and style consistency across basic, artistic, and scene text tasks, supported by quantitative, qualitative, and ablation evidence. It also provides two new benchmarks and discusses practical applications and limitations, including language drift and content leakage, with potential for multilingual extension.

Abstract

Visual text rendering is widespread in real-world applications and requires careful font selection and typographic choices. Recent progress in diffusion transformer (DiT)-based text-to-image (T2I) models shows promise in automating these processes. However, these methods still encounter challenges such as inconsistent fonts, style variation, and limited fine-grained control, particularly at the word level. This paper proposes a two-stage DiT-based pipeline to address these problems by enhancing controllability over typography and style in text rendering. We introduce typography control fine-tuning (TC-FT), a parameter-efficient fine-tuning method (updating only the key $5\%$ of parameters) with enclosing typography control tokens (ETC-tokens), which enables precise word-level application of typographic features. To further address style inconsistency in text rendering, we propose a text-agnostic style control adapter (SCA) that prevents content leakage while enhancing style consistency. To implement TC-FT and SCA effectively, we incorporate HTML rendering into the data synthesis pipeline and propose the first word-level controllable dataset. Through comprehensive experiments, we demonstrate the effectiveness of our approach in achieving superior word-level typographic control, font consistency, and style consistency in text rendering tasks. The datasets and models will be available for academic use.
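
As a rough illustration of the HTML-based synthesis idea, the sketch below builds one training pair: a prompt in which a chosen word is enclosed by ETC-tokens, and an HTML snippet that would render the same text with the matching effect. The token spelling follows the figure captions of the paper; the function name `make_pair`, the token-to-tag mapping, the default font, and the snippet layout are assumptions made for illustration and are not the authors' pipeline.

```python
import html

# Assumed mapping from ETC-token names to HTML tags (hypothetical; the
# paper's actual rendering pipeline is not described in this summary).
ETC_TO_HTML = {"b": "b", "i": "i", "u": "u"}


def make_pair(words, target_index, effect, font_family="Arial"):
    """Return (prompt, html_snippet) with one word carrying a typographic effect."""
    if effect not in ETC_TO_HTML:
        raise ValueError(f"unknown effect: {effect!r}")
    if not 0 <= target_index < len(words):
        raise IndexError("target_index out of range")

    tag = ETC_TO_HTML[effect]
    prompt_parts, html_parts = [], []
    for i, word in enumerate(words):
        safe = html.escape(word)  # keep the HTML well-formed for any input word
        if i == target_index:
            # Enclose the word with opening and closing ETC-tokens in the prompt,
            # and with the corresponding HTML tag in the rendered ground truth.
            prompt_parts.append(f"<{effect}*>{word}<\\{effect}*>")
            html_parts.append(f"<{tag}>{safe}</{tag}>")
        else:
            prompt_parts.append(word)
            html_parts.append(safe)

    prompt = " ".join(prompt_parts)
    snippet = (
        f'<p style="font-family: {font_family};">' + " ".join(html_parts) + "</p>"
    )
    return prompt, snippet


prompt, snippet = make_pair(["Find", "your", "path"], 0, "b")
print(prompt)
print(snippet)
```

Rendering the snippet in a headless browser and saving a screenshot would give an image paired with the prompt; that step is omitted here because it depends on tooling the summary does not name.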

Paper Structure

This paper contains 27 sections, 5 equations, 33 figures, 14 tables.

Figures (33)

  • Figure 1: Text rendering with typography and style controls. The desired style is indicated by an image, and the prompt defines the text content, including font and word-level attributes. Modifier tokens (<b*> and <\b*> for bold, <i*> and <\i*> for italic, <u*> and <\u*> for underline) enclose a word to indicate which effect is applied to it. Results show that our method effectively supports (a) word-level control and style control, (b) style control only, and (c) word-level control without compromising the performance of scene text rendering.
  • Figure 2: Framework Overview. In the training phase, (a) illustrates the typography control (TC)-finetuning with paired TC-datasets, and (b) presents the training process for style control adapters (SCA). For inference, (c) shows the integrated operation of the TC-finetuned backbone and the SCA. For simplicity, CLIP is omitted in the figure. The prompt in (a) is "<b*>Find<\b*> your path in Font: <font:3>.", and the prompt in (b) is "Artistic Text: 'Jade', the letters are composed of jade, 3d render, minimalist, high resolution, typography".
  • Figure 3: Comparative weight changes in the transformer backbone during full-parameter fine-tuning. (a) shows that the MM-DiT undergoes twice the weight change of the Single-DiT. (b) indicates that, within the MM-DiT, the Txt-Attn likewise undergoes twice the weight change of the other components.
  • Figure 4: Examples of TC-Dataset featuring two types of TC-Tokens. (a) illustrates the TC-token for various font types. (b) displays the ETC-token with word-level typographic attributes applied to a specific word, including bold, italic, and underline.
  • Figure 5: Qualitative results on the font consistency and word-level controls in basic text rendering compared with baselines.
  • ...and 28 more figures
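
The comparison described in Figure 3 can be approximated with a simple per-group statistic. The sketch below is only an illustration: it assumes the metric is the relative L2 change ‖W′ − W‖ / ‖W‖ averaged over the tensors in each group. The grouping by name prefix, the function names, and the toy weights are hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict


def relative_change(before, after):
    """Relative L2 change between two flat weight lists of equal length."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    norm = math.sqrt(sum(b ** 2 for b in before))
    return diff / norm if norm > 0 else 0.0


def change_by_group(weights_before, weights_after):
    """Average relative change per group, where the group is the part of the
    parameter name before the first '.' (an assumed naming scheme)."""
    per_group = defaultdict(list)
    for name, before in weights_before.items():
        group = name.split(".", 1)[0]
        per_group[group].append(relative_change(before, weights_after[name]))
    return {g: sum(v) / len(v) for g, v in per_group.items()}


# Toy weights (made up): the "mm_dit" tensor moves twice as far as "single_dit".
before = {"mm_dit.txt_attn": [1.0, 0.0], "single_dit.attn": [1.0, 0.0]}
after = {"mm_dit.txt_attn": [1.2, 0.0], "single_dit.attn": [1.1, 0.0]}

changes = change_by_group(before, after)
most_changed = max(changes, key=changes.get)
print(most_changed)
```

Under this reading, the groups that change most during full fine-tuning would be natural candidates for the small set of parameters updated in a parameter-efficient setup.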