Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li

TL;DR

This work tackles long-text image generation by identifying tokenization as the main bottleneck in multimodal autoregressive models and introducing TextBinarizer, a text-focused binary tokenizer, together with LongTextAR, a multimodal AR model that integrates TextBinarizer with a Llama2-based decoder. The approach enables high-fidelity, controllable rendering of dense text within images, including font, size, color, and alignment, and demonstrates superior performance over strong baselines like SD3.5 Large and GPT-4o with DALL-E 3 on long-text tasks. Through extensive datasets and ablations, the paper shows that careful tokenization design and co-training on text-rich data yield robust, layout-aware long-text image synthesis with practical applications such as interleaved PowerPoint-style generation. The results highlight the potential for text-centric generation in documents and presentations, while acknowledging remaining challenges in seamless text-in-natural-image integration and stylistic refinement.

Abstract

Recent advancements in autoregressive and diffusion models have led to strong performance in generating images that contain short scene text. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long-text image generation, addressing a critical gap in existing text-to-image systems, which typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generation quality. To address this, we introduce a novel text-focused binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop LongTextAR, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that LongTextAR significantly outperforms SD3.5 Large [sd3] and GPT-4o [gpt4o] with DALL-E 3 [dalle3] in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, LongTextAR opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.

Paper Structure

This paper contains 38 sections, 5 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Breaking the Limits: Long-Text Image Generation Remains Elusive for Existing Models. State-of-the-art text rendering models, such as TextDiffuser-2 [textdiffuser2] and AnyText [anytext], perform well on short text but struggle with longer passages. Large diffusion models like Stable Diffusion 3.5 Large [sd3] can handle longer text but exhibit lower accuracy. Text recognition on the generated images was performed with the Qwen2-VL [qwen2_vl] model. For this evaluation, we sampled 140 examples from the interleaved Obelics [obelics] dataset with truncation.
  • Figure 2: TextBinarizer implementation details. This approach allows for direct quantization.
  • Figure 3: The main pipeline of LongTextAR. Our trained text-focused tokenizer converts the long-text image into discrete token IDs. A corresponding long-text prompt is generated, and the model is then tasked with predicting the image token IDs based on this long text prompt.
  • Figure 4: Tokenizer reconstruction comparison on long-text data. Compared with the well-trained VQ tokenizer from Chameleon [chameleon], our text-focused tokenizer reconstructs the fine details of letters more faithfully.
  • Figure 5: Controllability experiment. We vary the font type, text color, rotation angle, and alignment of the rendered text.
  • ...and 8 more figures
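
The Figure 2 caption mentions "direct quantization" with a binary tokenizer, but this excerpt does not describe how TextBinarizer does it. As a rough illustration only, the sketch below shows a generic sign-based binary quantizer: each latent channel becomes one bit, and the bits are packed into an integer token ID without any codebook lookup. The function names and the sign threshold are assumptions made for this example, not the authors' implementation.

```python
from typing import List, Sequence


def binarize(latent: Sequence[float]) -> List[int]:
    """Map each latent channel to one bit: 1 if the value is positive, else 0.

    Assumption: a simple sign threshold; the paper's actual rule may differ.
    """
    return [1 if value > 0 else 0 for value in latent]


def bits_to_token_id(bits: Sequence[int]) -> int:
    """Pack bits (most significant first) into a single integer token ID."""
    token_id = 0
    for bit in bits:
        token_id = (token_id << 1) | bit
    return token_id


def token_id_to_bits(token_id: int, num_bits: int) -> List[int]:
    """Inverse of bits_to_token_id: unpack an ID into num_bits bits."""
    return [(token_id >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def quantize_patch(latent: Sequence[float]) -> int:
    """Hypothetical helper: quantize one patch latent directly to a token ID."""
    return bits_to_token_id(binarize(latent))


if __name__ == "__main__":
    patch_latent = [0.3, -1.2, 0.0, 2.5]  # toy 4-channel latent for one patch
    token = quantize_patch(patch_latent)
    print(token, token_id_to_bits(token, num_bits=4))
```

With a d-channel latent, a scheme like this gives an implicit vocabulary of 2^d IDs without storing or searching a learned codebook, which is one common reading of "direct" quantization.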