Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Alex Jinpeng Wang; Linjie Li; Zhengyuan Yang; Lijuan Wang; Min Li

TL;DR

This work tackles long-text image generation by identifying tokenization as the main bottleneck in multimodal autoregressive (AR) models and introducing TextBinarizer, a text-focused binary tokenizer, together with LongTextAR, a multimodal AR model that integrates TextBinarizer with a Llama2-based decoder. The approach enables high-fidelity, controllable rendering of dense text within images, including control over font, size, color, and alignment, and outperforms strong baselines such as SD3.5 Large and GPT-4o with DALL-E 3 on long-text tasks. Through experiments on extensive datasets and ablation studies, the paper shows that careful tokenization design and co-training on text-rich data yield robust, layout-aware long-text image synthesis with practical applications such as interleaved PowerPoint-style generation. The results highlight the potential for text-centric generation in documents and presentations, while acknowledging remaining challenges in seamlessly integrating text into natural images and in stylistic refinement.

Abstract

Recent advancements in autoregressive and diffusion models have led to strong performance in image generation with short scene text. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long-text image generation, addressing a critical gap in existing text-to-image systems that typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generation quality. To address this, we introduce a novel text-focused binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop LongTextAR, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that LongTextAR significantly outperforms SD3.5 Large \cite{sd3} and GPT-4o \cite{gpt4o} with DALL-E 3 \cite{dalle3} in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, LongTextAR opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.
