Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou

TL;DR

Ranni addresses the difficulty that text-to-image diffusion models have with complex prompts by introducing a semantic panel, an intermediate representation in which LLMs arrange the visual concepts parsed from the text. Generation is split into two steps: a text-to-panel step creates the panel, and a panel-to-image step conditions the diffusion model on it, which improves the handling of object counts, spatial relationships, and attribute binding. The system also supports interactive editing of the panel through six unit operations, applied either by adjusting panel elements directly or through LLM-driven chat instructions, and it uses a fully automatic data preparation pipeline for text-to-panel learning. Experiments show better alignment than baselines on prompts that require detailed composition, and multi-round, chat-driven editing demonstrates the approach's potential for interactive design workflows.

Abstract

Existing text-to-image (T2I) diffusion models usually struggle to interpret complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, helping the generator to better follow instructions. The panel is obtained by arranging the visual concepts parsed from the input text with the aid of large language models, and is then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at https://ranni-t2i.github.io/Ranni.
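
The two-stage idea described in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `PanelElement` fields, the function names `text_to_panel`, `rasterize`, and `move`, and the column layout are hypothetical and do not reproduce Ranni's formatting protocol, its LLM stage, or its conditioning mechanism. The sketch only shows why an explicit intermediate panel makes counts and positions easy to check and edit before any image is generated.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass(frozen=True)
class PanelElement:
    """One visual concept in the panel (field names are illustrative)."""
    description: str
    box: Box


def text_to_panel(parsed: List[Tuple[str, int]]) -> List[PanelElement]:
    """Toy stand-in for the LLM-assisted text-to-panel stage.

    Takes already-parsed (description, count) pairs and places each
    instance in its own equal-width column, left to right.
    """
    total = sum(count for _, count in parsed)
    panel: List[PanelElement] = []
    slot = 0
    for description, count in parsed:
        for _ in range(count):
            box = (slot / total, 0.25, (slot + 1) / total, 0.75)
            panel.append(PanelElement(description, box))
            slot += 1
    return panel


def rasterize(panel: List[PanelElement], size: int = 8) -> List[List[int]]:
    """Toy control signal: a size x size grid holding the 1-based index of
    the element covering each cell (0 means background)."""
    grid = [[0] * size for _ in range(size)]
    for index, element in enumerate(panel, start=1):
        x0, y0, x1, y1 = element.box
        # Clamp to the grid so an element moved off-canvas cannot overflow.
        for row in range(max(0, round(y0 * size)), min(size, round(y1 * size))):
            for col in range(max(0, round(x0 * size)), min(size, round(x1 * size))):
                grid[row][col] = index
    return grid


def move(panel: List[PanelElement], index: int, dx: float, dy: float) -> List[PanelElement]:
    """Hypothetical edit: shift one element and return a new panel."""
    x0, y0, x1, y1 = panel[index].box
    moved = replace(panel[index], box=(x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return panel[:index] + [moved] + panel[index + 1:]


# "three red apples and a green pear", assumed to be parsed upstream
panel = text_to_panel([("red apple", 3), ("green pear", 1)])
edited = move(panel, 3, 0.0, 0.25)  # shift the pear downward
```

Here `panel` holds exactly four elements, so the requested quantity is explicit rather than left to the generator, and `edited` differs from `panel` only in the pear's box. In a real system the rasterized grid would be replaced by whatever control signal the denoising network is trained to accept.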

Paper Structure (25 sections, 20 figures, 1 table)

This paper contains 25 sections, 20 figures, 1 table.

Figures (20)

  • Figure 1: Samples generated by Ranni with different interaction manners, including (a) direct generation with accurate prompt following, (b) continuous generation with progressive refinement, and (c) chatting-based generation with text instructions.
  • Figure 2: The framework of Ranni for following painting and editing instructions in a sequential workflow based on the semantic panel. (a) The painting task is divided into LLM-assisted text-to-panel generation and diffusion-based panel-to-image generation. (b) The editing task is conducted by updating the previous semantic panel. (c) The image can be further refined with multi-round compounded editing.
  • Figure 3: Comparison on text-to-image generation between Ranni and representative methods.
  • Figure 4: Comparison on instruction editing between Ranni and representative methods, using unit operation prompts.
  • Figure 5: Samples generated by Ranni on quantity-awareness prompts.
  • ...and 15 more figures