Controllable Text-to-Image Generation with GPT-4

Tianjun Zhang, Yi Zhang, Vibhav Vineet, Neel Joshi, Xin Wang

TL;DR

This work tackles the challenge of precise instruction following in diffusion-based text-to-image generation, particularly for spatial layouts. It introduces Control-GPT, which uses GPT-4 to generate programmatic TikZ sketches and grounding tokens that guide a fine-tuned ControlNet, bridging language and vision for better controllability. To train the system, the authors convert LVIS instance masks into polygons to create polygon-based sketches aligned with COCO captions, yielding a dataset of roughly 120k image-caption-sketch triplets. On the Visor spatial benchmark, Control-GPT achieves state-of-the-art performance, nearly doubling the spatial accuracy of prior methods, and human evaluations confirm improved handling of complex scenes, underscoring the potential of integrating LLMs into computer vision pipelines.

Abstract

Current text-to-image generation models often struggle to follow textual instructions, especially those requiring spatial reasoning. On the other hand, Large Language Models (LLMs), such as GPT-4, have shown remarkable precision in generating code snippets for sketching out text inputs graphically, e.g., via TikZ. In this work, we introduce Control-GPT to guide diffusion-based text-to-image pipelines with programmatic sketches generated by GPT-4, enhancing their ability to follow instructions. Control-GPT works by querying GPT-4 to write TikZ code, and the generated sketches are used as references alongside the text instructions for diffusion models (e.g., ControlNet) to generate photo-realistic images. One major challenge in training our pipeline is the lack of a dataset containing aligned text, images, and sketches. We address the issue by converting instance masks in existing datasets into polygons to mimic the sketches used at test time. As a result, Control-GPT greatly boosts the controllability of image generation. It establishes a new state of the art in spatial arrangement and object positioning, and enhances users' control of object positions, sizes, etc., nearly doubling the accuracy of prior models. Our work, as a first attempt, shows the potential for employing LLMs to enhance performance in computer vision tasks.
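
To make the idea of a "programmatic sketch" concrete, here is a minimal illustration of the kind of TikZ code such a pipeline consumes. It does not call GPT-4 and is not the authors' prompt or output format: the `SketchObject` class, the `layout_to_tikz` function, and the use of labeled rectangles are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SketchObject:
    """One object in a hypothetical layout: a labeled axis-aligned box."""
    label: str
    x: float       # left edge, in TikZ units
    y: float       # bottom edge, in TikZ units
    width: float
    height: float


def layout_to_tikz(objects: list[SketchObject]) -> str:
    """Render a layout as a TikZ picture with one labeled rectangle per object."""
    lines = [r"\begin{tikzpicture}"]
    for obj in objects:
        x2, y2 = obj.x + obj.width, obj.y + obj.height
        cx, cy = obj.x + obj.width / 2, obj.y + obj.height / 2
        # Outline of the object, then its label at the box center.
        lines.append(rf"  \draw ({obj.x:g},{obj.y:g}) rectangle ({x2:g},{y2:g});")
        lines.append(rf"  \node at ({cx:g},{cy:g}) {{{obj.label}}};")
    lines.append(r"\end{tikzpicture}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical layout for a prompt such as "a dog to the left of a bench".
    print(layout_to_tikz([
        SketchObject("dog", 0, 0, 2, 1.5),
        SketchObject("bench", 3, 0, 3, 2),
    ]))
```

Compiling the printed TikZ with LaTeX would give a layout image that could serve as the reference sketch; in Control-GPT itself the TikZ code is written by GPT-4 rather than produced from a hand-specified layout.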

Paper Structure (28 sections, 4 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: Controllable text-to-image generation with GPT-4 in the loop. Among the models compared, Control-GPT both generates multiple objects reliably and produces images that follow the TikZ sketch exactly. DALL-E 2 and Stable Diffusion neither generate all the objects stated in the text consistently nor offer easy control over the layout of the generated image.
  • Figure 2: Control-GPT Architecture. Our model is built on top of ControlNet and extended to take additional grounding text. The model takes in both reference images and grounding object text, fusing them using attention layers before feeding them to Stable Diffusion.
  • Figure 3: Training data construction. We convert the instance masks in LVIS data into polygons and use the corresponding images and captions from COCO to construct the training data to fine-tune ControlNet.
  • Figure 4: Visualization of the generated sketches
  • Figure 5: Example prompt for benchmarking object position and size across models. The prompt is passed directly to GPT-4 to draw the sketch, or to ControlNet/Stable Diffusion to generate an image.
  • ...and 8 more figures
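
The training data construction described in Figure 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: it assumes annotations in the COCO-style polygon format (a `segmentation` list of flat `[x1, y1, x2, y2, ...]` lists plus a `category_id`), and it fills each polygon with its category id, which may differ from how the paper actually renders its sketches. The function names are hypothetical.

```python
def polygon_to_pairs(flat):
    """Turn a flat [x1, y1, x2, y2, ...] list into a list of (x, y) vertices."""
    return list(zip(flat[0::2], flat[1::2]))


def point_in_polygon(px, py, polygon):
    """Even-odd rule: count polygon edges crossed by a ray going right from (px, py)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge straddles the horizontal line through py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def rasterize_sketch(annotations, width, height):
    """Paint polygons into a height x width grid; 0 is background, filled cells hold category ids."""
    canvas = [[0] * width for _ in range(height)]
    for ann in annotations:
        for flat in ann["segmentation"]:
            polygon = polygon_to_pairs(flat)
            for row in range(height):
                for col in range(width):
                    # Test the pixel center against the polygon.
                    if point_in_polygon(col + 0.5, row + 0.5, polygon):
                        canvas[row][col] = ann["category_id"]
    return canvas


if __name__ == "__main__":
    # Hypothetical annotation: one square object with category id 7.
    toy = [{"category_id": 7, "segmentation": [[1, 1, 4, 1, 4, 4, 1, 4]]}]
    for row in rasterize_sketch(toy, width=6, height=6):
        print(row)
```

The per-pixel loop keeps the example dependency-free; a real pipeline would more likely rely on an image library's polygon drawing routine for speed.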