
Generating Illustrated Instructions

Sachit Menon, Ishan Misra, Rohit Girdhar

TL;DR

This work introduces Illustrated Instructions, the task of generating text and visuals tailored to a user’s goal, and formalizes three desiderata: goal faithfulness, step faithfulness, and cross-image consistency. It presents StackedDiffusion, a diffusion-based approach that stacks latents and uses separate goal/step encodings with step-positional cues and spatial tiling to produce coherent, multi-image instructional articles without additional learnable parameters. On a WikiHow-derived dataset, the method outperforms baselines, including frozen and finetuned text-to-image models and multimodal LLMs, on both automated metrics and human preference, and in some cases its illustrations are preferred over the ground truth. The approach enables practical, personalized instruction with goal suggestion and error correction, paving the way for adaptive, visually guided learning beyond static web articles.

Abstract

We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.

Paper Structure

This paper contains 27 sections, 2 equations, 22 figures, 2 tables.

Figures (22)

  • Figure 1: StackedDiffusion generating Illustrated Instructions. Given a goal (or any textual user input), StackedDiffusion produces a customized instructional article complete with illustrations that not only tells the user how to achieve the goal in words, but also shows the user by providing illustrations.
  • Figure 2: Failure modes of a naive approach. A frozen T2I model cannot capture both the goal and the step, showing only one or the other depending on how it is prompted. Further, it cannot produce consistent images, leading to odd changes such as the color of the ice varying between images.
  • Figure 3: Overview of StackedDiffusion. At training time, we use the given goal and step text, and stack the encoded ground truth step-images. At inference time, we obtain the goal and step text from an LLM, and unstack denoised latents to produce the output images. See the method section for details and notation.
  • Figure 4: Illustrated Instructions data. The histogram shows the distribution of step counts in the data. We find that more than 80% of articles consist of $6$ or fewer steps.
  • Figure 5: Metrics. We introduce three metrics, one for each desideratum presented in the problem section. Goal faithfulness: the second image does not show muffins. Step faithfulness: the second image does not show the step. Cross-image consistency: the second image shows a different number of meatballs with different visuals.
  • ...and 17 more figures
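
The stacking and unstacking of latents described in the Figure 3 caption can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: the function names `stack_latents` and `unstack_latents` are hypothetical, the latents are assumed to be arrays of shape (steps, channels, height, width), and the tiles are placed along the height axis, which is only one possible form of spatial tiling.

```python
import numpy as np


def stack_latents(step_latents: np.ndarray) -> np.ndarray:
    """Tile per-step latents (N, C, H, W) along the height axis into one (C, N*H, W) latent.

    Assumption: vertical tiling; the layout used by the paper may differ.
    """
    n, c, h, w = step_latents.shape
    # Move channels first, then merge the step and height axes so that
    # rows [i*H, (i+1)*H) of the result hold the latent of step i.
    return step_latents.transpose(1, 0, 2, 3).reshape(c, n * h, w)


def unstack_latents(stacked: np.ndarray, num_steps: int) -> np.ndarray:
    """Split a stacked (C, N*H, W) latent back into per-step latents (N, C, H, W)."""
    c, total_h, w = stacked.shape
    if total_h % num_steps != 0:
        raise ValueError("stacked height must be divisible by num_steps")
    h = total_h // num_steps
    return stacked.reshape(c, num_steps, h, w).transpose(1, 0, 2, 3)


# Example: three steps, each with a 4-channel 8x8 latent (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
latents = rng.standard_normal((3, 4, 8, 8))
stacked = stack_latents(latents)        # shape (4, 24, 8)
recovered = unstack_latents(stacked, 3)  # shape (3, 4, 8, 8)
```

Because the stacked array is a single spatial latent, a denoiser operating on it would see all steps at once, which is the intuition behind using stacking to encourage consistency across images. The sketch covers only the reshaping and omits the goal/step text encodings and step-positional cues.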