Coherent Zero-Shot Visual Instruction Generation

Quynh Phung; Songwei Ge; Jia-Bin Huang

TL;DR

This paper addresses the challenge of generating coherent static visual instructions from textual procedures without fine-tuning diffusion models. It introduces a two-stage approach: in-context planning with LLMs to re-caption instructions into state-aware prompts, followed by an adaptive KV-sharing mechanism during diffusion that enforces cross-step consistency while still allowing state changes. The method remains zero-shot, uses segmentation masks for local consistency and a state similarity matrix for global control, and is evaluated with both traditional metrics and vision-language models, which indicate improved text alignment and continuity. The work offers a practical framework for visual instruction generation and a basis for evaluating such outputs with large vision-language models.

Abstract

Despite the advances in text-to-image synthesis, particularly with diffusion models, generating visual instructions that require consistent representation and smooth state transitions of objects across sequential steps remains a formidable challenge. This paper introduces a simple, training-free framework to tackle these issues, capitalizing on the advancements in diffusion models and large language models (LLMs). Our approach systematically integrates text comprehension and image generation to ensure visual instructions are visually appealing and maintain consistency and accuracy throughout the instruction sequence. We validate its effectiveness by testing multi-step instructions and comparing the text alignment and consistency with several baselines. Our experiments show that our approach can visualize coherent and visually pleasing instructions.

Paper Structure (14 sections, 9 equations, 12 figures, 3 tables)

Introduction
Related work
Method
Preliminaries
Re-captioning instructions as descriptive texts
Dynamic consistent image generation
Experiments
Experiment setup
Quantitative results
Qualitative results
Failure cases and discussion
Conclusion
Appendix
Additional discussion and details about large language models

Figures (12)

Figure 1: Visual instruction generation results. Given a sequence of textual instructions for a certain task, our method generates the visual instructions that illustrate the individual steps. Our method is training-free and thus preserves the quality and generalizability of the underlying image generation models. We showcase the generated visual instructions for different tasks from cooking to gardening. The samples possess high visual quality, align with the instructions, and maintain coherent object identity with desired changes at each step.
Figure 2: Limitation of text-to-image generation in the visual instruction task. The crucial components of a good visual instruction are 1) alignment with the text-based instruction and 2) coherence across different steps demonstrating the state changes. Current text-to-image generation methods focus only on the former. Consequently, the results may confuse readers. In this paper, we develop a training-free method to enable more coherent visual instruction generation.
Figure 3: Our framework for zero-shot instruction visualization. Our framework operates in two phases. In the first phase, we use an LLM (e.g., GPT-4) to generate the scene state after each step in the list of instructions. The generated scene states guide image generation in the second phase. We also ask the LLM to generate a similarity matrix between states, in which each row indicates the visual similarity of the current step to the others. When state similarity is high, we wish to maintain as much consistency as possible across the two steps. A low state similarity indicates that the performed action changes the scene state substantially; in such cases, blindly encouraging consistency across steps may hurt the quality of the visualized instruction. In the second phase, we replace the standard attention layer with a shared attention layer that allows queries from one image to access keys and values from the other images in the same instruction set. We refine this sharing mechanism with standard attention masking controlled by the similarity matrix, which modulates the interaction between visual elements.
Figure 4: Cross-attention map of Stable Cascade. We visualize the cross-attention maps in stage C of the Stable Cascade model. The attention maps are noisy and fail to accurately delineate the regions of the main objects: woman and umbrella.
Figure 5: Evaluation of different design choices for text prompts using LLMs, including Gemini and GPT-4. Across the evaluation aspects (text alignment, continuity, consistency, and relevance), our choice of concatenating action and state outperforms using the action only or concatenating with previous actions.
...and 7 more figures
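
The shared attention described in the Figure 3 caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `shared_attention`, the hard `threshold` rule that turns similarity scores into a mask, and the toy vectors are all assumptions made for illustration, and the segmentation-mask-based local control mentioned in the TL;DR is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(queries, keys, values, similarity, threshold=0.5):
    """Attention in which each step may read keys/values of similar steps.

    queries, keys, values: one entry per instruction step, each a list of
    token vectors (lists of floats).
    similarity: N x N state similarity matrix with entries in [0, 1].
    threshold: hypothetical cutoff; step i reads step j only when
    similarity[i][j] >= threshold (a step always reads itself).
    """
    n_steps = len(queries)
    scale = 1.0 / math.sqrt(len(queries[0][0]))
    outputs = []
    for i in range(n_steps):
        # Masking: keep only the steps that are similar enough to step i.
        visible = [j for j in range(n_steps)
                   if j == i or similarity[i][j] >= threshold]
        pooled_keys = [k for j in visible for k in keys[j]]
        pooled_values = [v for j in visible for v in values[j]]
        step_out = []
        for q in queries[i]:
            scores = [scale * sum(a * b for a, b in zip(q, k))
                      for k in pooled_keys]
            weights = softmax(scores)
            width = len(pooled_values[0])
            step_out.append([
                sum(w * v[c] for w, v in zip(weights, pooled_values))
                for c in range(width)
            ])
        outputs.append(step_out)
    return outputs


# Toy data: two steps, one token each, two feature dimensions.
queries = [[[1.0, 0.0]], [[1.0, 0.0]]]
keys = [[[1.0, 0.0]], [[0.0, 1.0]]]
values = [[[1.0, 0.0]], [[0.0, 1.0]]]

# Dissimilar states: each step attends only to itself.
isolated = shared_attention(queries, keys, values, [[1.0, 0.0], [0.0, 1.0]])
# Similar states: both steps attend over the pooled keys and values.
shared = shared_attention(queries, keys, values, [[1.0, 1.0], [1.0, 1.0]])
```

With dissimilar states the call reduces to ordinary per-image self-attention, so an action that changes the scene substantially is not forced to copy appearance from other steps. With similar states, both steps draw on the same pooled keys and values, which is what encourages a consistent appearance across them.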
