Coherent Zero-Shot Visual Instruction Generation
Quynh Phung, Songwei Ge, Jia-Bin Huang
TL;DR
This paper addresses the challenge of generating coherent static visual instructions from textual procedures without fine-tuning diffusion models. It introduces a two-stage approach: in-context planning with LLMs, which re-captions each instruction into a state-aware prompt, and an adaptive KV-sharing mechanism applied during diffusion, which enforces cross-step consistency while still allowing state changes. The method remains zero-shot, using segmentation masks for local consistency and a state similarity matrix for global control. It is evaluated with both traditional metrics and vision-language models, and shows improved text alignment and continuity. The work offers a practical framework for visual instruction generation and a basis for evaluating such outputs with large vision-language models.
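To make the idea of similarity-controlled KV sharing more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each step's queries attend to the keys and values of every step, and that attention to another step's tokens is scaled by a state similarity value in [0, 1]. The function name `shared_kv_attention`, the tensor shapes, and the multiplicative weighting are illustrative assumptions. The segmentation masks mentioned above are omitted.

```python
import numpy as np

def shared_kv_attention(q, k, v, sim):
    """Hypothetical sketch of cross-step KV sharing (not the paper's exact rule).

    q, k, v: arrays of shape (steps, tokens, dim), one slice per instruction step.
    sim:     (steps, steps) state similarity matrix with entries in [0, 1]
             and ones on the diagonal (assumed).
    """
    n, t, d = q.shape
    # Pool keys and values from all steps so every step can attend to them.
    k_all = k.reshape(n * t, d)
    v_all = v.reshape(n * t, d)
    out = np.empty_like(q)
    for i in range(n):
        logits = q[i] @ k_all.T / np.sqrt(d)           # (tokens, steps * tokens)
        logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
        # Scale each key's unnormalized weight by the similarity between
        # step i and the step that key came from (assumed weighting scheme).
        weights = np.exp(logits) * np.repeat(sim[i], t)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[i] = weights @ v_all
    return out
```

Under these assumptions, an identity similarity matrix reduces to ordinary per-step self-attention (no sharing), while an all-ones matrix shares keys and values fully across steps. Intermediate values interpolate between the two, which is one way a global matrix could permit state changes between dissimilar steps.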
Abstract
Despite advances in text-to-image synthesis, particularly with diffusion models, generating visual instructions that require consistent representation and smooth state transitions of objects across sequential steps remains a formidable challenge. This paper introduces a simple, training-free framework to address this challenge, capitalizing on advances in diffusion models and large language models (LLMs). Our approach systematically integrates text comprehension and image generation so that the visual instructions are visually appealing and remain consistent and accurate throughout the instruction sequence. We validate its effectiveness on multi-step instructions, comparing text alignment and consistency against several baselines. Our experiments show that our approach can visualize coherent and visually pleasing instructions.
