Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection

Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang

TL;DR

This work tackles the problem of producing coherent, illustrative visual instructions for long-horizon tasks. It introduces LIGER, a training-free pipeline that combines history-aware draft image generation, tool-based self-reflection, and inversion-guided memory calibration to maintain logical consistency and accurate object attributes across many steps. A new dataset of 569 long-horizon tasks with human-annotated ground-truth expressions and logic relations is curated to evaluate visual illustration quality and semantic alignment. Experiments show LIGER outperforms baselines on automatic metrics and human judgments, demonstrating improved illustrativeness, logical coherence, and attribute accuracy with a memory-informed, self-reflective approach.

Abstract

Visual instructions for long-horizon tasks are crucial as they intuitively clarify complex concepts and enhance retention across extended steps. Directly generating a series of images with text-to-image models, without considering the context of previous steps, results in inconsistent images and increases cognitive load. Additionally, the generated images often miss objects or render object attributes such as color, shape, and state inaccurately. To address these challenges, we propose LIGER, the first training-free framework for Long-horizon Instruction GEneration with logic and attribute self-Reflection. LIGER first generates a draft image for each step using the historical prompt and the visual memory of previous steps. This step-by-step generation approach maintains consistency between images in long-horizon tasks. Moreover, LIGER uses various image editing tools to rectify errors in the draft images, including wrong attributes, logic errors, object redundancy, and identity inconsistency. Through this self-reflection mechanism, LIGER improves the logic and object attribute correctness of the images. To verify whether the generated images assist human understanding, we manually curated a new benchmark consisting of various long-horizon tasks. Human-annotated ground-truth expressions reflect human-defined criteria for how an image should look in order to be illustrative. Experiments demonstrate that the visual instructions generated by LIGER are more comprehensive than those of baseline methods.
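The control flow described in the abstract, together with the referee and memory-calibration steps named in the Figure 2 caption, can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: every name here (`Memory`, `generate_instructions`, and the `draft`, `detect_error`, `tools`, `referee`, and `calibrate` callables) is hypothetical, images are stood in for by plain strings, and the actual models and editing tools are left as caller-supplied functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in type: a real pipeline would pass image tensors or files.
Image = str


@dataclass
class Memory:
    """History carried across steps (names are assumptions, not from the paper)."""
    prompts: List[str] = field(default_factory=list)   # historical prompts
    images: List[Image] = field(default_factory=list)  # visual memory


def generate_instructions(
    steps: List[str],
    draft: Callable[[str, Memory], Image],
    detect_error: Callable[[Image, str], Optional[str]],
    tools: Dict[str, Callable[[Image, str], Image]],
    referee: Callable[[Image, Image, str], Image],
    calibrate: Callable[[Image], Image],
) -> List[Image]:
    """Produce one image per step, reflecting on each draft before keeping it."""
    memory = Memory()
    outputs: List[Image] = []
    for step in steps:
        # (1) Draft image conditioned on the step text and the history so far.
        image = draft(step, memory)
        # (2) Detect an error type; if a matching tool exists, produce a revision.
        error = detect_error(image, step)
        if error is not None and error in tools:
            revised = tools[error](image, step)
            # (3) The referee chooses between the draft and the revised image.
            image = referee(image, revised, step)
        outputs.append(image)
        # Update the history; the calibrated image serves as memory for later steps.
        memory.prompts.append(step)
        memory.images.append(calibrate(image))
    return outputs
```

Passing the components in as callables keeps the sketch independent of any particular text-to-image model, error detector, or editing tool, so the loop can be exercised with simple stubs.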

Paper Structure

This paper contains 20 sections, 4 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Visual instructions generated by LIGER. Key merits are highlighted in the figure.
  • Figure 2: Pipeline overview. LIGER generates visual instructions step by step: (1) it generates a draft image, taking the visual memory, step description, and historical prompt as input; (2) the error detector identifies the error and the corresponding tool fixes it, generating a revised image; (3) the referee tool compares the two images and selects one as the final output. LIGER further uses inversion-guided visual memory calibration for future step generation.
  • Figure 3: Visualization of different error types and the effect of self-reflection. The motivation for self-reflection is to rectify errors including (a) over-consistency, (b) object redundancy, (c) inconsistent identity, and (d) wrong attributes.
  • Figure 4: Dataset statistics and the influence of the step length of tasks.
  • Figure 5: Detailed qualitative comparisons on different long-horizon tasks. Zoom in to see details.
  • ...and 5 more figures