MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation

Yiren Song, Cheng Liu, Mike Zheng Shou

TL;DR

This work tackles cross-domain procedural sequence generation by introducing MakeAnything, a diffusion-transformer framework that leverages asymmetric LoRA fine-tuning and a ReCraft image-conditioned module. It curates a large 21-task, 24k-sequence dataset to enable multi-task learning and cross-domain generalization. The approach demonstrates superior text-to-process and image-to-process capabilities, including unseen domains, through extensive evaluations and user studies. Overall, it establishes a unified, controllable paradigm for generating step-by-step procedures across diverse domains with strong coherence and visual consistency.

Abstract

A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) the scarcity of multi-task procedural datasets, (2) the difficulty of maintaining logical continuity and visual consistency between steps, and (3) the need to generalize across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DiT), which leverages fine-tuning to activate the in-context capabilities of DiT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences. Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.
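
To make the asymmetric LoRA idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes one particular reading of the abstract: within a LoRA update `B (A x)`, the down-projection `A` plays the role of the frozen "encoder" and the up-projection `B` plays the role of the tuned "decoder". The class name `AsymmetricLoRALinear`, the `scale` parameter, and the plain squared-error update are all hypothetical choices made for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class AsymmetricLoRALinear:
    """Toy layer y = W x + scale * B (A x).

    Assumption for illustration: W and A stay frozen, only B is trained.
    """

    def __init__(self, W, A, scale=1.0):
        self.W = W                                    # frozen base weight, d_out x d_in
        self.A = A                                    # frozen down-projection, rank x d_in
        self.B = [[0.0] * len(A) for _ in W]          # trainable up-projection, zero init
        self.scale = scale

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def sgd_step(self, x, target, lr=0.1):
        """One gradient step on 0.5 * ||y - target||^2 that touches B only.

        Returns the loss measured before the update.
        """
        err = [y - t for y, t in zip(self.forward(x), target)]
        hidden = matvec(self.A, x)
        for i, e in enumerate(err):
            for j, h in enumerate(hidden):
                self.B[i][j] -= lr * self.scale * e * h   # dL/dB_ij = scale * e_i * h_j
        return 0.5 * sum(e * e for e in err)


layer = AsymmetricLoRALinear(W=[[1.0, 0.0], [0.0, 1.0]], A=[[1.0, 1.0]])
x, target = [1.0, 2.0], [2.0, 2.0]
loss_before = layer.sgd_step(x, target)    # updates B, leaves W and A alone
loss_after = layer.sgd_step(x, target)
```

Because `B` starts at zero, the layer initially reproduces the base model exactly, and every later change is confined to `B`. In this sketch that is what lets a shared, frozen part coexist with a small task-specific part.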

Paper Structure

This paper contains 23 sections, 6 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: We introduce MakeAnything, a tool that realistically and logically generates step-by-step procedural tutorials for activities such as painting, crafting, and cooking, based on text descriptions or conditioned images.
  • Figure 2: The MakeAnything framework comprises two core components: (1) an Asymmetric LoRA module that generates diverse creation processes from text prompts, and (2) the ReCraft Model, which constructs an image-conditioned base model by merging pretrained LoRA weights with the Flux foundation model, enabling process prediction via injected visual tokens.
  • Figure 3: Examples from the MakeAnything Dataset, which consists of 21 tasks with over 24,000 procedural sequences.
  • Figure 4: Generation results of MakeAnything. From top: Text-to-Sequence outputs conditioned on textual prompts; Image-to-Sequence reconstructions via ReCraft Model; Unseen Domain generalization combining procedural LoRA (blue) with stylistic LoRA (red).
  • Figure 5: Comparison with baselines on different tasks.
  • ...and 6 more figures
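
The Figure 2 caption mentions merging pretrained LoRA weights with the foundation model. As a hedged illustration of what such a merge generally means for a single weight matrix, the sketch below folds a low-rank update into the base weight using the standard LoRA identity W' = W + scale · B A. The function name `merge_lora` and the `scale` argument are hypothetical, and nothing here is taken from the MakeAnything or Flux codebases.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A) as a new matrix; the inputs are not modified."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, d_out x d_in
A = [[1.0, 1.0]]               # down-projection, rank x d_in
B = [[0.5], [0.0]]             # up-projection, d_out x rank
merged = merge_lora(W, A, B, scale=2.0)
```

After merging, the adapted layer is an ordinary dense layer again, so it can serve as the base for further conditioning or fine-tuning without carrying the adapter as a separate branch.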