Learning Human Skill Generators at Key-Step Levels

Yilu Wu, Chenhui Zhu, Shuai Wang, Hanlin Wang, Jing Wang, Zhaoxiang Zhang, Limin Wang

TL;DR

The work tackles the challenge of generating long, multi-step human skill videos by introducing KS-Gen, which decomposes skills into key-step clips using an initial state image $I_0$ and a skill goal $G$. It proposes a three-stage pipeline to synthesize coherent key-step clips: MLLM-based step planning with a retrieval-augmented planner, a Key-step Image Generator (KIG) that produces consistent first frames for each step, and a video generation stage. A curated benchmark built from instructional video datasets, together with evaluation metrics covering action semantics, frame-level semantics, motion dynamics, and perceptual quality, as well as human studies, supports rigorous assessment. The approach demonstrates substantial improvements over baselines in both automatic metrics and perceived quality, enhancing the realism and pedagogical value of generated skill videos and offering a foundation for embodied intelligence and human skill learning from synthetic data.

Abstract

We are committed to learning human skill generators at key-step levels. The generation of skills is a challenging endeavor, but its successful implementation could greatly facilitate human skill learning and provide more experience for embodied intelligence. Although current video generation models can synthesize simple, atomic human operations, they struggle with human skills because of their complex procedures. Human skills involve multi-step, long-duration actions and complex scene transitions, so existing naive auto-regressive methods for synthesizing long videos cannot generate them. To address this, we propose a novel task, Key-step Skill Generation (KS-Gen), aimed at reducing the complexity of generating human skill videos. Given the initial state and a skill description, the task is to generate video clips of the key steps that complete the skill, rather than a full-length video. To support this task, we introduce a carefully curated dataset and define multiple evaluation metrics to assess performance. Considering the complexity of KS-Gen, we propose a new framework for this task. First, a multimodal large language model (MLLM) generates descriptions of the key steps using retrieval augmentation. Subsequently, we use a Key-step Image Generator (KIG) to address the discontinuity between key steps in skill videos. Finally, a video generation model uses these descriptions and key-step images to generate video clips of the key steps with high temporal consistency. We offer a detailed analysis of the results, hoping to provide more insights on human skill generation. All models and data are available at https://github.com/MCG-NJU/KS-Gen.
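
The three-stage framework in the abstract can be read as a composition of a planner, a key-step image generator, and a video generator. The sketch below is only an illustration of that data flow, not the authors' implementation: the names `generate_skill`, `KeyStepClip`, and the three callables are hypothetical, and strings stand in for real images and videos.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: a real system would pass image tensors and video arrays.
Image = str
Video = str


@dataclass
class KeyStepClip:
    description: str    # step description produced by the planner
    first_frame: Image  # key-step image used as the first frame of the clip
    clip: Video         # generated clip for this key step


def generate_skill(
    initial_image: Image,
    skill: str,
    plan_steps: Callable[[Image, str], List[str]],
    generate_key_images: Callable[[Image, List[str]], List[Image]],
    generate_clip: Callable[[Image, str], Video],
) -> List[KeyStepClip]:
    """Run the three stages in order and return one clip per key step."""
    # Stage 1: the planner (an MLLM in the paper) turns the skill into step descriptions.
    steps = plan_steps(initial_image, skill)
    # Stage 2: the key-step image generator yields one first frame per step.
    frames = generate_key_images(initial_image, steps)
    if len(frames) != len(steps):
        raise ValueError("expected one key-step image per step")
    # Stage 3: each clip is conditioned on its own first frame and description.
    return [KeyStepClip(s, f, generate_clip(f, s)) for s, f in zip(steps, frames)]


# Stub stages (assumptions for illustration only).
def _stub_planner(image: Image, skill: str) -> List[str]:
    return [f"{skill}: step {i}" for i in (1, 2, 3)]


def _stub_key_images(image: Image, steps: List[str]) -> List[Image]:
    return [f"frame<{image}|{s}>" for s in steps]


def _stub_video(frame: Image, step: str) -> Video:
    return f"clip<{frame}>"


clips = generate_skill("I0", "make matcha", _stub_planner, _stub_key_images, _stub_video)
```

Because each clip depends only on its own key-step image and description, the clips in stage 3 could in principle be generated independently of one another once stage 2 has finished.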

Paper Structure

This paper contains 23 sections, 1 equation, 11 figures, 11 tables.

Figures (11)

  • Figure 1: This figure presents three different tasks related to human skill learning. Given an initial image and a text prompt, procedure planning generates a series of steps in textual form. Video generation models can produce a single action video based on detailed text prompts. In contrast, KS-Gen generates multiple key-step videos that complete the skill, using only a simple skill description and image as input.
  • Figure 2: Overview of the key-step skill generator. Taking the skill "make matcha" as an example, this skill includes three key steps. First, based on the given initial image and skill description, we generate detailed descriptions of the three steps through an MLLM using retrieval augmentation (RAG). (The figure shows the simplified step descriptions.) Then, we input the initial image and step descriptions into the Key-step Image Generation model to generate the first frame of each step. Finally, we use the generated step descriptions as prompts for the video generation model and create video clips for each of the three key steps, based on the corresponding key-step images.
  • Figure 3: Key-step Image Generation. The input consists of an initial image and step descriptions, from which features are extracted using the IP-Adapter image encoder and CLIP text encoder, respectively. These image and text features are fed into a multi-layer Transformer decoder to autoregressively generate the image features for subsequent clips. The predicted features are then injected into Stable Diffusion XL with IP-Adapter to produce the images.
  • Figure 4: The visualization of the skill generator.
  • Figure 5: The visualization with different image generation models.
  • ...and 6 more figures
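
The autoregressive feature prediction described in the Figure 3 caption can be sketched as a loop in which each predicted image feature is appended to the context used for the next step. This is a rough illustration under stated assumptions: `predict_key_step_features` and the `decoder` callable are hypothetical, plain lists of floats stand in for encoder embeddings, and the exact conditioning (here, all text features up to the current step) is a guess rather than something the caption specifies. The final injection into Stable Diffusion XL is not shown.

```python
from typing import Callable, List, Sequence

Feature = List[float]  # hypothetical stand-in for an embedding vector


def predict_key_step_features(
    initial_image_feature: Feature,
    step_text_features: Sequence[Feature],
    decoder: Callable[[List[Feature], List[Feature]], Feature],
) -> List[Feature]:
    """Predict one image feature per step, feeding each prediction back as context."""
    image_context: List[Feature] = [initial_image_feature]
    predicted: List[Feature] = []
    for k in range(len(step_text_features)):
        # Assumption: the decoder sees the text features up to and including step k.
        text_context = list(step_text_features[: k + 1])
        next_feature = decoder(image_context, text_context)
        predicted.append(next_feature)
        # Autoregression: the new feature becomes part of the image context.
        image_context.append(next_feature)
    return predicted


# Toy decoder (an assumption): averages the latest image and text features.
def _mean_decoder(image_context: List[Feature], text_context: List[Feature]) -> Feature:
    last_image, last_text = image_context[-1], text_context[-1]
    return [(a + b) / 2 for a, b in zip(last_image, last_text)]


features = predict_key_step_features(
    [0.0, 0.0], [[2.0, 2.0], [4.0, 4.0]], _mean_decoder
)
```

With the toy decoder, the second predicted feature depends on the first prediction rather than on the initial image alone, which is the property that distinguishes autoregressive prediction from predicting every step independently.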