Learning Human Skill Generators at Key-Step Levels

Yilu Wu; Chenhui Zhu; Shuai Wang; Hanlin Wang; Jing Wang; Zhaoxiang Zhang; Limin Wang

TL;DR

The work tackles the challenge of generating long, multi-step human skill videos by introducing KS-Gen, a task that decomposes a skill into key-step clips given an initial state image $I_0$ and a skill goal $G$. It proposes a three-stage pipeline to synthesize coherent key-step clips: MLLM-based step planning with a retrieval-augmented planner, a Key-step Image Generator (KIG) that produces consistent first frames for each step, and a video generation stage. A curated benchmark built from instructional video datasets supports assessment, together with evaluation metrics covering action semantics, frame-level semantics, motion dynamics, and perceptual quality, as well as human studies. The approach is reported to improve over baselines in both automatic metrics and perceived quality, enhancing the realism and pedagogical value of generated skill videos and offering a foundation for embodied intelligence and human skill learning from synthetic data.

Abstract

We study the problem of learning human skill generators at the key-step level. Generating skills is challenging, but success could greatly facilitate human skill learning and provide more experience for embodied intelligence. Although current video generation models can synthesize simple, atomic human operations, they struggle with human skills because of their complex procedural structure. Human skills involve multi-step, long-duration actions and complex scene transitions, so existing naive auto-regressive methods for synthesizing long videos cannot generate them. To address this, we propose a novel task, Key-step Skill Generation (KS-Gen), aimed at reducing the complexity of generating human skill videos. Given the initial state and a skill description, the task is to generate video clips of the key steps needed to complete the skill, rather than a full-length video. To support this task, we introduce a carefully curated dataset and define multiple evaluation metrics to assess performance. Considering the complexity of KS-Gen, we propose a new framework for this task. First, a multimodal large language model (MLLM) generates descriptions of the key steps using retrieval augmentation. Next, a Key-step Image Generator (KIG) addresses the discontinuity between key steps in skill videos. Finally, a video generation model uses these descriptions and key-step images to generate video clips of the key steps with high temporal consistency. We offer a detailed analysis of the results, hoping to provide more insight into human skill generation. All models and data are available at https://github.com/MCG-NJU/KS-Gen.
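The three stages described above can be read as a simple composition of functions. The following is a minimal sketch of that data flow only, not the released implementation: every name in it (`KeyStepClip`, `generate_key_step_clips`, and the three stage callables) is hypothetical, images are stood in for by `bytes`, and it assumes that each key-step image is conditioned on the initial image and the step description, which the abstract does not specify in detail.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical placeholder types: a real system would use image/video tensors.
Image = bytes
Video = List[Image]


@dataclass
class KeyStepClip:
    description: str    # step description produced by the planner
    first_frame: Image  # key-step image used as the clip's first frame
    clip: Video         # generated frames for this key step


def generate_key_step_clips(
    initial_image: Image,
    skill_goal: str,
    plan_steps: Callable[[Image, str], List[str]],
    generate_key_image: Callable[[Image, str], Image],
    generate_clip: Callable[[Image, str], Video],
) -> List[KeyStepClip]:
    """Compose the three stages; each stage is injected as a callable."""
    # Stage 1: the planner (an MLLM in the paper) proposes key-step descriptions.
    descriptions = plan_steps(initial_image, skill_goal)

    results = []
    for description in descriptions:
        # Stage 2: produce a first frame for this step.
        # Assumption: conditioned on the initial image and the description.
        first_frame = generate_key_image(initial_image, description)
        # Stage 3: generate the clip from the first frame and the description.
        clip = generate_clip(first_frame, description)
        results.append(KeyStepClip(description, first_frame, clip))
    return results
```

Passing the stages in as callables keeps the sketch independent of any particular planner, image generator, or video model, so each can be replaced by a stub when testing the control flow.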
