AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents

Jieming Cui, Tengyu Liu, Nian Liu, Yaodong Yang, Yixin Zhu, Siyuan Huang

TL;DR

AnySkill generates realistic, natural motion sequences in response to unseen instructions of varying lengths. The authors present it as the first method capable of open-vocabulary physical skill learning for interactive humanoid agents.

Abstract

Traditional approaches in physics-based motion generation, centered on imitation learning and reward shaping, often struggle to adapt to new scenarios. To tackle this limitation, we propose AnySkill, a novel hierarchical method that learns physically plausible interactions following open-vocabulary instructions. Our approach begins by learning a set of atomic actions with a low-level controller trained via imitation learning. Upon receiving an open-vocabulary textual instruction, AnySkill employs a high-level policy that selects and integrates these atomic actions to maximize the CLIP similarity between the agent's rendered images and the text. An important feature of our method is the use of image-based rewards for the high-level policy, which allows the agent to learn interactions with objects without manual reward engineering. We demonstrate AnySkill's capability to generate realistic and natural motion sequences in response to unseen instructions of varying lengths, making it the first method capable of open-vocabulary physical skill learning for interactive humanoid agents.
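
The image-based reward described in the abstract can be illustrated with a minimal sketch. It assumes the rendered frame and the instruction have already been encoded into embedding vectors by a CLIP-style model, and it uses plain cosine similarity as the reward. The function names and toy vectors are hypothetical, and the paper's actual reward may include scaling or additional terms not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def clip_style_reward(image_embedding: Sequence[float],
                      text_embedding: Sequence[float]) -> float:
    """Hypothetical reward: similarity of a rendered-frame embedding to the
    instruction embedding. Both embeddings are assumed to be precomputed."""
    return cosine_similarity(image_embedding, text_embedding)


# Toy embeddings standing in for real CLIP outputs (assumption).
text = [1.0, 0.0, 0.0]
aligned_frame = [2.0, 0.0, 0.0]      # same direction as the text embedding
unrelated_frame = [0.0, 3.0, 0.0]    # orthogonal to the text embedding

reward_aligned = clip_style_reward(aligned_frame, text)
reward_unrelated = clip_style_reward(unrelated_frame, text)
```

Because the reward depends only on the rendered image and the text, no task-specific reward terms need to be written for a new instruction, which is the property the abstract highlights.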

Paper Structure

This paper contains 32 sections, 3 equations, 18 figures, 5 tables, 1 algorithm.

Figures (18)

  • Figure 1: Diverse motions generated by AnySkill conditioned on various instructions. When provided with an open-vocabulary text description of a motion, AnySkill is adept at learning natural and flexible motions that closely align with the description, facilitated by an image-based reward mechanism. Additionally, AnySkill demonstrates proficiency in learning interactions with dynamic objects, showcasing its versatile motion generation capabilities.
  • Figure 2: The hierarchical structure of AnySkill. Initially, the low-level controller (top-left) is trained to encode unlabeled motions into a shared latent space $\mathcal{Z}$. Subsequently, for each open-vocabulary text description, a high-level policy is trained. This policy orchestrates low-level actions to optimize the CLIP similarity between rendered images and the provided text, effectively composing actions that align with the textual instructions.
  • Figure 3: Atomic actions from the trained low-level controller. Each subfigure depicts the green agent demonstrating the reference motion from the dataset, while the white agent illustrates the corresponding learned atomic action.
  • Figure 4: Qualitative comparisons on open-vocabulary motion generation. From top to bottom, the descriptions are "sit down, bent torso, legs folded at knees", "legs off the ground, wave hands", and "coiling the arm, throw a ball". We showcase the most representative frames that best align with the descriptions.
  • Figure 5: Qualitative results of generated motion by AnySkill. Displayed are specific text descriptions and the corresponding motions generated by AnySkill, as evaluated in the user study. Motion sequences progress from left to right.
  • ...and 13 more figures
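
The data flow described in the Figure 2 caption can be sketched as a simple rollout loop: a high-level policy emits a latent code in $\mathcal{Z}$, the low-level controller turns that code into an action, and each rendered frame is scored against the text. Everything below is an illustrative assumption: the callables, the toy dynamics, and the use of the state itself as the "image embedding" are stand-ins, and the sketch shows only how rewards would be collected, not how the high-level policy is trained.

```python
import math
from typing import Callable, List

Vec = List[float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0.0 else sum(x * y for x, y in zip(a, b)) / norm


def rollout(high_level: Callable[[Vec], Vec],
            low_level: Callable[[Vec, Vec], Vec],
            step: Callable[[Vec, Vec], Vec],
            render_embed: Callable[[Vec], Vec],
            text_embedding: Vec,
            state: Vec,
            horizon: int) -> List[float]:
    """Collect one image-text similarity reward per step (hypothetical loop)."""
    rewards = []
    for _ in range(horizon):
        z = high_level(state)            # latent code chosen by the high-level policy
        action = low_level(state, z)     # low-level controller decodes z into an action
        state = step(state, action)      # physics step (toy stand-in here)
        rewards.append(cosine(render_embed(state), text_embedding))
    return rewards


# Toy stand-ins (assumptions, not the paper's components).
def toy_high_level(state: Vec) -> Vec:
    return [1.0, 0.0]                    # always pushes toward the text direction


def toy_low_level(state: Vec, z: Vec) -> Vec:
    return [0.5 * zi for zi in z]


def toy_step(state: Vec, action: Vec) -> Vec:
    return [s + a for s, a in zip(state, action)]


def toy_render_embed(state: Vec) -> Vec:
    return list(state)                   # pretend the state is the image embedding


rewards = rollout(toy_high_level, toy_low_level, toy_step, toy_render_embed,
                  text_embedding=[1.0, 0.0], state=[0.0, 1.0], horizon=3)
```

In this toy setup the reward rises at each step as the state rotates toward the text embedding. A trained high-level policy would be optimized to produce latent codes with that effect.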