Meta-Controller: Few-Shot Imitation of Unseen Embodiments and Tasks in Continuous Control

Seongwoong Cho, Donggyun Kim, Jinwoo Lee, Seunghoon Hong

TL;DR

This work tackles simultaneous few-shot generalization to unseen robot embodiments and tasks in continuous control. It introduces Meta-Controller, which unifies heterogeneous embodiments via a joint-level I/O representation and a structure-motion state encoder, paired with a matching-based policy that adapts from a handful of reward-free demonstrations. The approach is trained with episodic meta-learning and then fine-tuned using few-shot data, and it demonstrates superior generalization on the DeepMind Control suite compared to both modular policy learning and few-shot imitation baselines. Key innovations include a two-part state encoder that disentangles morphology and dynamics, and a non-parametric matching mechanism that recombines local motor skills to form robust policies. The results highlight improved cross-embodiment and cross-task adaptation, with practical implications for versatile and data-efficient robotic learning, albeit with considerations for real-world transfer and computational demands.

Abstract

Generalizing across robot embodiments and tasks is crucial for adaptive robotic systems. Modular policy learning approaches adapt to new embodiments but are limited to specific tasks, while few-shot imitation learning (IL) approaches often focus on a single embodiment. In this paper, we introduce a few-shot behavior cloning framework to simultaneously generalize to unseen embodiments and tasks using a few (e.g., five) reward-free demonstrations. Our framework leverages a joint-level input-output representation to unify the state and action spaces of heterogeneous embodiments and employs a novel structure-motion state encoder that is parameterized to capture both shared knowledge across all embodiments and embodiment-specific knowledge. A matching-based policy network then predicts actions from a few demonstrations, producing an adaptive policy that is robust to over-fitting. Evaluated in the DeepMind Control suite, our framework, termed Meta-Controller, demonstrates superior few-shot generalization to unseen embodiments and tasks over modular policy learning and few-shot IL approaches. Code is available at https://github.com/SeongwoongCho/meta-controller.
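
To make the joint-level input-output representation more concrete, the following is a minimal sketch of how embodiments with different numbers of joints could be mapped to one fixed-shape token layout. It is an illustration under stated assumptions, not the authors' implementation: the function name `tokenize_joints`, the zero-padding scheme, the validity mask, and the choice of per-joint features are all hypothetical.

```python
from typing import List, Tuple


def tokenize_joints(
    joint_states: List[List[float]],
    max_joints: int,
    feat_dim: int,
) -> Tuple[List[List[float]], List[bool]]:
    """Pad per-joint state vectors to a fixed (max_joints, feat_dim) layout.

    Assumption: each joint contributes one token, and missing joints or
    missing features are zero-padded. The mask marks which tokens are real.
    """
    if len(joint_states) > max_joints:
        raise ValueError("embodiment has more joints than max_joints")

    tokens: List[List[float]] = []
    mask: List[bool] = []
    for state in joint_states:
        if len(state) > feat_dim:
            raise ValueError("joint state has more features than feat_dim")
        # Zero-pad the feature vector of a real joint.
        tokens.append(list(state) + [0.0] * (feat_dim - len(state)))
        mask.append(True)

    # Fill the remaining slots with padding tokens.
    while len(tokens) < max_joints:
        tokens.append([0.0] * feat_dim)
        mask.append(False)
    return tokens, mask


# Hypothetical example: a 2-joint and a 4-joint embodiment, where each joint
# is described by (position, velocity), share one token layout.
two_joint, two_mask = tokenize_joints([[0.1, 0.2], [0.3, 0.4]], 4, 3)
four_joint, four_mask = tokenize_joints([[0.0, 0.0]] * 4, 4, 3)
```

With a shared layout of this kind, a single network can process any embodiment, and the mask lets attention layers ignore the padded joints.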

Paper Structure

This paper contains 58 sections, 9 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: The overall framework of Meta-Controller. First, the states and actions of various robot embodiments are tokenized into joint-level representations. The state tokens are then encoded by the state encoder to capture knowledge about the embodiments. Finally, a matching-based policy network uses few-shot demonstrations with the encoded state features to predict per-joint actions.
  • Figure 2: The state encoder $f$ consists of two component transformers. Joint-level state tokens are first encoded by the structure encoder $f_s$ along the joint axis, where the positional embedding and a subset of the backbone parameters adapt the model to each embodiment. The features are then passed to the motion encoder $f_m$, which computes causal attention over per-joint features along the temporal axis, where a subset of the backbone parameters adapts the model to both the embodiment and the task.
  • Figure 3: An illustration of the matching-based policy network $\pi$. (a) Each state and action token in few-shot demonstrations is encoded by the corresponding encoders $f$ and $g$, where we use the same encoder $f$ used for the current state. A matching module $\sigma$ then computes the weighted sum of action features based on the joint-wise similarity between state features. Finally, an action decoder $h$ decodes the joint-wise matching output to predict the current action. (b) Both the action encoder $g$ and decoder $h$ are causal transformers operating along the temporal axis of action tokens and features.
  • Figure 4: Qualitative comparison on the hard task of the reacher-four embodiment, visualizing the final states of the demonstrations and the rollout trajectories of each model. In this task, the robot must move its limb tip to the goal position (visualized as a red ball). While most of the baselines converge to one of the poses in the demonstrations and ignore the goal position, our model accurately solves the task with a distinct pose from the demonstrations.
  • Figure 5: Ablation study on the number of demonstrations. We plot the normalized scores for each pair of embodiment and task $(\mathcal{E}, \mathcal{T})$ and their average, with the number of shots set to 5, 10, and 20.
  • ...and 12 more figures
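
The matching step described in the Figure 3 caption (a weighted sum of demonstration action features, weighted by joint-wise similarity between state features) can be sketched as follows. This is a simplified illustration, not the paper's module $\sigma$: the dot-product similarity, the softmax normalization, the `temperature` parameter, and all function names are assumptions made for this example, and the encoders $f$, $g$ and the decoder $h$ are omitted.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_joint(
    query_state: Vector,
    demo_states: List[Vector],
    demo_actions: List[Vector],
    temperature: float = 1.0,
) -> Vector:
    """Similarity-weighted sum of demonstration action features for one joint."""
    scores = [dot(query_state, s) / temperature for s in demo_states]
    weights = softmax(scores)
    action_dim = len(demo_actions[0])
    return [
        sum(w * a[d] for w, a in zip(weights, demo_actions))
        for d in range(action_dim)
    ]


def match_all_joints(
    query: List[Vector],               # [joints][state_dim]
    demo_states: List[List[Vector]],   # [joints][demo_tokens][state_dim]
    demo_actions: List[List[Vector]],  # [joints][demo_tokens][action_dim]
    temperature: float = 1.0,
) -> List[Vector]:
    # Each joint is matched only against demonstration tokens of the same joint.
    return [
        match_joint(q, s, a, temperature)
        for q, s, a in zip(query, demo_states, demo_actions)
    ]
```

Because the output is a convex combination of demonstration action features, the adaptation itself involves no gradient step in this sketch; in the framework described above, the matched features would then be passed to the action decoder $h$.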