Data-Efficient Approach to Humanoid Control via Fine-Tuning a Pre-Trained GPT on Action Data

Siddharth Padmanabhan, Kazuki Miyazawa, Takato Horii, Takayuki Nagai

TL;DR

The paper tackles the problem of data-inefficient, multi-task humanoid control by proposing a GPT-based motion foundation model pre-trained on large observational motion data and fine-tuned on smaller observation+action data. The approach uses a minGPT-57M architecture pre-trained on the MoCapAct Large dataset and then fine-tuned on MoCapAct Small with a new action head, optimized via a cross-entropy loss $L_{ce}$ (and an auxiliary MSE component) to generate physically plausible joint trajectories in a MuJoCo humanoid simulator with 56 DoF. Empirical results demonstrate that the proposed Human Motion Generator (HMG) achieves longer, more stable motion trajectories and competitive motion-prediction metrics (FID, ADE, FDE, DIV) compared to models trained from scratch, while requiring significantly less data and training time (about 52 hours for pre-training and 12 hours for fine-tuning). This data-efficient transfer learning workflow offers a practical path toward a humanoid motion foundation model capable of adapting to multiple tasks with reduced computational cost and dataset size. The work highlights the potential of foundation-model-style learning for robotics, suggesting future gains from conditioning, partial observability, and reward-based fine-tuning to further improve realism and task coverage.
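Two of the metrics named above, ADE and FDE, have simple standard definitions: the average displacement error is the mean Euclidean distance between predicted and reference poses over all timesteps, and the final displacement error is that distance at the last timestep only. The sketch below illustrates these standard definitions on a toy 2-D trajectory. The function name `displacement_errors` and the list-of-lists pose layout are assumptions made for illustration; the paper's exact evaluation protocol (for example, how joints are aggregated) may differ.

```python
import math

def displacement_errors(pred, truth):
    """Return (ADE, FDE) for two equal-length trajectories.

    Each trajectory is a list of poses; each pose is a list of floats.
    This layout is a hypothetical choice for the example.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    # Euclidean distance between predicted and reference pose at each step.
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(dists) / len(dists)  # average over all timesteps
    fde = dists[-1]                # error at the final timestep only
    return ade, fde

# Toy example: the prediction drifts away from the reference over time.
pred = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
truth = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
ade, fde = displacement_errors(pred, truth)
print(ade, fde)  # per-step distances are 0, 1, 2, so ADE = 1.0 and FDE = 2.0
```

A drifting prediction like this one yields an FDE larger than its ADE, which is why the two are usually reported together.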

Abstract

Developing a model for multi-task humanoid control poses several challenges. Reinforcement learning and imitation learning are both popular in this domain, but there is a trade-off between the two. Reinforcement learning is poorly suited to training a humanoid to perform multiple behaviors because of the training time and model size required, while imitation learning from kinematics data alone does not capture the actual physics of the motion. Training models to perform multiple complex tasks takes a long time because of the high number of degrees of freedom (DoF) and the complexity of the movements. Although training models offline would be beneficial, the dataset usually has to be quite large to cover multiple movements. There are few implementations of transformer-based models that control humanoid characters and predict their motion from a large dataset of recorded or reference motion. In this paper, we pre-train a GPT on a large dataset of noisy expert policy rollout observations from a humanoid motion dataset, then fine-tune that model on a smaller dataset of noisy expert policy rollout observations and actions to autoregressively generate physically plausible motion trajectories. We show that it is possible to train a GPT-based foundation model on a smaller dataset and in less training time to control a humanoid in a realistic physics environment and perform human-like movements.

Paper Structure (16 sections, 5 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Proposed approach to training a GPT policy to autoregressively generate physically plausible motion in simulation by pre-training on a large dataset containing only observations and fine-tuning on a small dataset containing observations and actions
  • Figure 2: Detailed proposed approach of HMG. Top: the pre-training phase, where a GPT is trained on a large dataset consisting of only observations. Middle: the fine-tuning phase, where the same GPT weights are used except that the observation head is replaced with an untrained action head to output actions, and the pre-trained model is fine-tuned on a small dataset consisting of both observations and actions. Bottom: inference with the resulting HMG. GPT weights shown in gray are frozen; GPT weights shown in other colors are trainable.
  • Figure 3: MuJoCo humanoid from the dm_control package in the MuJoCo simulator
  • Figure 4: Performance based on Dataset Sizes
  • Figure 5: Gestures. Top: HMG's prediction does not match the ground truth but survives the entire duration of the episode. Bottom: MoCapAct-Large's motion generation also does not accurately predict the gestures, loses balance, and falls.
  • ...and 1 more figure
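
The two-phase workflow described in the Figure 2 caption (pre-train with an observation head, then swap in an untrained action head and fine-tune on a small observation+action dataset) can be illustrated with a deliberately tiny stand-in. The sketch below is not the authors' implementation: it replaces the GPT with a single linear layer, uses a plain squared-error loss, and trains on synthetic data. The class `TinyPolicy`, its methods, the dimensions, and the toy "expert" are all hypothetical. It also assumes the whole backbone is frozen during fine-tuning; the caption only says that some weights are frozen and others trainable, so the `train_backbone` flag is left as a switch.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinyPolicy:
    """Linear stand-in for a GPT backbone plus a swappable output head."""

    def __init__(self, obs_dim, hidden_dim, out_dim, rng):
        self.backbone = rand_matrix(hidden_dim, obs_dim, rng)
        self.head = rand_matrix(out_dim, hidden_dim, rng)

    def forward(self, obs):
        hidden = matvec(self.backbone, obs)
        return hidden, matvec(self.head, hidden)

    def sgd_step(self, obs, target, lr, train_backbone):
        """One gradient step on 0.5 * squared error; returns the mean squared error."""
        hidden, out = self.forward(obs)
        err = [o - t for o, t in zip(out, target)]
        # Gradient w.r.t. the hidden vector, taken before the head is updated.
        grad_hidden = [sum(self.head[i][j] * err[i] for i in range(len(err)))
                       for j in range(len(hidden))]
        for i, e in enumerate(err):
            for j, h in enumerate(hidden):
                self.head[i][j] -= lr * e * h
        if train_backbone:  # skipped when the backbone is frozen
            for j, g in enumerate(grad_hidden):
                for k, x in enumerate(obs):
                    self.backbone[j][k] -= lr * g * x
        return sum(e * e for e in err) / len(err)

rng = random.Random(0)
OBS_DIM, HIDDEN_DIM, ACT_DIM = 3, 4, 2

def sample_obs():
    return [rng.uniform(-1.0, 1.0) for _ in range(OBS_DIM)]

# Phase 1: pre-train on a "large" observation-only dataset (predict the next observation).
model = TinyPolicy(OBS_DIM, HIDDEN_DIM, OBS_DIM, rng)
for _ in range(500):
    cur = sample_obs()
    nxt = [0.9 * x for x in cur]  # synthetic dynamics, purely illustrative
    model.sgd_step(cur, nxt, lr=0.05, train_backbone=True)

# Phase 2: replace the observation head with an untrained action head,
# freeze the backbone (an assumption), and fine-tune on a small obs+action dataset.
frozen = [row[:] for row in model.backbone]
model.head = rand_matrix(ACT_DIM, HIDDEN_DIM, rng)
small_dataset = []
for _ in range(50):
    o = sample_obs()
    small_dataset.append((o, [o[0] - o[1], 0.5 * o[2]]))  # toy "expert" action
for _ in range(20):
    for o, a in small_dataset:
        model.sgd_step(o, a, lr=0.05, train_backbone=False)

# Inference: the fine-tuned model now maps an observation to an action.
_, action = model.forward(sample_obs())
```

After phase 2 the backbone weights are identical to their pre-trained values and only the new head has been updated, which is the property that lets the small action-labelled dataset suffice in this toy setting.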