Data-Efficient Approach to Humanoid Control via Fine-Tuning a Pre-Trained GPT on Action Data
Siddharth Padmanabhan, Kazuki Miyazawa, Takato Horii, Takayuki Nagai
TL;DR
The paper addresses the data inefficiency of multi-task humanoid control by proposing a GPT-based motion foundation model that is pre-trained on a large observation-only motion dataset and then fine-tuned on a smaller dataset of observations and actions. The approach pre-trains a minGPT-57M architecture on the MoCapAct Large dataset and fine-tunes it on MoCapAct Small with a new action head, optimized via a cross-entropy loss $L_{ce}$ (and an auxiliary MSE component), to generate physically plausible joint trajectories in a MuJoCo humanoid simulator with 56 DoF. Empirical results show that the proposed Human Motion Generator (HMG) achieves longer, more stable motion trajectories and competitive motion-prediction metrics (FID, ADE, FDE, DIV) compared with models trained from scratch, while requiring significantly less data and training time (about 52 hours for pre-training and 12 hours for fine-tuning). This data-efficient transfer-learning workflow offers a practical path toward a humanoid motion foundation model that adapts to multiple tasks at reduced computational cost and dataset size. The work highlights the potential of foundation-model-style learning for robotics and suggests that conditioning, partial observability, and reward-based fine-tuning could further improve realism and task coverage.
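Two of the motion-prediction metrics named above, ADE (average displacement error) and FDE (final displacement error), have simple standard definitions: ADE is the mean L2 distance between predicted and reference poses over all timesteps, and FDE is the L2 distance at the last timestep. The following is a minimal sketch of those common definitions only. The function names and the representation of a pose as a flat list of floats are assumptions made for illustration, and the paper's exact evaluation protocol (pose representation, normalization, number of samples) is not specified in this summary.

```python
import math
from typing import Sequence

Pose = Sequence[float]  # assumed: one flat vector of joint values per timestep


def l2_distance(p: Pose, q: Pose) -> float:
    """Euclidean distance between two poses of equal dimension."""
    if len(p) != len(q):
        raise ValueError("poses must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def ade(pred: Sequence[Pose], ref: Sequence[Pose]) -> float:
    """Average displacement error: mean L2 distance over all timesteps."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(l2_distance(p, q) for p, q in zip(pred, ref)) / len(pred)


def fde(pred: Sequence[Pose], ref: Sequence[Pose]) -> float:
    """Final displacement error: L2 distance at the last timestep."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and equally long")
    return l2_distance(pred[-1], ref[-1])


# Toy two-step trajectories in a 2-D pose space (illustrative values only).
predicted = [[0.0, 0.0], [3.0, 4.0]]
reference = [[0.0, 0.0], [0.0, 0.0]]
ade_value = ade(predicted, reference)  # (0 + 5) / 2
fde_value = fde(predicted, reference)  # distance at the final step
```

Lower values are better for both metrics. FDE isolates how far the prediction has drifted by the end of the horizon, while ADE averages the error along the whole trajectory.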
Abstract
There are several challenges in developing a model for multi-task humanoid control. Reinforcement learning and imitation learning are popular approaches in this domain, but there is a trade-off between the two. Reinforcement learning is not the best option for training a humanoid to perform multiple behaviors because of the training time and model size it requires, and imitation learning from kinematics data alone cannot capture the actual physics of the motion. Training models to perform multiple complex tasks takes a long time because of the high number of degrees of freedom (DoF) and the complexity of the movements. Although training models offline would be beneficial, the dataset usually has to be quite large to cover multiple movements. There are few implementations of transformer-based models that control humanoid characters and predict their motion from a large dataset of recorded or reference motion. In this paper, we pre-train a GPT on a large dataset of noisy expert policy rollout observations from a humanoid motion dataset and fine-tune that model on a smaller dataset of noisy expert policy rollout observations and actions to autoregressively generate physically plausible motion trajectories. We show that it is possible to train a GPT-based foundation model on a smaller dataset in a shorter training time to control a humanoid in a realistic physics environment and perform human-like movements.
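The autoregressive generation described above can be pictured as a closed loop: the model reads a window of recent observations, emits an action, the simulator advances one step, and the new observation is appended to the window. The sketch below illustrates that loop only. `rollout`, `ToyEnv`, and `toy_policy` are hypothetical stand-ins, not the authors' model or the MuJoCo environment, and conditioning on observations alone with a fixed-length context is an assumption.

```python
from collections import deque
from typing import Callable, List, Tuple

Vector = List[float]


def rollout(
    policy: Callable[[List[Vector]], Vector],
    step: Callable[[Vector], Vector],
    first_obs: Vector,
    horizon: int,
    context_len: int,
) -> Tuple[List[Vector], List[Vector]]:
    """Generate a trajectory by feeding each new observation back to the policy."""
    context = deque([first_obs], maxlen=context_len)  # oldest entries drop off
    observations, actions = [first_obs], []
    for _ in range(horizon):
        action = policy(list(context))  # predict the next action from the window
        obs = step(action)              # simulator returns the next observation
        actions.append(action)
        observations.append(obs)
        context.append(obs)
    return observations, actions


class ToyEnv:
    """Hypothetical stand-in for a physics simulator: state += action."""

    def __init__(self, state: Vector) -> None:
        self.state = list(state)

    def step(self, action: Vector) -> Vector:
        self.state = [s + a for s, a in zip(self.state, action)]
        return list(self.state)


def toy_policy(context: List[Vector]) -> Vector:
    """Hypothetical stand-in for the fine-tuned model: halve the latest state."""
    return [-0.5 * x for x in context[-1]]


start = [1.0, -2.0]
env = ToyEnv(start)
obs_seq, act_seq = rollout(toy_policy, env.step, start, horizon=3, context_len=2)
```

Because each action depends on observations produced by earlier actions, prediction errors can compound over the rollout, which is why the length and stability of the generated trajectory are natural things to measure.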
