Body Transformer: Leveraging Robot Embodiment for Policy Learning

Carmelo Sferrazza, Dun-Ming Huang, Fangchen Liu, Jongmin Lee, Pieter Abbeel

TL;DR

Body Transformer (BoT) addresses the mismatch between standard transformers and robot morphologies by encoding the robot as a graph of sensors and actuators and applying masked attention that respects the body structure. This embodiment-induced inductive bias improves learning performance, generalization, and scaling in both imitation and reinforcement learning tasks, including a real-world Unitree A1 deployment. The sparsity of the masked attention also yields substantial computational benefits, enabling faster training. The authors point to temporal processing as a future extension that could enhance real-world applicability.

Abstract

In recent years, the transformer architecture has become the de facto standard for machine learning algorithms applied to natural language processing and computer vision. Despite notable evidence of successful deployment of this architecture in the context of robot learning, we claim that vanilla transformers do not fully exploit the structure of the robot learning problem. Therefore, we propose Body Transformer (BoT), an architecture that leverages the robot embodiment by providing an inductive bias that guides the learning process. We represent the robot body as a graph of sensors and actuators, and rely on masked attention to pool information throughout the architecture. The resulting architecture outperforms the vanilla transformer, as well as the classical multilayer perceptron, in terms of task completion, scaling properties, and computational efficiency when representing either imitation or reinforcement learning policies. Additional material including the open-source code is available at https://sferrazza.cc/bot_site.
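To make the phrase "masked attention" concrete, the following is a minimal NumPy sketch of a single attention step restricted by a binary mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `masked_self_attention` is hypothetical, the query/key/value projections are replaced by the identity, and the mask is assumed to have ones on its diagonal so that every node can attend at least to itself.

```python
import numpy as np

def masked_self_attention(x, mask):
    """Single-head self-attention in which node i attends to node j only if mask[i, j] == 1.

    x:    (n, d) array, one embedding per sensor/actuator node.
    mask: (n, n) binary array; assumed to have ones on the diagonal.
    Learned Q/K/V projections are omitted (identity) to keep the sketch short.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # (n, n) similarity scores
    scores = np.where(mask.astype(bool), scores, -np.inf)  # block disallowed pairs
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)                           # exp(-inf) == 0 for blocked pairs
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Hypothetical 3-node chain 0 - 1 - 2: nodes 0 and 2 are not directly connected.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
mask = np.array([[1, 1, 0],
                 [1, 1, 1],
                 [0, 1, 1]])
out, weights = masked_self_attention(x, mask)
```

In this sketch the attention weight between nodes 0 and 2 is exactly zero after one layer, so information between them can only flow through node 1.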

Paper Structure (26 sections, 4 equations, 13 figures)

Figures (13)

  • Figure 1: Body Transformer (BoT) is an architecture that treats a physical agent as a graph whose nodes are sensors and actuators and whose edges reflect the structure of the robot body. BoT leverages masked attention as a simple but flexible mechanism to provide a body-induced bias to the policy. The figure presents the overall schematic of the architecture, exemplified on a Unitree A1 robot.
  • Figure 2: Formulation of the Embodiment Mask. The mask $M$ is constructed by adding a diagonal of $1$s to the adjacency matrix of the embodiment graph. The figure visualizes a simple example of a mask $M$ for an arbitrary agent's embodiment with $n=10$ nodes.
  • Figure 3: BoT Performance on Imitation Learning.
  • Figure 4: Adroit Hand Door, Hammer, and Relocate Tasks (See Results in the Appendix).
  • Figure 5: Reinforcement Learning Performance on Robotic Control Tasks.
  • ...and 8 more figures
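
The mask construction described in the Figure 2 caption (adjacency matrix plus a diagonal of ones) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the helper name `embodiment_mask` and the 10-node edge list are hypothetical and do not correspond to any specific robot in the paper. The graph is assumed to be undirected, and the diagonal is set with `np.fill_diagonal`, which equals adding the identity matrix whenever the edge list contains no self-loops.

```python
import numpy as np

def embodiment_mask(n, edges):
    """Build a binary (n, n) mask from an undirected edge list.

    n:     number of nodes (sensors/actuators) in the body graph.
    edges: iterable of (i, j) index pairs; hypothetical input format.
    """
    mask = np.zeros((n, n), dtype=np.int64)
    for i, j in edges:
        mask[i, j] = 1   # undirected graph: mark both directions
        mask[j, i] = 1
    np.fill_diagonal(mask, 1)  # every node may attend to itself
    return mask

# Hypothetical body with n = 10: a root node 0 and three 3-node limbs.
edges = [(0, 1), (1, 2), (2, 3),
         (0, 4), (4, 5), (5, 6),
         (0, 7), (7, 8), (8, 9)]
M = embodiment_mask(10, edges)
density = M.sum() / M.size  # fraction of attention pairs left unmasked
```

For this made-up graph, only 28 of the 100 entries of `M` are nonzero (10 on the diagonal plus two per edge), which illustrates the kind of sparsity the TL;DR refers to.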