Learning High-Fidelity Robot Self-Model with Articulated 3D Gaussian Splatting

Kejun Hu, Peng Yu, Ning Tan

TL;DR

This work presents a texture-aware robot self-modeling framework based on 3D Gaussian Splatting (3DGS) that jointly models morphology, texture, and kinematics from multi-view RGB images. A static self-model is first constructed from rest-pose data, then neural ellipsoid bones and a kinematic network are learned to deform the Gaussian representation according to joint angles, enabling real-time rendering and control via Gaussian splatting. The approach achieves high geometric fidelity and texture visualization at the link level, outperforming depth-dependent methods while enabling downstream tasks such as motion planning and inverse kinematics. The method is validated in simulation and on a physical robot, demonstrating accurate rendering, mesh extraction, and practical applications like reaching, surface touching, obstacle avoidance, and IK estimation, with potential for broader robotic platforms and extensions to soft robots.

Abstract

Self-modeling enables robots to build task-agnostic models of their morphology and kinematics based on data that can be automatically collected, with minimal human intervention and prior information, thereby enhancing machine intelligence. Recent research has highlighted the potential of data-driven technology in modeling the morphology and kinematics of robots. However, existing self-modeling methods suffer from either low modeling quality or excessive data acquisition costs. Beyond morphology and kinematics, texture is also a crucial component of robots, which is challenging to model and remains unexplored. In this work, a high-quality, texture-aware, and link-level method is proposed for robot self-modeling. We utilize three-dimensional (3D) Gaussians to represent the static morphology and texture of robots, and cluster the 3D Gaussians to construct neural ellipsoid bones, whose deformations are controlled by the transformation matrices generated by a kinematic neural network. The 3D Gaussians and kinematic neural network are trained using data pairs composed of joint angles, camera parameters and multi-view images without depth information. By feeding the kinematic neural network with joint angles, we can utilize the well-trained model to describe the corresponding morphology, kinematics and texture of robots at the link level, and render robot images from different perspectives with the aid of 3D Gaussian splatting. Furthermore, we demonstrate that the established model can be exploited to perform downstream tasks such as motion planning and inverse kinematics.

Paper Structure

This paper contains 23 sections, 29 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Robot self-modeling from images. (a) Given a robot whose model is unknown, our method can build a self-model of the robot from multiple RGB images. We capture multi-view photos of the robot at different joint configurations, and train the 3D Gaussians and the kinematic network, which are responsible for representing the static morphology and surface color of robots and neural bone deformation, respectively. The well-trained model can be utilized to render robot images with the aid of 3D Gaussian splatting and perform downstream tasks. (b) Compared to the methods proposed by chen2022fully and schulze2024high, our method can learn a high-quality, surface color-aware, and link-level self-model.
  • Figure 2: Overview of our method. We first train a static 3DGS self-model with images of the robot at zero pose from various views. Then, we initialize neural ellipsoid bones to model the robot kinematic chain. Using images of the robot at different joint configurations, we train a dynamic self-model. The neural bones are deformed, and the associated 3D Gaussians are driven by linear blend skinning (LBS). Finally, the model is rendered into images and compared with the ground truth images.
  • Figure 3: Illustration of 3D Gaussian deformation. We use one 3D Gaussian and two neural bones as an example. The deformation of the 3D Gaussian is a weighted sum of the transformations of the two bones. The weight has two parts: one is based on the distance between the 3D Gaussian and the neural bones, and the other is predicted by a weight MLP.
  • Figure 4: Self-modeling result and ground truth in the simulation environment. Given a random joint configuration, we rendered the self-model to images from 3 different view angles. The joint configurations are shown in radians.
  • Figure 5: Mesh extracted from the point cloud formed by 3D Gaussians. We illustrate five views of the extracted meshes: one top-down view along with four orthogonal views (front, back, left, and right).
  • ...and 6 more figures
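
The deformation described in the Figure 3 caption (one 3D Gaussian, two neural bones, a weighted sum of bone transformations) can be illustrated with a minimal sketch. It is not the paper's implementation. Several details are assumptions made for illustration: the bones are axis-aligned ellipsoids, the distance-based part of the weight is a softmax over negative squared Mahalanobis distances, the output of the weight MLP is replaced by a hypothetical `delta_logits` list added to those logits, and only the Gaussian center is moved (a full model would also transform the covariance).

```python
import math

def mahalanobis_sq(point, center, radii):
    """Squared distance to a bone center, scaled per axis by the ellipsoid radii."""
    return sum(((p - c) / r) ** 2 for p, c, r in zip(point, center, radii))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def skinning_weights(point, bones, delta_logits=None):
    """Distance-based logits, optionally shifted by a learned residual.

    delta_logits is a hypothetical stand-in for the weight MLP output;
    how the two parts are combined in the paper is not stated in the caption.
    """
    logits = [-mahalanobis_sq(point, b["center"], b["radii"]) for b in bones]
    if delta_logits is not None:
        logits = [l + d for l, d in zip(logits, delta_logits)]
    return softmax(logits)

def apply_rigid(rotation, translation, point):
    """Apply x' = R x + t with R as a 3x3 nested list."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]

def blend_point(point, bones, transforms, delta_logits=None):
    """Linear blend skinning: weighted sum of the per-bone transformed points."""
    weights = skinning_weights(point, bones, delta_logits)
    moved = [apply_rigid(R, t, point) for R, t in transforms]
    return [sum(w * m[i] for w, m in zip(weights, moved)) for i in range(3)]

# Two bones laid end to end along the x axis (illustrative values).
bones = [
    {"center": [0.0, 0.0, 0.0], "radii": [0.5, 0.2, 0.2]},
    {"center": [1.0, 0.0, 0.0], "radii": [0.5, 0.2, 0.2]},
]
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity = (eye, [0.0, 0.0, 0.0])   # first bone stays fixed
lift = (eye, [0.0, 0.0, 1.0])       # second bone translates up by 1

# A Gaussian center halfway between the bones is pulled halfway up.
midpoint = [0.5, 0.0, 0.0]
blended = blend_point(midpoint, bones, [identity, lift])
```

In this toy setup the midpoint is equidistant from both bones, so it receives equal weights and moves half of the second bone's translation. A Gaussian sitting at a bone center follows that bone almost rigidly, and a nonzero `delta_logits` shows how a learned correction could override the purely geometric assignment.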