Differentiable Robot Rendering

Ruoshi Liu, Alper Canberk, Shuran Song, Carl Vondrick

TL;DR

Quantitative and qualitative results show that the differentiable rendering model provides effective gradients for robotic control directly from pixels, laying the foundation for future applications of vision foundation models in robotics.

Abstract

Vision foundation models trained on massive amounts of visual data have shown unprecedented reasoning and planning skills in open-world settings. A key challenge in applying them to robotic tasks is the modality gap between visual data and action data. We introduce differentiable robot rendering, a method that makes the visual appearance of a robot body directly differentiable with respect to its control parameters. Our model integrates a kinematics-aware deformable model with Gaussian Splatting and is compatible with any robot form factor and any number of degrees of freedom. We demonstrate its capability and usage in applications including reconstruction of robot poses from images and controlling robots through vision-language models. Quantitative and qualitative results show that our differentiable rendering model provides effective gradients for robotic control directly from pixels, laying the foundation for future applications of vision foundation models in robotics.
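
One way to read "differentiable with respect to its control parameters" is as a chain rule through the two stages the abstract names, the kinematics-aware deformable model and Gaussian Splatting. The symbols below are notation assumed here for illustration and are not taken from the paper: \theta is the vector of joint angles, G(\theta) the posed Gaussian parameters, I the rendered image, and \mathcal{L} any image-space loss (for example a pixel error or a similarity score from a vision model).

```latex
\frac{\partial \mathcal{L}}{\partial \theta}
  = \frac{\partial \mathcal{L}}{\partial I}
    \cdot \frac{\partial I}{\partial G}
    \cdot \frac{\partial G}{\partial \theta},
\qquad I = \mathrm{Splat}\bigl(G(\theta)\bigr)
```

Under this reading, any loss computed on pixels can be pushed back to joint angles, provided each factor is differentiable.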

Paper Structure

This paper contains 17 sections, 10 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: We introduce Differentiable Rendering of Robots (Dr. Robot), a robot self-model which is differentiable from its visual appearance to its control parameters. With it, we can perform control and planning of robot actions through image gradients provided by visual foundation models.
  • Figure 2: Rendering Pipeline. Our robot model is composed of three differentiable components. Forward kinematics projects a pose vector into a skeleton, Implicit LBS projects 3D Gaussians onto the robot surface, and Appearance Deformation adjusts the appearance of the 3D Gaussians.
  • Figure 3: Visual Quality of Robot Model. Here, we showcase the learned robot model's visual quality by comparing it with results obtained from Deformable Gaussians [wu20234d]. Due to the complicated kinematic structure of a robot, Deformable Gaussians [wu20234d] is unable to fit the deformation, while ours can.
  • Figure 4: Robot Pose Estimation from a Single Image. From an input image, we perform optimization to reconstruct the joint angles of the robot and overlay the final rendering of the robot on top of the input image. These results show that accurate robot poses can be reconstructed from only a single image through our robot model. This also demonstrates that our robot model provides high-quality gradients for action optimization.
  • Figure 5: Text-to-Robot Hand Gestures. We perform optimization of the joint angles of a Shadow Hand to maximize the CLIP similarity between the rendered image and the text prompt. We show the optimization process (left) as well as the final outputs for different prompts (right).
  • ...and 2 more figures
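
The optimization loop behind the Figure 4 caption (recovering joint angles from a single image by following image gradients) can be sketched with a toy stand-in. The example below is not the paper's model. It assumes a hypothetical two-joint planar arm, renders one isotropic 2D Gaussian blob per joint on a 16x16 grid in place of full Gaussian Splatting, and uses finite differences in place of automatic differentiation. All names (`forward_kinematics`, `render`, `fit_pose`) and constants are illustrative.

```python
import math

GRID = 16           # toy image is GRID x GRID pixels (assumption)
SIGMA = 1.5         # blob width in pixels (assumption)
LINKS = (4.0, 3.0)  # link lengths in pixels (assumption)

def forward_kinematics(angles):
    """Map joint angles to 2D joint positions of a planar arm rooted at the image centre."""
    x, y, total = GRID / 2, GRID / 2, 0.0
    points = [(x, y)]
    for length, angle in zip(LINKS, angles):
        total += angle
        x += length * math.cos(total)
        y += length * math.sin(total)
        points.append((x, y))
    return points

def render(angles):
    """Splat one isotropic Gaussian per joint onto a flat grayscale image."""
    points = forward_kinematics(angles)
    image = []
    for row in range(GRID):
        for col in range(GRID):
            image.append(sum(
                math.exp(-((col - px) ** 2 + (row - py) ** 2) / (2 * SIGMA ** 2))
                for px, py in points))
    return image

def pixel_loss(angles, target):
    """Mean squared error between the rendering and the target image."""
    return sum((a - b) ** 2 for a, b in zip(render(angles), target)) / len(target)

def gradient(angles, target, eps=1e-5):
    """Central finite differences; stands in for autodiff in this sketch."""
    grad = []
    for i in range(len(angles)):
        hi = list(angles); hi[i] += eps
        lo = list(angles); lo[i] -= eps
        grad.append((pixel_loss(hi, target) - pixel_loss(lo, target)) / (2 * eps))
    return grad

def fit_pose(target, init, steps=100, lr=5.0):
    """Gradient descent on joint angles with backtracking, so the loss never increases."""
    angles = list(init)
    loss = pixel_loss(angles, target)
    for _ in range(steps):
        grad = gradient(angles, target)
        step = lr
        for _ in range(20):
            candidate = [a - step * g for a, g in zip(angles, grad)]
            cand_loss = pixel_loss(candidate, target)
            if cand_loss < loss:
                angles, loss = candidate, cand_loss
                break
            step *= 0.5  # shrink the step until the loss improves
        else:
            break  # no improving step found; stop early
    return angles, loss

true_angles = [0.6, -0.4]            # pose used to synthesize the "input image"
target_image = render(true_angles)
init_angles = [0.3, -0.1]            # starting guess for the optimizer
initial_loss = pixel_loss(init_angles, target_image)
estimated_angles, final_loss = fit_pose(target_image, init_angles)
print(initial_loss, final_loss, estimated_angles)
```

Because each step is accepted only when it lowers the pixel loss, the final loss is no worse than the initial one; how close the estimate gets to `true_angles` depends on the starting guess, since image-space losses of this kind can have local minima. Replacing `pixel_loss` with a score from a vision model would give the same loop a different objective, which is the pattern the Figure 5 caption describes with CLIP similarity.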