Latent Representations for Visual Proprioception in Inexpensive Robots

Sahara Sheikholeslami, Ladislau Bölöni

TL;DR

This work addresses visual proprioception for inexpensive robots by asking whether a fast, single-pass regression can infer the robot's 6-DoF configuration from a single RGB image using a compact latent representation $\mathbf{z}_\textit{prop}$ of size 128 or 256. It compares four latent-encoder families (Conv-VAE, proprioception-tuned CNNs, Vision Transformers, and bags of uncalibrated ArUco markers) and deploys a uniform MLP regressor across all of them to map $\mathbf{z}_\textit{prop}$ to $\mathbf{a} \in [0,1]^6$. Experiments with a 6-DoF Lynxmotion arm show that accuracy depends on the configuration component: heading is easiest, while wrist rotation and gripper state are hardest. 128-dimensional latents often perform on par with or better than 256-dimensional ones, though performance varies by representation. The findings demonstrate feasible, low-computation visual proprioception for inexpensive robots and provide guidance on representation choice and practical deployment, with future work including temporal filtering and additional sensing.
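To make the regression step concrete, the following is a minimal sketch of an MLP that maps a latent vector to a six-component configuration in the unit interval. It is not the authors' implementation: the hidden width, the ReLU activation, the sigmoid output, and the names `init_regressor` and `regress_configuration` are all assumptions made for illustration, and the weights are random rather than trained. Only the latent size (128) and the six-dimensional output range come from the text above.

```python
import numpy as np

LATENT_DIM = 128   # the text mentions latents of size 128 or 256
NUM_DOF = 6        # six components of the robot configuration
HIDDEN_DIM = 64    # hypothetical hidden width, not taken from the paper


def init_regressor(rng, latent_dim=LATENT_DIM, hidden_dim=HIDDEN_DIM, out_dim=NUM_DOF):
    """Randomly initialise a two-layer MLP (illustrative, untrained weights)."""
    return {
        "w1": rng.normal(0.0, 1.0 / np.sqrt(latent_dim), (latent_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def regress_configuration(params, z_prop):
    """Map a batch of latents (N, latent_dim) to configurations in the unit interval."""
    hidden = np.maximum(0.0, z_prop @ params["w1"] + params["b1"])  # ReLU (assumed)
    logits = hidden @ params["w2"] + params["b2"]
    # Sigmoid output (assumed) keeps every component between 0 and 1.
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
params = init_regressor(rng)
z_batch = rng.normal(size=(4, LATENT_DIM))  # stand-in for encoder outputs
a_hat = regress_configuration(params, z_batch)
print(a_hat.shape)  # (4, 6)
```

In a real pipeline the random `z_batch` would be replaced by the output of one of the four encoders, and the weights would be fitted to recorded image and configuration pairs. Keeping the regressor identical across encoders is what lets the comparison isolate the effect of the latent representation.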

Abstract

Robotic manipulation requires explicit or implicit knowledge of the robot's joint positions. Precise proprioception is standard in high-quality industrial robots but is often unavailable in inexpensive robots operating in unstructured environments. In this paper, we ask: to what extent can a fast, single-pass regression architecture perform visual proprioception from a single external camera image, available even in the simplest manipulation settings? We explore several latent representations, including CNNs, VAEs, ViTs, and bags of uncalibrated fiducial markers, using fine-tuning techniques adapted to the limited data available. We evaluate the achievable accuracy through experiments on an inexpensive 6-DoF robot.

Paper Structure

This paper contains 10 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: (a) The experimental setup for visual proprioception. An inexpensive robot with six degrees of freedom (Lynxmotion AL5D) is observed by a low-resolution camera. Neither the robot nor the camera is calibrated. The objective is to recover the configuration of the robot from a single captured RGB image. (b) Four examples of robot observations illustrating the challenges of proprioception.
  • Figure 2: (a) Proprioception regression. From the observation $o$, the latent encoder creates the latent representation $\mathbf{z}$. The proprioception regressor then produces an approximation $\hat{\mathbf{a}}$ of the robot configuration. (b-e) Four variations of encoders used to obtain the proprioception-dedicated latent representation $\mathbf{z}_\textit{prop}$ (in green). For all encoders, the components outside the encoder block support the surrogate losses and are discarded after training.
  • Figure 3: Accuracy (a) and tracking (b-f) results based on observations from the side camera.
  • Figure 4: Accuracy (a) and tracking (b-f) results based on observations from the front camera.