Latent Representations for Visual Proprioception in Inexpensive Robots
Sahara Sheikholeslami, Ladislau Bölöni
TL;DR
This work addresses visual proprioception for inexpensive robots by asking whether a fast, single-pass regression can infer the robot's 6-DoF configuration from a single RGB image using a compact latent representation $\mathbf{z}_{prop}$ of size $128$ or $256$. It compares four families of latent encoders (Conv-VAEs, proprioception-tuned CNNs, Vision Transformers, and bags of uncalibrated ArUco markers) and uses the same MLP regressor architecture for all of them to map $\mathbf{z}_{prop}$ to $\mathbf{a} \in [0,1]^6$. Experiments with a 6-DoF Lynxmotion arm show that accuracy varies across the components of $\mathbf{a}$: heading is the easiest to estimate, while wrist rotation and gripper state are the hardest. 128-dimensional latents often perform on par with or better than 256-dimensional ones, though the outcome varies by representation. The findings indicate that low-computation visual proprioception is feasible for inexpensive robots and provide guidance on representation choice and practical deployment. Future work includes temporal filtering and additional sensing.
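The regression step described above can be illustrated with a minimal sketch of a forward pass from a latent vector to a normalized 6-component configuration. It is not the authors' implementation: the hidden width (`HIDDEN_DIM`), the single hidden layer, the ReLU activation, the sigmoid output used to keep values in the unit interval, and the random, untrained weights are all assumptions made for illustration, and the latent vector is a random stand-in for an encoder output.

```python
import math
import random

LATENT_DIM = 128   # the text considers latents of size 128 or 256
HIDDEN_DIM = 64    # hypothetical hidden width, not taken from the paper
NUM_DOF = 6        # one output per degree of freedom


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small uniform random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Affine map: one dot product plus bias per output unit."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, t) for t in v]


def sigmoid(v):
    # Assumed output squashing so that each component lies in the unit interval.
    return [1.0 / (1.0 + math.exp(-t)) for t in v]


def regress_configuration(z_prop, layers):
    """Map a latent vector z_prop to a normalized configuration estimate."""
    hidden = relu(linear(z_prop, layers[0]))
    return sigmoid(linear(hidden, layers[1]))


rng = random.Random(0)
layers = [init_layer(LATENT_DIM, HIDDEN_DIM, rng),
          init_layer(HIDDEN_DIM, NUM_DOF, rng)]

# Random stand-in for the output of any of the four encoder families.
z_prop = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
a_hat = regress_configuration(z_prop, layers)
print(a_hat)
```

Because the regressor only sees the latent vector, the same function can be reused unchanged with any encoder that produces a vector of length `LATENT_DIM`, which is what makes a like-for-like comparison of representations possible.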
Abstract
Robotic manipulation requires explicit or implicit knowledge of the robot's joint positions. Precise proprioception is standard in high-quality industrial robots but is often unavailable in inexpensive robots operating in unstructured environments. In this paper, we ask: to what extent can a fast, single-pass regression architecture perform visual proprioception from a single external camera image, available even in the simplest manipulation settings? We explore several latent representations, including CNNs, VAEs, ViTs, and bags of uncalibrated fiducial markers, using fine-tuning techniques adapted to the limited data available. We evaluate the achievable accuracy through experiments on an inexpensive 6-DoF robot.
