Daily Assistive View Control Learning of Low-Cost Low-Rigidity Robot via Large-Scale Vision-Language Model
Kento Kawaharazuka, Naoaki Kanazawa, Yoshiki Obinata, Kei Okada, Masayuki Inaba
TL;DR
This work adapts a low-cost, low-rigidity robot arm to open-vocabulary view control driven by linguistic instructions. It proposes SPNPB, a stochastic predictive network that learns the probabilistic mapping between CLIP vision features and the robot’s physical state. A 2-D parametric bias (PB), updated online, encodes time- and environment-dependent changes in that mapping. In basic and advanced experiments, the authors show that combining probabilistic prediction with online PB updates yields accurate, instruction-guided gaze control across varying objects and environments, outperforming models without PB. The results indicate practical potential for adaptive, linguistically guided perception in daily-assistive robotics and point to future multi-modal extensions and API integrations for broader task execution.
Abstract
In this study, we develop a simple daily assistive robot that controls its own view according to linguistic instructions. The robot performs daily tasks such as recording a user's face, hands, or screen, and remotely capturing images of desired locations. To construct such a robot, we combine a pre-trained large-scale vision-language model with a low-cost, low-rigidity robot arm. The correlation between the robot's physical and visual information is learned probabilistically by a neural network, and changes in this probability distribution over time and across environments are captured by parametric bias, a learnable network input variable. We demonstrate the effectiveness of this learning method through open-vocabulary view control experiments with an actual robot arm, MyCobot.
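The idea of a probabilistic predictor with a parametric bias input can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the random "trained" weights, the 4-D stand-in for a vision feature, the finite-difference gradient, and the step-halving rule are all assumptions chosen to keep the example short. It shows a network that maps joint angles plus a 2-D parametric bias to the mean and variance of a feature, and an online update that adjusts only the parametric bias while the weights stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 joint angles, 2-D parametric bias (PB),
# 16 hidden units, 4-D stand-in for a vision feature vector.
N_JOINT, N_PB, N_HID, N_FEAT = 3, 2, 16, 4

# Stand-ins for trained weights (random here, for illustration only).
W1 = rng.normal(scale=0.5, size=(N_JOINT + N_PB, N_HID))
W_MU = rng.normal(scale=0.5, size=(N_HID, N_FEAT))
W_LOGVAR = rng.normal(scale=0.1, size=(N_HID, N_FEAT))

def predict(joints, pb):
    """Return the predicted mean and variance of the feature for each joint vector."""
    pb_tiled = np.broadcast_to(pb, (joints.shape[0], N_PB))
    hidden = np.tanh(np.concatenate([joints, pb_tiled], axis=1) @ W1)
    return hidden @ W_MU, np.exp(hidden @ W_LOGVAR)

def gaussian_nll(joints, feats, pb):
    """Mean Gaussian negative log-likelihood, omitting the constant 0.5*log(2*pi)."""
    mean, var = predict(joints, pb)
    return float(np.mean(0.5 * (np.log(var) + (feats - mean) ** 2 / var)))

def pb_gradient(joints, feats, pb, eps=1e-5):
    """Central finite-difference gradient of the loss with respect to PB only."""
    grad = np.zeros_like(pb)
    for i in range(pb.size):
        step = np.zeros_like(pb)
        step[i] = eps
        grad[i] = (gaussian_nll(joints, feats, pb + step)
                   - gaussian_nll(joints, feats, pb - step)) / (2 * eps)
    return grad

def update_pb(joints, feats, pb, lr=0.05, steps=200):
    """Online PB update: descend on PB with frozen weights, halving lr on a failed step."""
    pb = pb.copy()
    loss = gaussian_nll(joints, feats, pb)
    for _ in range(steps):
        candidate = pb - lr * pb_gradient(joints, feats, pb)
        cand_loss = gaussian_nll(joints, feats, candidate)
        if cand_loss < loss:
            pb, loss = candidate, cand_loss
        else:
            lr *= 0.5
    return pb, loss

# Synthetic "current environment": data sampled from the model under an unknown PB.
true_pb = np.array([1.0, -1.0])
joints = rng.uniform(-1.0, 1.0, size=(64, N_JOINT))
mean, var = predict(joints, true_pb)
feats = mean + np.sqrt(var) * rng.normal(size=mean.shape)

pb_before = np.zeros(N_PB)
loss_before = gaussian_nll(joints, feats, pb_before)
pb_after, loss_after = update_pb(joints, feats, pb_before)
print(loss_before, loss_after)
```

Because only the low-dimensional PB is adjusted, adaptation to a new condition needs few samples and leaves the learned mapping untouched. A practical system would more likely obtain the gradient by backpropagation in an automatic differentiation framework than by finite differences.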
