Daily Assistive View Control Learning of Low-Cost Low-Rigidity Robot via Large-Scale Vision-Language Model

Kento Kawaharazuka, Naoaki Kanazawa, Yoshiki Obinata, Kei Okada, Masayuki Inaba

TL;DR

This work adapts a low-cost, low-rigidity robot arm to open-vocabulary view control driven by linguistic instructions. It proposes SPNPB, a stochastic predictive network with parametric bias that learns the probabilistic mapping between CLIP vision features and the robot’s physical state; a 2-D parametric bias (PB), updated online, encodes time- and environment-dependent changes. In basic and advanced experiments, the authors show that combining probabilistic prediction with online PB updates yields accurate, instruction-guided view control across varying objects and environments, outperforming models without PB. The results indicate practical potential for adaptive, linguistically guided perception in daily assistive robotics and point to future multi-modal extensions and API integrations for broader task execution.

Abstract

In this study, we develop a simple daily assistive robot that controls its own vision according to linguistic instructions. The robot performs several daily tasks such as recording a user's face, hands, or screen, and remotely capturing images of desired locations. To construct such a robot, we combine a pre-trained large-scale vision-language model with a low-cost low-rigidity robot arm. The correlation between the robot's physical and visual information is learned probabilistically using a neural network, and changes in the probability distribution based on changes in time and environment are considered by parametric bias, which is a learnable network input variable. We demonstrate the effectiveness of this learning method by open-vocabulary view control experiments with an actual robot arm, MyCobot.
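
The idea of a parametric bias as "a learnable network input variable" can be illustrated with a small sketch. This is not the authors' network: it assumes a toy linear-Gaussian predictor in place of the neural network, random vectors in place of CLIP features, and hypothetical names throughout (`predict_mean`, `nll`, `update_pb`, `A`, `B`, `log_var`). It shows only the mechanism: the trained weights stay frozen, and the low-dimensional PB alone is adjusted by gradient descent on a Gaussian negative log-likelihood so that predictions match newly observed data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions); only the 2-D PB is taken from the summary above.
D_FEAT, D_JOINT, D_PB = 8, 3, 2

# Frozen "trained" parameters of a toy linear-Gaussian predictor (hypothetical).
A = rng.normal(size=(D_FEAT, D_JOINT))  # joint angles -> feature mean
B = rng.normal(size=(D_FEAT, D_PB))     # parametric bias -> feature mean
log_var = np.full(D_FEAT, -2.0)         # per-dimension log-variance


def predict_mean(theta, pb):
    """Predicted feature mean for joint angles theta of shape (N, D_JOINT)."""
    return theta @ A.T + pb @ B.T


def nll(theta, z, pb):
    """Gaussian negative log-likelihood (up to a constant), averaged over samples."""
    resid = z - predict_mean(theta, pb)
    return 0.5 * np.mean(np.sum(resid**2 / np.exp(log_var) + log_var, axis=1))


def update_pb(theta, z, pb, lr=0.002, steps=2000):
    """Gradient descent on the PB only; A, B, and log_var are never modified."""
    pb = pb.copy()
    for _ in range(steps):
        resid = z - predict_mean(theta, pb)
        # d(nll)/d(pb) = -B^T (resid / var), averaged over the batch
        grad = -np.mean(resid / np.exp(log_var), axis=0) @ B
        pb -= lr * grad
    return pb


# Synthetic "new environment": data generated with an unknown PB value.
pb_true = np.array([0.5, -0.3])
theta = rng.uniform(-1.0, 1.0, size=(64, D_JOINT))
z = predict_mean(theta, pb_true) + 0.01 * rng.normal(size=(64, D_FEAT))

pb_init = np.zeros(D_PB)
pb_hat = update_pb(theta, z, pb_init)
print("estimated PB:", pb_hat)
```

In this toy setting the estimated PB moves from the origin toward the value that generated the data. In the same way, a change in camera angle or object arrangement could be absorbed by a shift in PB rather than by retraining the whole network.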

Paper Structure

This paper contains 11 sections, 3 equations, 8 figures.

Figures (8)

  • Figure 1: Open-vocabulary view control of a low-cost low-rigidity robot arm for daily assistive tasks. The lower figures show the image from the web camera attached to the arm-tip.
  • Figure 2: The system overview including Vision-Language Model CLIP, Data Collector, Network Trainer, PB Updater, and Controller.
  • Figure 3: The setup of the basic experiment. The upper figures show the changes in physical state (two changes in the angle of the web camera attached to the tip of the robot arm) and the lower figure shows the changes in environmental state (three changes in the arrangement of the five target objects).
  • Figure 4: The arrangement of the trained parametric bias and its trajectory during the online update of parametric bias regarding three environmental and physical states in the basic experiment.
  • Figure 5: The open-vocabulary view control of the basic experiment. The lower figures show the images from the web camera attached to the arm-tip.
  • ...and 3 more figures