Reflectance Estimation for Proximity Sensing by Vision-Language Models: Utilizing Distributional Semantics for Low-Level Cognition in Robotics

Masashi Osada, Gustavo A. Garcia Ricardez, Yosuke Suzuki, Tadahiro Taniguchi

TL;DR

The paper addresses the difficulty of estimating object reflectance, which is crucial for calibrating proximity sensors in robotic grasping, from images alone. It experimentally investigates whether distributional semantics in text-only LLMs (GPT-3.5, GPT-4) can estimate reflectance from language, and whether multimodal VLMs (CLIP) can improve reflectance estimation from image-language signals, using few-shot prompting and object descriptions. Results show GPT-4 achieves around 14.7% mean error with text alone, CLIP-based methods achieve around 11.8% with images, and multimodal fusion can further reduce error, demonstrating that distributional semantics and latent language structure enhance low-level cognition for robotics. The findings suggest that tacit linguistic knowledge embedded in LLMs/VLMs can generalize reflectance estimation to unseen objects, facilitate sensor calibration for grasping, and motivate extending these methods to other physical properties and denser reflectance mappings for improved manipulation.

Abstract

Large language models (LLMs) and vision-language models (VLMs) have been increasingly used in robotics for high-level cognition, but their use for low-level cognition, such as interpreting sensor information, remains underexplored. In robotic grasping, estimating the reflectance of objects is crucial for successful grasping, as it significantly impacts the distance measured by proximity sensors. We investigate whether LLMs can estimate reflectance from object names alone, leveraging the embedded human knowledge in distributional semantics, and whether the latent structure of language in VLMs positively affects image-based reflectance estimation. In this paper, we verify that 1) LLMs such as GPT-3.5 and GPT-4 can estimate an object's reflectance using only text as input; and 2) VLMs such as CLIP can increase their generalization capabilities in reflectance estimation from images. Our experiments show that GPT-4 can estimate an object's reflectance using only text input with a mean error of 14.7%, lower than the image-only ResNet. Moreover, CLIP achieved the lowest mean error of 11.8%, while GPT-3.5 obtained a competitive 19.9% compared to ResNet's 17.8%. These results suggest that the distributional semantics in LLMs and VLMs increases their generalization capabilities, and the knowledge acquired by VLMs benefits from the latent structure of language.

Paper Structure (20 sections, 13 figures, 8 tables)

This paper contains 20 sections, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Reflectance estimation for proximity sensing. The LLMs and VLMs estimate the reflectance of objects either from text (LLMs) or from images (VLMs). Together with the sensors' output and the intrinsic parameters of the sensors, the estimated reflectance is used to measure the distance to target objects in preparation for grasping.
  • Figure 2: Overview of the three groups of methods used in the experiments. a) LLMs undergo pre-training that is either text-based or involves both text and images, without additional training. b) VLMs employ text and image pre-training, with additional image-based training. c) Compared methods use image-only pre-training and additional image training.
  • Figure 3: Objects used in the experiments. The objects on the left were used during the training phase (training set), while those on the right were used during the testing phase (test set). The latter are unknown/unseen to the compared methods. Besides regular objects, we also conducted experiments on irregular objects.
  • Figure 4: Setup to obtain the reflectance (ground truth) of objects at known distances. Given $I_{\text{all}} = \alpha (d + d_{0})^{-n}$, where $I_{\text{all}}$ is the current value obtained by combining the outputs of all phototransistors (measured) and $d_0$ and $n$ are intrinsic parameters (known), we can calculate the reflectance $\alpha$ if the distance $d$ between the object and the sensor is also known. We use the least squares method to obtain the ground truth value of $\alpha$ by placing the object at various distances.
  • Figure 5: Mean error of the compared methods.
  • ...and 8 more figures
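
The sensor model in the Figure 4 caption, $I_{\text{all}} = \alpha (d + d_{0})^{-n}$, is linear in $\alpha$ once $d_0$ and $n$ are known, so a least-squares fit has a closed form. The Python sketch below is only an illustration under that reading: the summary does not say whether the authors fit in linear or log space, and the function names, the values of `D0` and `N`, and the synthetic readings are all hypothetical. The second function inverts the same model to recover a distance from a current reading, which is how Figure 1 describes the estimated reflectance being used.

```python
def fit_reflectance(distances, currents, d0, n):
    """Least-squares alpha for I = alpha * (d + d0) ** (-n), with d0 and n known.

    With x_i = (d_i + d0) ** (-n), the model is I_i = alpha * x_i, so the
    minimizer of sum((I_i - alpha * x_i) ** 2) is sum(x_i * I_i) / sum(x_i ** 2).
    """
    basis = [(d + d0) ** (-n) for d in distances]
    numerator = sum(x * i for x, i in zip(basis, currents))
    denominator = sum(x * x for x in basis)
    return numerator / denominator


def distance_from_current(current, alpha, d0, n):
    """Invert the same model: d = (alpha / I) ** (1 / n) - d0."""
    return (alpha / current) ** (1.0 / n) - d0


# Hypothetical intrinsic parameters and reflectance (not taken from the paper).
D0, N = 5.0, 2.0
TRUE_ALPHA = 1200.0

# Synthetic, noise-free readings of one object placed at several known distances.
distances = [10.0, 20.0, 30.0, 40.0]
currents = [TRUE_ALPHA * (d + D0) ** (-N) for d in distances]

alpha_hat = fit_reflectance(distances, currents, D0, N)
d_hat = distance_from_current(currents[1], alpha_hat, D0, N)
print(alpha_hat, d_hat)
```

With real, noisy readings the fitted value would only approximate the true reflectance, and a fit in log space would weight distant measurements differently from this linear fit.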