Exp-Force: Experience-Conditioned Pre-Grasp Force Selection with Vision-Language Models

Siqi Shang; Minchao Huang; Bill Fan; Lillian Chin

TL;DR

Exp-Force is an experience-conditioned framework that predicts the minimum feasible grasping force from a single RGB image. By conditioning on retrieved prior interaction experiences, it makes pre-grasp force selection reliable and generalizable.

Abstract

Accurate pre-contact grasp force selection is critical for safe and reliable robotic manipulation. Adaptive controllers regulate force after contact but still require a reasonable initial estimate. Starting a grasp with too little force requires reactive adjustment, while starting with too much force risks damaging fragile objects. This trade-off is particularly challenging for compliant grippers, whose contact mechanics are difficult to model analytically. We propose Exp-Force, an experience-conditioned framework that predicts the minimum feasible grasping force from a single RGB image. The method retrieves a small set of relevant prior grasping experiences and conditions a vision-language model on these examples for in-context inference, without analytic contact models or manually designed heuristics. On 129 object instances, Exp-Force achieves a best-case mean absolute error (MAE) of 0.43 N, a 72% reduction in error relative to zero-shot inference. In real-world tests on 30 unseen objects, it improves the appropriate force selection rate from 63% to 87%. These results demonstrate that Exp-Force enables reliable and generalizable pre-grasp force selection by leveraging prior interaction experiences. Project website: http://expforcesubmission.github.io/Exp-Force-Website/
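
The retrieval step in the abstract (described in the Figure 2 caption as a cosine-similarity search over an Experience Pool, followed by top-$k$ selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Experience` class, the function names, the prompt wording, and the three-dimensional toy embeddings are all hypothetical, and the object image is omitted from each example for brevity. Only the record fields (name, mass, description, ground-truth force) follow the caption.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Experience:
    """One experience-pool entry; fields follow the Figure 2 caption (image omitted)."""
    name: str
    mass_kg: float
    description: str
    embedding: Sequence[float]  # hypothetical precomputed embedding z
    force_n: float              # ground-truth minimum feasible force F*


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float], pool: List[Experience], k: int
) -> List[Tuple[float, Experience]]:
    """Return the k pool entries most similar to the query, best first."""
    scored = [(cosine_similarity(query, e.embedding), e) for e in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(examples: List[Tuple[float, Experience]], target_description: str) -> str:
    """Assemble an in-context prompt (hypothetical wording) from retrieved examples."""
    lines = ["Prior grasping experiences:"]
    for _, e in examples:
        lines.append(f"- {e.name} ({e.mass_kg:.3f} kg): {e.description} -> {e.force_n:.2f} N")
    lines.append(f"Target object: {target_description}")
    lines.append("Estimate the minimum feasible grasping force in newtons.")
    return "\n".join(lines)


# Toy pool with made-up values, purely for illustration (not data from the paper).
pool = [
    Experience("paper cup", 0.010, "thin, deformable", [1.0, 0.0, 0.1], 0.5),
    Experience("steel mug", 0.350, "rigid, heavy", [0.0, 1.0, 0.2], 4.0),
    Experience("foam block", 0.020, "soft, light", [0.7, 0.3, 0.0], 0.6),
]
query_embedding = [1.0, 0.05, 0.05]  # hypothetical embedding of the target object
top = retrieve_top_k(query_embedding, pool, k=2)
print(build_prompt(top, "empty plastic cup, light and easily crushed"))
```

The resulting prompt would then be passed, together with the target image, to the predictor VLM. A linear scan is used here for clarity; a larger pool might call for a vector index instead.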

Paper Structure (16 sections, 7 equations, 7 figures, 1 table)

Introduction
Methods
Object Description Generation
Experience Retrieval
Experience-Conditioned Force Inference
Experimental Setup
Experience Pool Data Collection
Offline Experiments
Real World Experiments
Offline Experiment Results
Comparison to Zero-Shot VLMs
Comparison to Supervised Baselines
Ablation Studies
In-context Sample Efficiency and Diminishing Returns
Real Robot Experiment Results
...and 1 more section

Figures (7)

Figure 1: Overview. Exp-Force uses VLMs to predict appropriate grasp forces from a single object image, prior to grasp contact. We do this by providing a VLM with a small set of related experiences for in-context learning. When grasping an opened Cheez-It box, a zero-shot VLM overestimates the grasp force by 2.75x, crushing the box, while Exp-Force can lift the box without damaging it.
Figure 2: Exp-Force architecture. A single target object image $I_o$ from the wrist camera is first fed into the descriptor VLM to generate a text description $T_{o}$ of the object's physical properties related to robot grasping. Together with the object image, the text description is fed into the embedding model to obtain the object embedding $z_{o}$. The embedding is used to perform a cosine-similarity search against the Experience Pool. The top-$k$ examples are retrieved and supplied to the predictor VLM as in-context examples for estimating the grasping force $\hat{F}_o$. Each example contains the object name, mass, description, image, and ground-truth $F^\star$.
Figure 3: Evaluation objects, categories, and model performance across categories. (A) Representative objects used in our experiments, categorized by geometry, fragility, and weight. All scale bars are 20 cm. (B) Object weight versus minimum feasible grasping force $F^\star$. Dashed lines indicate bounds for the coefficient of friction $\mu$ that the zero-shot model uses in its force estimation. (C) Comparison of the mean absolute error between zero-shot inference and Exp-Force in each of the six object categories, across three vision-language models (GPT-5.2, Gemini-3-Flash, and Gemini-3.1-Pro). For each model, we chose the best $k$ value from our in-context sample efficiency ablation (Sec. \ref{sec:k-sweep}). The total count of objects within each category is also reported.
Figure 4: Real-world experiment setup. (A) The compliant tactile two-finger gripper with a wrist-mounted camera used for pre-grasp image capture and force execution. (B) Five selected objects from each of the six categories. All scale bars are 10 cm.
Figure 5: Comparison of force prediction errors across three vision-language models (GPT-5.2, Gemini-3-Flash, and Gemini-3.1-Pro). Lower values correspond to more accurate force estimation. For the multimodal baseline, we use Qwen3-VL-Embedding (denoted as Qwen). For each model, distribution bars summarize the error statistics over five-fold cross-validation, where the thin whiskers denote one standard deviation.
...and 2 more figures
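
The Figure 3(B) caption refers to friction-coefficient bounds used by the zero-shot model but does not give the underlying formula. As a hedged illustration only, a textbook Coulomb-friction estimate for an idealized grasp with two opposing contacts (an assumption made here, not a statement from the paper) relates the normal force $F$ to the object mass $m$, gravitational acceleration $g$, and friction coefficient $\mu$:

$$
2\mu F \ge m g \quad\Longrightarrow\quad F_{\min} = \frac{m g}{2\mu}
$$

Under this assumption, bounds $\mu_{\mathrm{lo}} \le \mu \le \mu_{\mathrm{hi}}$ would appear as two straight lines through the origin in the weight-force plane, $\frac{m g}{2\mu_{\mathrm{hi}}} \le F_{\min} \le \frac{m g}{2\mu_{\mathrm{lo}}}$, which is one plausible reading of the dashed lines in panel (B).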
