Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation

Kevin Yuchen Ma, Heng Zhang, Weisi Lin, Mike Zheng Shou, Yan Wu

TL;DR

Semantic-Contact Fields (SCFields) is a unified 3D representation that fuses visual semantics with dense contact estimates, enabling physical generalization to unseen tools and robust execution of contact-rich tool manipulation tasks.

Abstract

Generalizing tool manipulation requires both semantic planning and precise physical control. Modern generalist robot policies, such as Vision-Language-Action (VLA) models, often lack the high-fidelity physical grounding required for contact-rich tool manipulation. Conversely, existing contact-aware policies that leverage tactile or haptic sensing are typically instance-specific and fail to generalize across diverse tool geometries. Bridging this gap requires learning unified contact representations from diverse data, yet a fundamental barrier remains: diverse real-world tactile data are prohibitively expensive to collect at scale, while direct zero-shot sim-to-real transfer is challenging because soft sensors deform with complex, nonlinear dynamics. To address this, we propose Semantic-Contact Fields (SCFields), a unified 3D representation fusing visual semantics with dense contact estimates. We enable this via a two-stage Sim-to-Real Contact Learning Pipeline: first, we pre-train on a large simulation dataset to learn general contact physics; second, we fine-tune on a small set of real data, pseudo-labeled via geometric heuristics and force optimization, to align sensor characteristics. This allows physical generalization to unseen tools. We use SCFields as the dense observation input to a diffusion policy to enable robust execution of contact-rich tool manipulation tasks. Experiments on scraping, crayon drawing, and peeling demonstrate robust category-level generalization, significantly outperforming vision-only and raw-tactile baselines.
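To make the idea of "a unified 3D representation fusing visual semantics with dense contact estimates" concrete, the following is a minimal sketch of one possible data layout, not the authors' implementation. The names `FieldPoint`, `fuse_fields`, and `to_observation` are hypothetical, and the per-point layout (position, semantic feature, contact force) is an assumption based only on the description in the abstract.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class FieldPoint:
    """One point of a hypothetical semantic-contact field (assumed layout)."""
    position: Vec3                # 3D location of the point
    semantic: Tuple[float, ...]   # per-point semantic feature, e.g. from a visual model
    contact_force: Vec3           # estimated contact force; zeros where there is no contact


def fuse_fields(
    positions: Sequence[Sequence[float]],
    semantics: Sequence[Sequence[float]],
    forces: Sequence[Sequence[float]],
) -> List[FieldPoint]:
    """Attach a semantic feature and a contact force to every point."""
    if not (len(positions) == len(semantics) == len(forces)):
        raise ValueError("all per-point inputs must have the same length")
    return [
        FieldPoint(tuple(p), tuple(s), tuple(f))
        for p, s, f in zip(positions, semantics, forces)
    ]


def to_observation(field: Sequence[FieldPoint]) -> List[List[float]]:
    """Flatten each point to [x, y, z, semantic..., fx, fy, fz]."""
    return [[*pt.position, *pt.semantic, *pt.contact_force] for pt in field]
```

In this sketch, a downstream policy would consume the flattened rows as its dense observation; how the real system encodes and batches the field is not specified here.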

Paper Structure (40 sections, 9 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Semantic-Contact Fields (SCFields) Overview. 1. Multimodal Inputs: The system takes RGB-D observations and tactile readings from GelSight sensors. 2. SCFields Generation: Our unified perception module fuses these inputs into a dense point cloud representation containing both category-level semantics (blue/green heatmap) and extrinsic contact force vectors (green arrows). 3. Policy Execution: A diffusion policy conditioned on the SCFields enables zero-shot generalization to novel tool variants (e.g., peelers of different shapes) in contact-rich tasks by reasoning about functional affordance and contact forces simultaneously.
  • Figure 2: Method Overview. Left: Contact Field Learning. Stage 1 learns general geometry and contact physics from simulated data; Stage 2 aligns the sensor domain using pseudo-labeled real data. Right: Policy Learning. A Diffusion Policy conditioned on the combined SCFields is trained to achieve robust tool manipulation.
  • Figure 3: Contact field model architecture. The network fuses tactile markers and force arrays with dense object geometry in a unified point cloud input to predict contact fields.
  • Figure 4: Left: Real-robot experiment setup. We use a Franka Emika Panda robot with two GelSight Mini tactile sensors mounted on the gripper fingers and three RealSense D435 cameras to capture RGB-D observations. Right: Training and testing tools.
  • Figure 5: Qualitative comparison of contact field estimation on the peeler. Left: Ours (clean contact forces on the blade). Middle: Sim-Only model (missing forces). Right: Real-Only model (noisy forces). Bottom right: Signal correlation plot of torque from predicted contact force (Ours, Sim-Only, Real-Only) vs. reference wrench from the tactile signal. Ours aligns best with the reference tactile wrench, while Real-Only is noisy even when there is no contact.
  • ...and 7 more figures
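
The signal correlation plot described for Figure 5 compares torque derived from predicted contact forces against a reference wrench. The sketch below illustrates one plausible way to compute such a comparison; it is not the paper's evaluation code. It assumes the net torque about a reference point is the sum of r × f over contact points and that agreement is scored with a Pearson correlation over time. The function names `net_torque` and `pearson` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    """Cross product a x b of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def net_torque(
    points: Sequence[Sequence[float]],
    forces: Sequence[Sequence[float]],
    origin: Sequence[float],
) -> Vec3:
    """Sum of (p - origin) x f over all contact points (assumed torque model)."""
    tx = ty = tz = 0.0
    for p, f in zip(points, forces):
        r = (p[0] - origin[0], p[1] - origin[1], p[2] - origin[2])
        cx, cy, cz = cross(r, f)
        tx, ty, tz = tx + cx, ty + cy, tz + cz
    return (tx, ty, tz)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)
```

Under these assumptions, one torque component from `net_torque` would be collected per time step and passed to `pearson` together with the matching component of the reference wrench; a value near 1 would indicate close alignment.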