Towards Forceful Robotic Foundation Models: a Literature Survey

William Xie, Nikolaus Correll

TL;DR

This survey analyzes how force and tactile sensing are integrated into end-to-end robot policy learning, arguing that current robot foundation models largely rely on vision and position control and may miss critical dexterity without force data. It surveys 25 works using diffusion and transformer architectures to learn tactile policies, examining data collection, action spaces, and representation learning, and highlighting how force representations can improve manipulation, especially in contact-rich tasks. Key findings include the predominance of GelSight visuotactile sensing, the central role of teleoperation for data collection, and mixed evidence on whether explicit force inputs are always required, even though explicit force control can yield substantial gains in performance and robustness. The authors emphasize the need for scalable tactile data, force-inclusive pretraining, and careful consideration of when explicit force representations are necessary, framing a path toward tactile robot foundation models capable of handling highly dynamic, contact-rich manipulation in real-world settings.

Abstract

This article reviews contemporary methods for integrating force, including both proprioception and tactile sensing, in robot manipulation policy learning. We conduct a comparative analysis of various approaches to force sensing, data collection, behavior cloning, tactile representation learning, and low-level robot control. From our analysis, we articulate when and why forces are needed, and highlight opportunities to improve learning of contact-rich, generalist robot policies on the path toward highly capable touch-based robot foundation models. We generally find that, aside from a few tasks such as pouring, peg-in-hole insertion, and handling delicate objects, the performance of imitation learning models is not at a level of dynamics where force truly matters. Also, force and touch are abstract quantities that can be inferred through a wide range of modalities and are often measured and controlled implicitly. We hope that juxtaposing the different approaches currently in use will help the reader gain a systemic understanding and help inspire the next generation of robot foundation models.

Paper Structure

This paper contains 23 sections, 3 equations, 9 figures.

Figures (9)

  • Figure 1: A: Differential tactile data during a sequence of grasping events: gripper aperture (top row), pressure sensing in the left and right fingertips (second row), the derivative of the pressure signal (third row), and accelerometer readings at the wrist (fourth row), from patel2017improving. B: Force (top) and torque (bottom) data over time during a successful bearing insertion, from watson2020autonomous. C: High-resolution tactile information from a GelSight sensor, from calandra2017feeling.
  • Figure 2: Touch sensing can be represented as forces at scales ranging from fine-grained fingers to the whole robot arm.
  • Figure 3: We plot force magnitude against task duration for 64 tasks (of which 53 are unique) across 25 papers implementing tactile robot policies.
  • Figure 4: On the same force-time axes, we describe the 64 tasks learned by the 25 papers, with 53 unique tasks.
  • Figure 5: Across the reviewed papers, we group touch or tactile sensors into six categories: audio, force, or optical (visuotactile) sensing at the end effector "fingers," joint torque sensing along the whole robot arm, combined sensing from the end effector and joint torques, and wrist F/T sensing.
  • ...and 4 more figures