Gentle Object Retraction in Dense Clutter Using Multimodal Force Sensing and Imitation Learning

Dane Brouwer, Joshua Citron, Heather Nolte, Jeannette Bohg, Mark Cutkosky

TL;DR

This paper tackles the challenge of safely retracting objects from densely cluttered environments by embracing contact rather than avoiding it. It introduces a sensorized end-effector and a diffusion-based imitation-learning framework that fuses eye-in-hand vision, proprioception, non-prehensile tactile sensing, wrench estimates, and suction-pressure signals. Through a force ablation study on 100 demonstrations across highly variable clutter scenes, the authors show that incorporating force modalities reduces excessive-contact events and improves both success rate and speed, with the combination of wrench and tactile sensing delivering the strongest gains (up to 80% over a no-force baseline). The work highlights the value of multimodal sensing for contact-rich manipulation in constrained spaces and suggests future directions in kinesthetic feedback, adaptive gentleness, and broader generalization to unseen objects and environments.

Abstract

Dense collections of movable objects are common in everyday spaces, from cabinets in a home to shelves in a warehouse. Safely retracting objects from such collections is difficult for robots, yet people do it frequently, leveraging learned experience in tandem with vision and non-prehensile tactile sensing on the sides and backs of their hands and arms. We investigate the role of contact force sensing for training robots to gently reach into constrained clutter and extract objects. The available sensing modalities are (1) "eye-in-hand" vision, (2) proprioception, (3) non-prehensile triaxial tactile sensing, (4) contact wrenches estimated from joint torques, and (5) a measure of object acquisition obtained by monitoring the vacuum line of a suction cup. We use imitation learning to train policies from a set of demonstrations on randomly generated scenes, then conduct an ablation study of wrench and tactile information. We evaluate each policy's performance across 40 unseen environment configurations. Policies employing any force sensing show fewer excessive force failures, an increased overall success rate, and faster completion times. The best performance is achieved using both tactile and wrench information, producing an 80% improvement above the baseline without force information.
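
The ablation over wrench and tactile inputs can be pictured as switching two of the five modalities on or off while vision, proprioception, and suction pressure stay fixed. The following is a minimal sketch under stated assumptions, not the authors' implementation: the four condition names, the `AblationConfig` class, the `build_observation` helper, and all feature dimensions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    """Which force modalities a policy may observe (hypothetical helper)."""
    use_wrench: bool
    use_tactile: bool


# Assumed four-way ablation: no force sensing, each modality alone, and both.
ABLATIONS = {
    "no_force": AblationConfig(use_wrench=False, use_tactile=False),
    "wrench_only": AblationConfig(use_wrench=True, use_tactile=False),
    "tactile_only": AblationConfig(use_wrench=False, use_tactile=True),
    "wrench_and_tactile": AblationConfig(use_wrench=True, use_tactile=True),
}


def build_observation(config, camera_features, proprioception, pressure,
                      wrench=None, tactile_features=None):
    """Concatenate the modalities enabled by `config` into one flat list.

    Vision features, proprioception, and suction pressure are always included;
    wrench and tactile features are appended only when the config enables them.
    """
    observation = list(camera_features) + list(proprioception) + [pressure]
    if config.use_wrench:
        if wrench is None:
            raise ValueError("config enables wrench but none was provided")
        observation += list(wrench)
    if config.use_tactile:
        if tactile_features is None:
            raise ValueError("config enables tactile but none was provided")
        observation += list(tactile_features)
    return observation
```

For example, with a 6-element wrench and 4 tactile features (both sizes are placeholders), `build_observation(ABLATIONS["no_force"], ...)` ignores both force inputs, while `ABLATIONS["wrench_and_tactile"]` appends all 10 extra values.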

Paper Structure

This paper contains 14 sections, 3 equations, 7 figures.

Figures (7)

  • Figure 1: a) An end-effector equipped with soft, triaxial tactile sensors, a suction cup, and a camera. b) The "eye-in-hand" camera view provided as observations to a robot reaching in dense clutter to acquire a target. c) Overhead view (not available to the robot) of the scene in (b). The robot's initial pose (blue arrow) causes objects to jam against the left wall, producing a large contact force (red arrow). The action sequence moves the robot to a new pose (green arrow) that is out of contact while still approaching the target.
  • Figure 2: a) Images of the left and right side of the end-effector, along with tactile array coordinate frames, indicating how tactile visualizations map to locations on the end-effector. b) Distributed triaxial force information represented with normal force proportional to circle diameter and shear forces proportional to arrow magnitude and direction. c) Corresponding tactile image with x, y, and z forces mapped to B, G, and R channels, respectively.
  • Figure 3: Our force-informed diffusion policy network processes both input images ("eye-in-hand" camera and tactile), each through its own pre-trained ResNet-18 encoder. The output features are directly concatenated with normalized low-dimensional sensor modalities: wrench, proprioception, and pressure. This forms one observation, which is combined with previous observations and passed into a diffusion head to perform action prediction.
  • Figure 4: a) Schematic of randomly generated configuration of cluttered objects with corresponding b) overhead view of the physical configuration, c) external view of the scene, and d) "eye-in-hand" camera view provided to the robot. Each scene has 25-28 obstacles of 4 types (blue, green, black, and yellow) and one of 3 possible red target objects.
  • Figure 5: a) Demonstrator using a spacemouse to control robot motion while observing b) visual, tactile, excessive force warnings, and pressure information provided on screen. The contact on the right side of the end-effector, which is unseen in the camera view, is causing a large peak force as indicated by the red rectangle. The demonstrator does not have an external view of the physical scene.
  • ...and 2 more figures
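
The tactile image encoding described in the Figure 2 caption (x, y, and z forces mapped to the B, G, and R channels) can be sketched as follows. This is an illustrative sketch only: the symmetric scaling, the clipping, the `max_force` default of 5.0, and both function names are assumptions rather than details from the paper.

```python
def force_to_channel(force, max_force):
    """Scale one force component from [-max_force, max_force] to 0..255.

    Assumption: zero force maps to mid-range and out-of-range values are clipped.
    """
    clipped = max(-max_force, min(max_force, force))
    return int(round((clipped / max_force + 1.0) * 127.5))


def forces_to_tactile_image(taxel_forces, max_force=5.0):
    """Convert rows of (fx, fy, fz) taxel readings into rows of (B, G, R) pixels.

    The channel order follows the caption: x -> B, y -> G, z -> R.
    """
    return [
        [
            tuple(force_to_channel(f, max_force) for f in (fx, fy, fz))
            for (fx, fy, fz) in row
        ]
        for row in taxel_forces
    ]
```

Encoding the array as an image in this way lets the tactile data pass through the same kind of image encoder as the camera view, which is consistent with the per-image ResNet-18 encoders in the Figure 3 caption.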