IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models

Yiyang Ling, Karan Owalekar, Oluwatobiloba Adesanya, Erdem Bıyık, Daniel Seita

TL;DR

IMPACT tackles robot manipulation in densely cluttered environments by allowing semantically acceptable contact. It uses Vision-Language Models to infer per-object contact costs from scene images and builds from them an anisotropic, directional safety map that guides a planner with three motion primitives (Move, Rotate, Push). A contact-aware A* search then yields trajectories that minimize risk while permitting contact with environmental objects when needed. The approach is validated in simulation and real-world experiments, including human judgments. The results show improved success rates, reduced contact duration, and trajectories that align better with human preferences, which suggests that semantically informed, contact-tolerant planning is practical in dense clutter.

Abstract

Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g. brushing a soft pillow) to more dangerous (e.g. toppling a glass vase), making it difficult to characterize which may be acceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses Vision-Language Models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach generates an anisotropic cost map that encodes directional push safety. We pair this map with a contact-aware A* planner to find stable contact-rich paths. We perform experiments using 20 simulation and 10 real-world scenes and assess using task success rate, object displacements, and feedback from human evaluators. Our results over 3200 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Our project website is available at https://impact-planning.github.io/.
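
To make "anisotropic" concrete, the following is a minimal sketch of a directional push cost. It is not the paper's formulation: the scoring rule, the `reach` parameter, the object positions, and the function name `directional_push_cost` are all assumptions made for illustration. Only the per-object costs (toy bear: 3, wine glass: 8) follow the style of the values shown in Figure 1.

```python
import math

# Hypothetical scene: positions are made up; costs mimic Figure 1's values.
OBJECTS = {
    "toy bear": {"pos": (0.0, 0.0), "cost": 3},
    "wine glass": {"pos": (1.0, 0.0), "cost": 8},
}


def directional_push_cost(name, direction, objects, reach=1.5):
    """Cost of pushing object `name` along the unit vector `direction`.

    Assumed rule (not from the paper): start from the object's own cost,
    then add each neighbor's cost scaled by how directly the push points
    at that neighbor and by how close the neighbor is.
    """
    obj = objects[name]
    total = float(obj["cost"])
    for other_name, other in objects.items():
        if other_name == name:
            continue
        dx = other["pos"][0] - obj["pos"][0]
        dy = other["pos"][1] - obj["pos"][1]
        dist = math.hypot(dx, dy)
        if dist == 0 or dist > reach:
            continue  # coincident or too far away to matter
        # Cosine of the angle between the push direction and the neighbor.
        alignment = (dx * direction[0] + dy * direction[1]) / dist
        if alignment > 0:
            total += other["cost"] * alignment * (1 - dist / reach)
    return total
```

Under this assumed rule, pushing the toy bear toward the wine glass costs more than pushing it away, so the same object carries different costs in different directions.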

Paper Structure

This paper contains 29 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An example of a reaching task and object costs. The first row shows the difference between collision-free paths and paths with acceptable contact. Left: Collision-free paths prevent a “straight” path to the spice jar because of the toy bear and wine glass obstacles (each marked with a red "X"). Right: With semantically acceptable contact, the robot can successfully reach the spice jar by pushing the toy bear and avoiding the fragile wine glass. The second row shows the cost of each object generated by GPT-4o. Left: the original scene. Right: GPT-4o assigns different costs to objects, with the target assigned $-1$ (toy bear: $3$, spice jar: $-1$, wine glass: $8$).
  • Figure 2: Overview of IMPACT. There is a toy bear, a coffee cup, and a glue bottle (target) on the table. The VLM receives an annotated image $I'$ and a language template prompt $\ell$ with object information from SAM2 [ravi2024sam], and outputs costs for the three objects. We use a cost of $-1$ for the target object. We construct a 3D voxel grid $V$ using these costs and then flatten it to produce an anisotropic, contact-aware cost map $M'$. The contact-aware A* planner searches over three motion primitives in this map (Move, Push, and Rotate) to generate a trajectory. The planner's state space includes the robot's end-effector pose and the displaced positions of low-cost objects. These guide the robot to avoid the coffee cup but make contact with the toy bear in the appropriate direction to reach the glue bottle.
  • Figure 3: A* planner decision-making in two key scenarios, where the books and pack of chips are low-cost objects, while the mug and the stack of bowls are high-cost objects. The colored border around low-cost objects visualizes anisotropic costs (red is unsafe, green is safe). On the left, the planner avoids a direct Push towards the stack of bowls (cost $\infty$) and instead chooses a low-cost Rotate to navigate between objects (cost $7.0$). On the right, by planning several steps ahead, it finds an efficient path by rotating and pushing the stack of books (total cost $23.5$). It avoids a simpler but high-cost detour that only considers Move (cost $50.0$).
  • Figure 4: Our user study evaluation website interface. For each question, the human evaluates two videos of robot trajectories without knowing which robotics method produced each motion. For each video pair, they select the video they prefer. To aid comparisons, we enable the users to sync the videos. We also offer a "Cannot Decide" option.
  • Figure 5: Examples of trajectories planned by IMPACT (top row) and LAPP (bottom row) in PyBullet simulation [coumans2019]. The obstacles are: a coke bottle, a pitcher, a sugar box, and a pile of bowls. The target object is the mug behind the obstacles. The planned paths are shown in an overlaid green curve in each image. We also provide LAPP with a language instruction "Can collide with the pitcher and the sugar box." See the simulation results section for more details.
  • ...and 4 more figures
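
The planner behavior described in the Figure 2 and Figure 3 captions can be illustrated with a much-reduced sketch. The code below is plain grid A*, not the paper's planner: it drops the Rotate primitive, does not track displaced object positions in the state, and ignores push direction. The grid layout, the `plan` function, and the rule "step cost = 1 + cell cost" are assumptions for illustration. A cell cost of 0 stands for free space, a finite positive cost for contact with a low-cost object, and infinity for a forbidden contact (as with the stack of bowls in Figure 3).

```python
import heapq
import math

INF = math.inf

# Hypothetical 3x5 scene: the middle column holds two forbidden objects
# and one low-cost object (cost 3) that may be pushed through.
GRID = [
    [0, 0, INF, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 0, INF, 0, 0],
]


def plan(grid, start, goal):
    """A* over grid cells; returns (path, total_cost) or (None, inf)."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible because every step costs >= 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0.0, start)]
    best = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1], g
        if g > best[cell]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            step = 1.0 + grid[nxt[0]][nxt[1]]  # move cost plus contact cost
            if math.isinf(step):
                continue  # contact with this object is never acceptable
            new_g = g + step
            if new_g < best.get(nxt, INF):
                best[nxt] = new_g
                parent[nxt] = cell
                heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, nxt))
    return None, INF
```

Calling `plan(GRID, (1, 0), (1, 4))` routes the path through the low-cost cell, since the other two cells in that column are forbidden. If that cell is also set to `INF`, no path exists and the function returns `(None, inf)`, which mirrors why strictly collision-free planning can fail in clutter.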