Improving Robotic Manipulation Robustness via NICE Scene Surgery

Sajjad Pakdamansavoji, Mozhgan Pourkeshavarz, Adam Sigal, Zhiyuan Li, Rui Heng Yang, Amir Rasouli

TL;DR

Robotic manipulation policies struggle under visual distractors due to distribution shifts. The authors introduce NICE, a data-centric framework that edits real demonstration scenes by removing, restyling, or replacing distractors while preserving the target and action semantics, leveraging tools like Florence-2, SAM-2, LaMa, and diffusion-based inpainting. NICE generates diverse, realistic training variants that close the visual gap and improve downstream tasks such as spatial affordance prediction and manipulation in clutter. Real-world experiments show notable gains in accuracy and safety metrics, demonstrating scalable robustness without additional robot data or simulators.

Abstract

Learning robust visuomotor policies for robotic manipulation remains a challenge in real-world settings, where visual distractors can significantly degrade performance and safety. In this work, we propose an effective and scalable framework, Naturalistic Inpainting for Context Enhancement (NICE). Our method minimizes the out-of-distribution (OOD) gap in imitation learning by increasing visual diversity through the construction of new experiences from existing demonstrations. By utilizing image generative frameworks and large language models, NICE performs three editing operations: object replacement, restyling, and removal of distracting (non-target) objects. These changes preserve spatial relationships without obstructing target objects and maintain action-label consistency. Unlike previous approaches, NICE requires no additional robot data collection, simulator access, or custom model training, making it readily applicable to existing robotic datasets. Using real-world scenes, we showcase the capability of our framework in producing photo-realistic scene enhancement. For downstream tasks, we use NICE data to fine-tune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for object manipulation. Our evaluations show that NICE successfully minimizes OOD gaps, resulting in over 20% improvement in accuracy for affordance prediction in highly cluttered scenes. For manipulation tasks, the success rate increases by 11% on average when testing in environments populated with varying numbers of distractors. Furthermore, we show that our method improves visual robustness, lowering target confusion by 6%, and enhances safety by reducing the collision rate by 7%.

Paper Structure

This paper contains 13 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: An overview of our NICE generative framework. NICE uses the existing robot demonstration data and performs replacement, restyling, or removal operations on distracting objects to generate new experiences.
  • Figure 2: An overview of the NICE framework. The method starts by detecting all objects, identifying the target, and segmenting distracting objects. The object of interest is then selected to perform one of the following operations. Removal dilates the selected mask to cover shadows and feeds it into an inpainting model to fill with background texture. Restyling uses a texture database and applies it to the selected mask to change the appearance of the distractor. Replacement uses a large language model to generate an object description, which is then fed into an image inpainting module to replace the distractor.
  • Figure 3: Examples of the object parsing step: (Left) input raw image, (Middle) object detection results using Florence-2, and (Right) segmentation results using SAM-2.
  • Figure 4: Examples of data enhancement using NICE on the Bridge data (Walke et al., 2023). In each image pair, left is the original image and right is the edited one.
  • Figure 5: Real-world replication of editing operations used to validate the realism of the NICE data. For each series of samples, the scene was populated with multiple objects. Then, one at a time, each object was either removed, replaced with the same object in a different color, or replaced with another object entirely.
  • ...and 4 more figures
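
The removal step in the Figure 2 caption (dilate the selected distractor mask so it also covers shadows, then hand it to an inpainting model) can be illustrated with a minimal sketch of the mask preparation alone. This is not the authors' implementation: the function names `dilate_mask` and `protect_target`, the square structuring element, the `radius` parameter, and the idea of subtracting the target mask from the edit region are all assumptions made for illustration, and the inpainting call itself is omitted.

```python
from typing import List

Mask = List[List[int]]  # 1 = masked pixel, 0 = background


def dilate_mask(mask: Mask, radius: int = 1) -> Mask:
    """Grow a binary mask by `radius` pixels using a square neighbourhood.

    Hypothetical helper: the paper only says the mask is dilated to cover
    shadows; the kernel shape and size here are assumptions.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    dilated = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                # Mark every in-bounds pixel within `radius` of (y, x).
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        dilated[ny][nx] = 1
    return dilated


def protect_target(edit_mask: Mask, target_mask: Mask) -> Mask:
    """Clear edit pixels that overlap the target (an assumed safeguard)."""
    return [
        [0 if t else e for e, t in zip(edit_row, target_row)]
        for edit_row, target_row in zip(edit_mask, target_mask)
    ]


# Toy 5x5 scene: one distractor pixel next to a two-pixel target.
distractor = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
target = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Region that would be passed to an inpainting model (not shown here).
edit_region = protect_target(dilate_mask(distractor, radius=1), target)
```

In this toy scene the dilated mask grows from one pixel to a 3x3 block, and the two pixels that overlap the target are cleared so that only distractor and background pixels would be regenerated. In practice a library routine such as a morphological dilation on image arrays would replace the nested loops.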