Distracted Robot: How Visual Clutter Undermine Robotic Manipulation

Amir Rasouli, Montgomery Alban, Sajjad Pakdamansavoji, Zhiyuan Li, Zhanguang Zhang, Aaron Wu, Xuan Zhao

TL;DR

This study introduces a psychophysical evaluation framework for robotic manipulation under clutter, grounded in a unified Dual-view Feature Congestion (DvFC) clutter metric. It systematically generates diverse cluttered scenarios in SIMPLER and real-world settings to assess five vision-language-action policies across six manipulation tasks, revealing that clutter materially degrades performance and that policies exhibit complementary strengths and vulnerabilities. The authors demonstrate that DvFC tracks performance degradation and that targeted data augmentation can improve robustness, though gains are not universal. Overall, the work highlights the need for more robust clutter-handling strategies beyond data scaling in practical robotic manipulation.

Abstract

In this work, we propose an evaluation protocol for examining the performance of robotic manipulation policies in cluttered scenes. Unlike prior work, we approach evaluation from a psychophysical perspective and therefore use a unified measure of clutter that accounts for environmental factors as well as the quantity, characteristics, and arrangement of distractors. Using this measure, we systematically construct evaluation scenarios in both hyper-realistic simulation and the real world, and conduct extensive experiments on manipulation policies, in particular vision-language-action (VLA) models. Our experiments highlight the significant impact of scene clutter, which lowers policy performance by as much as 34%, and show that, despite achieving similar average performance across tasks, different VLA policies have unique vulnerabilities and relatively low agreement on the scenarios in which they succeed. We further show that our clutter measure is an effective indicator of performance degradation and analyze the impact of distractors in terms of their quantity and occluding influence. Finally, we show that finetuning on enhanced data, although effective, does not equally remedy all negative impacts of clutter on performance.

Paper Structure

This paper contains 12 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: A typical manipulation scenario where the robot is asked to fetch an object (e.g. an apple). The green trajectory shows the expected behavior and black shows the executed one. Distractors in the scene can cause target confusion (e.g. with the orange), as well as collision and grasping failure.
  • Figure 2: Examples of various synthetic (top) and real (bottom) cluttered scenes with corresponding DvFC values.
  • Figure 3: Per-task success rate of the policies. Each axis of the radar diagram shows one of the 6 core tasks.
  • Figure 4: The Venn diagram of the success scenarios of the policies. The numbers show the percentage of total success scenarios combined for the policies.
  • Figure 5: (left) Percentage of failures for the policies. Values for each policy are normalized to sum to 1. Lower values are better. (right) Qualitative examples of failure cases. Targets are identified with red ovals.
  • ...and 5 more figures
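
The Figure 4 caption describes Venn-region values as percentages of the combined success scenarios of the policies. A minimal sketch of how such percentages could be computed is shown below. It assumes each policy's successes are available as a set of scenario identifiers; the function name, policy names, and scenario ids are hypothetical and are not taken from the paper.

```python
from collections import Counter


def venn_region_percentages(successes):
    """Return the share (in percent) of each Venn region.

    `successes` maps a policy name to the set of scenario ids it solved
    (a hypothetical input format). A region is identified by the exact
    set of policies that succeeded on a scenario; percentages are taken
    over the union of all success scenarios.
    """
    union = set().union(*successes.values())
    if not union:
        return {}
    regions = Counter()
    for scenario in union:
        # Exact group of policies that succeeded on this scenario.
        key = frozenset(p for p, solved in successes.items() if scenario in solved)
        regions[key] += 1
    return {key: 100.0 * count / len(union) for key, count in regions.items()}


# Illustrative data only; not results from the paper.
example = {
    "policy_a": {1, 2, 3, 4},
    "policy_b": {3, 4, 5},
    "policy_c": {4, 6},
}
shares = venn_region_percentages(example)
for policies, pct in sorted(shares.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(policies), round(pct, 1))
```

Under this reading, a small share for the region containing all policies would correspond to the low agreement on success scenarios reported in the abstract.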