HSC-VLA: Hierarchical Scene-Clearing for Robust Bimanual Manipulation in Dense Clutter

Zhen Liu, Xinyu Ning, Zhe Hu, XinXin Xie, Yitong Liu, Zhongzhu Pu

TL;DR

HSC-VLA is a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution through an explicit scene-clearing abstraction. It shows strong robustness and effective failure recovery in complex cluttered manipulation.

Abstract

Modern Vision-Language-Action models often suffer from critical instruction-following failures in high-density manipulation environments, where task-irrelevant visual clutter dilutes attention, corrupts grounding, and substantially degrades performance in complex long-horizon scenarios. To overcome the representation bottleneck of monolithic end-to-end architectures, we propose HSC-VLA, a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution through an explicit scene-clearing abstraction. HSC-VLA employs a high-level Brain to decompose long-horizon tasks and to generate task-specific scene masks that preserve task-relevant geometry while suppressing distractors. The filtered observations are then passed to a low-level Cerebellum, a diffusion-based policy that performs bimanual manipulation using only mask-filtered vision and proprioception. Extensive experiments on densely cluttered supermarket shelves demonstrate that HSC-VLA achieves 86.7% aggregate success under high-density clutter, surpassing the best monolithic baseline ($π_0$-Full FT at 34.3%) by 52.4 percentage points. HSC-VLA also performs well on long-horizon tasks, reaching 72% on clutter sorting and 66% on restocking, which indicates robustness and effective failure recovery in complex cluttered manipulation.
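As a rough illustration of the scene-clearing idea described in the abstract, the sketch below applies a binary task mask to an RGB observation so that only task-relevant pixels reach the downstream policy. It is a minimal sketch, not the authors' implementation: the function name `apply_scene_mask`, the `fill_value` parameter, and the choice to zero out distractor pixels are all assumptions made for illustration, since the abstract does not say how suppressed regions are filled.

```python
import numpy as np


def apply_scene_mask(rgb, mask, fill_value=0):
    """Keep pixels where `mask` is True and overwrite the rest.

    Hypothetical helper: `rgb` is an (H, W, 3) image and `mask` is an
    (H, W) boolean array marking task-relevant pixels. Filling the
    suppressed pixels with a constant is an assumption of this sketch.
    """
    if rgb.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")
    filtered = np.full_like(rgb, fill_value)
    filtered[mask] = rgb[mask]  # copy only the task-relevant pixels
    return filtered


# Toy 4x4 observation with a 2x2 task-relevant region in the centre.
rgb = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

filtered = apply_scene_mask(rgb, mask)
```

In this toy setup, `filtered` matches `rgb` inside the central 2x2 region and is zero elsewhere. A policy consuming `filtered` together with proprioception would never see the distractor pixels, which is the decoupling the abstract describes.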

Paper Structure (18 sections, 12 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: How VLM Improves Perception and Planning in Embodied AI. [Upper Row]: VLM-assisted segmentation reduces visual ambiguity by masking irrelevant objects, enabling precise target identification in cluttered environments. [Lower Row]: VLM-assisted planning decomposes high-level, long-horizon instructions into clear, executable steps (Identify, Grasp, Place, and Navigate). The VLM Brain functions as both a semantic filter for perception and a structured reasoner for decision-making.
  • Figure 2: Overview of HSC-VLA. Our framework decomposes a high-level natural language goal into a sequence of executable subgoals via a Hierarchical Planner. These subgoals interface with a Tool Library and a vision-based pipeline, which integrates SAM/Cutie for segmentation and a Pretrained VLM, to provide multimodal instructions to an Action Expert for precise robot execution.
  • Figure 3: Robot system used in real-world evaluations.
  • Figure 4: Visualization of scene filtering during task execution. Left: masked observation used by the policy. Right: attention evolution in simulation and real-world trials. Together, these results show that the proposed masking mechanism effectively suppresses irrelevant clutter and improves task-focused perception.