Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation

Sangmim Song, Sarath Kodagoda, Marc Carmichael, Karthick Thiyagarajan

TL;DR

By enforcing strict attribute adherence, Concept-Gated Visual Distillation establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in clutter.

Abstract

Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD parses instructions into safe and distractor sets, then applies a two-layer target refinement process (combining cross-validation and spatial disambiguation) to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline's 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in clutter.

Paper Structure (22 sections, 5 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Comparison of manipulation task execution in cluttered environments. While a standard VLA model (left) struggles with object confusion in a highly cluttered scene, our CGVD approach (right) successfully identifies and places the target object ("spoon") on the towel.
  • Figure 2: Overview of the CGVD pipeline. Stage 1: The language instruction is parsed to extract a safe set and a distractor set. Stage 2: SAM3 segments both sets independently, producing a safe-set mask and a distractor mask via dual-channel segmentation. Stage 3: Set-theoretic gating subtracts the safe-set mask from the distractor mask, and LaMa inpaints the resulting regions to produce a distilled observation passed to the VLA policy.
  • Figure 3: Success rate vs. number of distractors. Left: semantic distractors. Right: random distractors. Top: spoon on towel. Bottom: carrot on plate. Dashed lines represent the baseline VLA; solid lines represent +CGVD. Colors denote specific model architectures. For statistical reliability, each data point is the average success rate over 200 independent evaluation rollouts (20 episodes $\times$ 10 random seeds), totaling 19,200 episodes for the results visualized in this figure.
  • Figure 4: Qualitative Analysis of Attention Repair. (Top) The baseline policy suffers from attention dispersion, focusing on the distractors rather than the spoon. (Bottom) Our CGVD method inpaints the distractors, forcing the attention mechanism to collapse onto the true target.
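
The set-theoretic gating step named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the two masks are boolean arrays of the same shape; the function name `gate_masks` and the toy 4x4 masks are hypothetical and not taken from the paper, and the segmentation (SAM3) and inpainting (LaMa) stages are not shown.

```python
import numpy as np


def gate_masks(distractor_mask: np.ndarray, safe_mask: np.ndarray) -> np.ndarray:
    """Hypothetical helper: return the region to inpaint.

    Subtracts the safe-set mask from the distractor mask, so any pixel
    claimed by the safe set is never handed to the inpainter.
    """
    distractor = distractor_mask.astype(bool)
    safe = safe_mask.astype(bool)
    if distractor.shape != safe.shape:
        raise ValueError("masks must have the same shape")
    # Set difference on pixel sets: distractor AND NOT safe.
    return distractor & ~safe


# Toy masks (assumed values, for illustration only).
distractor = np.zeros((4, 4), dtype=bool)
distractor[1:3, 0:3] = True  # 6 distractor pixels

safe = np.zeros((4, 4), dtype=bool)
safe[1:3, 2:4] = True  # 4 safe pixels, 2 of which overlap the distractor

inpaint = gate_masks(distractor, safe)
print(int(inpaint.sum()))  # 4 pixels remain to be inpainted
```

In this sketch the safe set takes priority wherever the two masks overlap, so an imprecise distractor mask that spills onto the target cannot erase it from the distilled observation.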