Choose What to Observe: Task-Aware Semantic-Geometric Representations for Visuomotor Policy

Haoran Ding, Liang Ma, Yaxun Yang, Wen Yang, Tianyu Liu, Anqing Duan, Xiaodan Liang, Dezhen Song, Ivan Laptev, Yoshihiko Nakamura

TL;DR

A task-aware observation interface is proposed that canonicalizes visual input into a shared representation, improving robustness to out-of-distribution (OOD) appearance changes without modifying or fine-tuning the policy.

Abstract

Visuomotor policies learned from demonstrations often overfit to nuisance visual factors in raw RGB observations, resulting in brittle behavior under appearance shifts such as background changes and object recoloring. We propose a task-aware observation interface that canonicalizes visual input into a shared representation, improving robustness to out-of-distribution (OOD) appearance changes without modifying or fine-tuning the policy. Given an RGB image and an open-vocabulary specification of task-relevant entities, we use SAM3 to segment the target object and robot/gripper. We construct an L0 observation by repainting segmented entities with predefined semantic colors on a constant background. For tasks requiring stronger geometric cues, we further inject monocular depth from Depth Anything 3 into the segmented regions via depth-guided overwrite, yielding a unified semantic–geometric observation (L1) that remains a standard 3-channel, image-like input. We evaluate on RoboMimic (Lift), ManiSkill YCB grasping under clutter, four RLBench tasks under controlled appearance shifts, and two real-world Franka tasks (ReachX and CloseCabinet). Across benchmarks and policy backbones (Flow Matching Policy and SmolVLA), our interface preserves in-distribution performance while substantially improving robustness under OOD visual shifts.
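As an illustration of the L0 and L1 constructions described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the segmentation masks and the monocular depth map have already been computed (the SAM3 and Depth Anything 3 calls are not shown). The function names `build_l0` and `build_l1`, the color palette, the rule that the target is painted last, the min-max normalization over the masked region, and the replication of depth across all three channels are all assumptions made for this example; the abstract only states that semantic colors are predefined and that depth overwrites the segmented regions.

```python
import numpy as np

# Hypothetical palette: the source only says "predefined semantic colors".
BACKGROUND = (0, 0, 0)
ROBOT_COLOR = (255, 0, 0)
TARGET_COLOR = (0, 255, 0)


def build_l0(robot_mask: np.ndarray, target_mask: np.ndarray) -> np.ndarray:
    """Render a label-colored image on a constant background (L0-style)."""
    h, w = robot_mask.shape
    obs = np.empty((h, w, 3), dtype=np.uint8)
    obs[:] = BACKGROUND
    obs[robot_mask] = ROBOT_COLOR
    # Assumption: the target is painted last, so it wins where masks overlap.
    obs[target_mask] = TARGET_COLOR
    return obs


def build_l1(l0: np.ndarray, depth: np.ndarray,
             robot_mask: np.ndarray, target_mask: np.ndarray,
             eps: float = 1e-6) -> np.ndarray:
    """Overwrite masked pixels of an L0 image with normalized depth (L1-style)."""
    fg = robot_mask | target_mask
    out = l0.copy()
    if not fg.any():
        return out  # nothing segmented: fall back to the L0 image
    d = depth.astype(np.float32)
    # Assumption: min-max normalization over the masked region only.
    lo, hi = d[fg].min(), d[fg].max()
    norm = (d - lo) / max(float(hi - lo), eps)
    gray = np.clip(norm * 255.0, 0, 255).astype(np.uint8)
    # Replicate depth across the 3 channels so the output stays image-like.
    out[fg] = gray[fg][:, None]
    return out
```

Both functions return a `uint8` array of shape `(H, W, 3)`, so the result can be passed to an image encoder in place of the raw RGB frame; the background stays constant in both variants.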

Paper Structure (36 sections, 7 equations, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: Overview of our observation interface: Given an RGB frame and open-vocabulary text prompts (robot/gripper, target object), we construct two observation variants. L0 uses SAM3 masks to render a canonical label-colored image (constant background; fixed colors for robot/gripper and target). L1 optionally injects geometry by overwriting the masked regions with normalized monocular depth (Depth Anything 3). Both outputs remain standard 3-channel, image-like inputs for off-the-shelf vision encoders. In our main experiments, we pair the interface with an FMP backbone (1D U-Net unchanged), without modifying the policy architecture.
  • Figure 2: Evaluation environments under controlled appearance shifts. Each row shows the in-distribution (ID) training setting and three held-out test variants (OOD1–3), with task dynamics unchanged. Row 1: RoboMimic Lift under object appearance shifts (OOD-Obj): ID uses the training cube appearance, while OOD1–3 recolor the cube. Row 2: ManiSkill YCB grasping under increasing clutter: ID is the uncluttered training scene; OOD1–3 introduce novel distractor object sets/layouts. Row 3: RLBench tabletop background shifts (OOD-Bg): ID uses the training tabletop appearance, while OOD1–3 change the tabletop color.
  • Figure 3: Real-robot evaluation scenes under controlled appearance shifts. We evaluate two tasks on a Franka arm under one in-distribution condition (ID) and two held-out support-surface/background appearances (OOD1–2). Top row: ReachX (reaching to a target marker). Bottom row: CloseCabinet (closing the cabinet). Across columns, task setup and camera viewpoint are kept fixed; only the visual appearance of the support surface/background is changed.
  • Figure 4: Qualitative comparison of SAM3 segmentation before/after LoRA fine-tuning. Left: wrist-camera RGB observation from an OOD tabletop-color setting (see Fig. 2). Middle: pretrained SAM3 often misses parts of the robot/gripper and produces incomplete masks. Right: our LoRA fine-tuned SAM3 produces accurate robot/gripper and object masks under the same OOD setting.