VCA: Vision-Click-Action Framework for Precise Manipulation of Segmented Objects in Target Ambiguous Environments

Donggeon Kim; Seungwon Jan; Hyeonjun Park; Daegyu Lim

TL;DR

Experimental results validate that the proposed VCA framework achieves effective instance-level manipulation of specified target objects and provides a practical and scalable alternative to language-driven interfaces for real-world robotic manipulation.

Abstract

The reliance on language in Vision-Language-Action (VLA) models introduces ambiguity, cognitive overhead, and difficulties in precise object identification and sequential task execution, particularly in environments with multiple visually similar objects. To address these limitations, we propose Vision-Click-Action (VCA), a framework that replaces verbose textual commands with direct, click-based visual interaction using pretrained segmentation models. By allowing operators to specify target objects clearly through visual selection in the robot's 2D camera view, VCA reduces interpretation errors, lowers cognitive load, and provides a practical and scalable alternative to language-driven interfaces for real-world robotic manipulation. Experimental results validate that the proposed VCA framework achieves effective instance-level manipulation of specified target objects. Experiment videos are available at https://robrosinc.github.io/vca/.
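To illustrate why a click removes the ambiguity that a phrase such as "the red block" leaves open, the following is a minimal sketch, not the authors' implementation. It assumes candidate instance masks are already available as boolean grids, whereas VCA obtains the mask by prompting a pretrained segmentation model with the click. The function name `select_mask_by_click` and the smallest-area tie-break are hypothetical choices made for this example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical representation: mask[row][col] is True where the object is.
Mask = Sequence[Sequence[bool]]


def select_mask_by_click(masks: Sequence[Mask],
                         click_xy: Tuple[int, int]) -> Optional[int]:
    """Return the index of the smallest mask containing the clicked pixel.

    Returns None when the click lands on no mask (e.g. background).
    The smallest-area tie-break is an assumption for overlapping masks.
    """
    x, y = click_xy  # image convention: x is the column, y is the row
    best_index, best_area = None, None
    for index, mask in enumerate(masks):
        # Skip masks that do not cover the clicked coordinate at all.
        if not (0 <= y < len(mask) and 0 <= x < len(mask[y])):
            continue
        if not mask[y][x]:
            continue
        area = sum(1 for row in mask for value in row if value)
        if best_area is None or area < best_area:
            best_index, best_area = index, area
    return best_index


def make_block(rows: int, cols: int, col_start: int, col_end: int) -> Mask:
    """Build a toy mask for a block covering rows 1-2 and the given columns."""
    return [[(1 <= r <= 2) and (col_start <= c <= col_end)
             for c in range(cols)] for r in range(rows)]


# Two visually identical blocks that differ only in position.
left_block = make_block(4, 6, 0, 1)
right_block = make_block(4, 6, 4, 5)
selected = select_mask_by_click([left_block, right_block], (4, 1))
```

In this toy scene a text description cannot distinguish the two blocks, while the click at column 4, row 1 resolves to the right-hand instance directly.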

Paper Structure (12 sections, 1 equation, 7 figures, 4 tables)

INTRODUCTION
Vision Click Action
Overall Architecture of VCA
Real-Time SAM-2 Adaptation
VCA Architecture Details
EXPERIMENTS
Experimental Setup
Block Sorting Task
Tower of Hanoi Task
Generalization Under Visual Distribution Shift
Behavioral Analysis
CONCLUSIONS

Figures (7)

Figure 1: How would you describe the circled objects in this figure—and how long would it take?
Figure 2: Overview of VCA. A user click generates an object mask via real-time SAM-2, which is encoded with multi-view RGB observations and robot proprioception. A transformer encoder fuses these tokens, and a transformer decoder predicts a fixed-length action chunk using learned action queries.
Figure 3: Memory bank update mechanism. Upon a new prompt at timestep t, the corresponding class memory is overwritten, and the memory bank is reset to start from the current frame.
Figure 4: Experimental hardware setup. The system includes two 6-DoF robot arms with grippers, three RGB cameras (head and wrist-mounted), and a monitor-based clicking interface. Data are collected via master arm teleoperation.
Figure 5: Experimental environments for the block sorting task: (a) basic, (b) new color, (c) checkered background, and (d) plaid background.
...and 2 more figures
