Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints

Chen Jiang; Allie Luo; Martin Jagersand

TL;DR

This work tackles perception-to-action in eye-in-hand robot manipulation under language guidance by introducing CLIPU$^2$Net, a compact referring image segmentation model that yields fine-grained saliency with a decoder of only 6.6 MB. Saliency is transformed into actionable commands through geometric constraints (points and lines) within an online visuomotor framework, supported by masked multimodal fusion and a U$^2$Net-based decoder for efficient segmentation. The method is validated on 46 real-world tasks and multiple segmentation benchmarks, showing advantages over labor-intensive feature annotations and existing baselines while maintaining real-time performance. The results suggest that compact, language-conditioned perception with geometric reasoning can robustly drive manipulation across diverse contexts, with potential for further gains through part-aware segmentation and joint learning of segmentation and constraints.

Abstract

In this paper, we perform robot manipulation activities in real-world environments with language contexts by integrating a compact referring image segmentation model into the robot's perception module. First, we propose CLIPU$^2$Net, a lightweight referring image segmentation model designed for fine-grained boundary and structure segmentation from language expressions. Then, we deploy the model in an eye-in-hand visual servoing system to enact robot control in the real world. The key to our system is the representation of salient visual information as geometric constraints, linking the robot's visual perception to actionable commands. Experimental results on 46 real-world robot manipulation tasks demonstrate that our method outperforms traditional visual servoing methods relying on labor-intensive feature annotations, excels in fine-grained referring image segmentation with a compact decoder size of 6.6 MB, and supports robot control across diverse contexts.
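
To make the idea of representing saliency as geometric constraints more concrete, the following is a minimal sketch of how a binary saliency mask might be reduced to a point and a line, and how image-space errors against them could be computed. It is not the paper's implementation: the function names, the PCA-based line fit, and the specific error definitions are all assumptions chosen for illustration.

```python
import numpy as np

def mask_to_point_and_line(mask):
    """Hypothetical helper: reduce a binary mask to a point and a line.

    Returns the mask centroid (x, y) and the principal axis through it as a
    homogeneous line (a, b, c) with a*x + b*y + c = 0 and unit normal (a, b).
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    centroid = pts.mean(axis=0)
    # Principal direction of the salient pixels via the covariance matrix.
    cov = np.cov((pts - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    direction = eigvecs[:, np.argmax(eigvals)]
    # The line normal is perpendicular to the principal direction.
    normal = np.array([-direction[1], direction[0]])
    line = np.array([normal[0], normal[1], -normal @ centroid])
    return centroid, line

def point_to_point_error(p, q):
    """Assumed point constraint: 2D offset that moves point p onto point q."""
    return np.asarray(q, dtype=float) - np.asarray(p, dtype=float)

def point_to_line_error(p, line):
    """Assumed line constraint: signed distance from point p to the line."""
    a, b, c = line
    return (a * p[0] + b * p[1] + c) / np.hypot(a, b)

# Toy example: a horizontal bar of salient pixels at row 10.
mask = np.zeros((20, 20), dtype=np.uint8)
mask[10, 2:18] = 1
centroid, line = mask_to_point_and_line(mask)
tool_tip = (3.0, 14.0)  # hypothetical tracked end-effector point in pixels
print(point_to_point_error(tool_tip, centroid))
print(abs(point_to_line_error(tool_tip, line)))
```

In a visual servoing loop, errors of this kind would be driven toward zero by the controller; how the paper selects and combines constraints per task is described in its methodology section, not here.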

Paper Structure (12 sections, 5 equations, 5 figures, 4 tables)

Introduction
Related Work
Salient Visual Features in Robot Control
Vision-language Models in Robot Control
Methodology
Network Architecture
Geometric Constraints in Salient Vision
Experiments
Experimental Settings
Results on Referring Image Segmentation
Results on Robot Control
Conclusions

Figures (5)

Figure 1: Overview of the system to enact real-world robot control with CLIPU$^2$Net and UIBVS control.
Figure 2: The architecture of CLIPU$^2$Net.
Figure 3: Qualitative results for referring image segmentation.
Figure 4: Results of the predicted geometric constraints and motions for 8 of the 46 assessed tasks.
Figure 5: Some failure cases.
