GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning

Rui Tang; Guankun Wang; Long Bai; Huxin Gao; Jiewen Lai; Chi Kit Ng; Jiazheng Wang; Fan Zhang; Hongliang Ren

TL;DR

GeoLanG tackles language-guided grasping in open-world, cluttered environments by unifying RGB-D perception and natural language in a shared representation space. It introduces a Depth-guided Geometric Module (DGGM) that injects depth-derived geometric priors into attention, and Adaptive Dense Channel Integration (ADCI), which fuses multi-layer visual features, all within a CLIP-based end-to-end framework (CLIP-VMamba + CLIP-BERT). On OCID-VLG, GeoLanG achieves state-of-the-art performance in both segmentation and grasping and generalizes well to unseen objects; real-robot experiments support these results. The approach reduces dependence on external detectors, improves robustness to occlusions and low-texture regions, and advances multimodal manipulation in human-centered settings by aligning semantic and spatial cues for precise referring grasping.

Abstract

Language-guided grasping has emerged as a promising paradigm for enabling robots to identify and manipulate target objects through natural language instructions, yet it remains highly challenging in cluttered or occluded scenes. Existing methods often rely on multi-stage pipelines that separate object perception and grasping, which leads to limited cross-modal fusion, redundant computation, and poor generalization in cluttered, occluded, or low-texture scenes. To address these limitations, we propose GeoLanG, an end-to-end multi-task framework built upon the CLIP architecture that unifies visual and linguistic inputs into a shared representation space for robust semantic alignment and improved generalization. To enhance target discrimination under occlusion and low-texture conditions, we explore a more effective use of depth information through the Depth-guided Geometric Module (DGGM), which converts depth into explicit geometric priors and injects them into the attention mechanism without additional computational overhead. In addition, we propose Adaptive Dense Channel Integration (ADCI), which adaptively balances the contributions of multi-layer features to produce more discriminative and generalizable visual representations. Extensive experiments on the OCID-VLG dataset, as well as in both simulation and real-world hardware, demonstrate that GeoLanG enables precise and robust language-guided grasping in complex, cluttered environments, paving the way toward more reliable multimodal robotic manipulation in real-world human-centric settings.
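
The idea of injecting a depth-derived prior into attention can be illustrated with a small NumPy sketch. This is not the authors' DGGM formulation, which this page does not specify; it assumes one simple variant in which the prior is an additive penalty on the attention logits, proportional to the absolute depth difference between two tokens. The function name `depth_biased_attention` and the `decay` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def depth_biased_attention(q, k, v, depth, decay=1.0):
    """Single-head attention with an additive depth-difference bias.

    q, k, v : arrays of shape (n_tokens, dim)
    depth   : array of shape (n_tokens,), one depth value per token
    decay   : hypothetical strength of the geometric prior (assumption)
    """
    dim = q.shape[-1]
    # Standard scaled dot-product logits.
    logits = q @ k.T / np.sqrt(dim)
    # Assumed prior: tokens at similar depth attend to each other more.
    bias = -decay * np.abs(depth[:, None] - depth[None, :])
    logits = logits + bias
    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))
    k = rng.normal(size=(4, 8))
    v = rng.normal(size=(4, 8))
    depth = np.array([0.5, 0.6, 2.0, 2.1])  # two near tokens, two far tokens
    out, weights = depth_biased_attention(q, k, v, depth, decay=2.0)
    print(out.shape, weights.shape)
```

Adding the bias to the logits is equivalent to multiplying the unnormalized attention weights by `exp(-decay * |depth_i - depth_j|)`, so tokens on the same surface reinforce each other while tokens separated in depth are suppressed. Setting `decay=0` recovers plain attention.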

Paper Structure (15 sections, 10 equations, 7 figures, 4 tables, 1 algorithm)

Introduction
Method
Overall Architecture of GeoLanG
Depth-guided Geometric Module
Adaptive Dense Channel Integration
Experiment
Dataset and Experiment setup
Experimental Setup
Evaluation Metrics
Comparison with baselines on OCID-VLG
Comparison with baselines on unseen items (OCID-VLG)
Ablation Studies
Robot Experiments
Discussion
Conclusions

Figures (7)

Figure 1: Language-Guided Multimodal Perception for 6-DoF Robotic Grasping in Cluttered Environments.
Figure 2: An overview of the GeoLanG framework. Given an RGB-D image and a language query, the text encoder extracts high-level semantic features while the RGB encoder processes multi-scale visual information. Depth-derived geometric priors are incorporated via the Depth-guided Geometric Module (DGGM) to enhance structural cues. Multi-layer visual features are then optimized through ADCI before being fed into dual-path projectors that generate pixel-level segmentation masks and refine the grasp pose for the target object.
Figure 3: The overview of the Depth-guided Geometric Module (DGGM). The pink rectangle represents image features extracted from the RGB encoder, the green rectangle denotes the learned geometry prior, and the yellow rectangle indicates the spatial prior computed from depth and RGB. $\otimes$ represents matrix multiplication and $\odot$ represents element-wise multiplication. The output feature integrates visual information with geometric and spatial cues to enhance multi-scale representations.
Figure 4: Examples from the OCID-VLG dataset [tziafas2023language]. (a) Single item. (b) Multiple items in a cluttered scene. (c) Multiple objects in cluttered, overlapping scenes.
Figure 5: Qualitative comparison against SOTA solutions on the OCID-VLG dataset.
...and 2 more figures
