GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning
Rui Tang, Guankun Wang, Long Bai, Huxin Gao, Jiewen Lai, Chi Kit Ng, Jiazheng Wang, Fan Zhang, Hongliang Ren
TL;DR
GeoLanG tackles language-guided grasping in open-world, cluttered environments by unifying RGB-D perception and natural language in a shared representation space. It introduces a Depth-Guided Geometric Module (DGGM), which injects depth-derived geometric priors into attention, and Adaptive Dense Channel Integration (ADCI), which fuses multi-layer visual features, all within a CLIP-based end-to-end framework (CLIP-VMamba + CLIP-BERT). On OCID-VLG, GeoLanG achieves state-of-the-art performance in both segmentation and grasping and generalizes well to unseen objects, a result supported by real-robot experiments. The approach reduces dependence on external detectors, improves robustness to occlusions and low-texture regions, and advances multimodal manipulation in human-centered settings by aligning semantic and spatial cues for precise referring grasping.
Abstract
Language-guided grasping has emerged as a promising paradigm for enabling robots to identify and manipulate target objects through natural language instructions, yet it remains highly challenging in cluttered or occluded scenes. Existing methods often rely on multi-stage pipelines that separate object perception from grasping, which leads to limited cross-modal fusion, redundant computation, and poor generalization in cluttered, occluded, or low-texture scenes. To address these limitations, we propose GeoLanG, an end-to-end multi-task framework built upon the CLIP architecture that unifies visual and linguistic inputs into a shared representation space for robust semantic alignment and improved generalization. To enhance target discrimination under occlusion and low-texture conditions, we explore a more effective use of depth information through the Depth-Guided Geometric Module (DGGM), which converts depth into explicit geometric priors and injects them into the attention mechanism without additional computational overhead. In addition, we propose Adaptive Dense Channel Integration (ADCI), which adaptively balances the contributions of multi-layer features to produce more discriminative and generalizable visual representations. Extensive experiments on the OCID-VLG dataset, as well as in simulation and on real-world hardware, demonstrate that GeoLanG enables precise and robust language-guided grasping in complex, cluttered environments, paving the way toward more reliable multimodal robotic manipulation in real-world human-centric settings.
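The abstract does not spell out how DGGM injects geometric priors into attention or how ADCI weights its layers, so the following is only a minimal sketch of two generic techniques that fit the description, not the authors' implementation. It assumes the depth prior takes the form of an additive attention bias that penalizes depth differences between tokens, and that layer balancing is a softmax-weighted sum of per-layer features. The names `depth_biased_attention`, `fuse_layers`, and the strength parameter `lam` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def depth_biased_attention(queries, keys, values, depths, lam=1.0):
    """Scaled dot-product attention with an additive depth-difference bias.

    Assumption (not from the paper): the bias for token pair (i, j) is
    -lam * |depths[i] - depths[j]|, so tokens at similar depth attend to
    each other more strongly. With lam = 0 this is plain attention.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            - lam * abs(depths[i] - depths[j])
            for j, k in enumerate(keys)
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors for query token i.
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values)))
             for c in range(len(values[0]))]
        )
    return outputs, weights


def fuse_layers(layer_feats, logits):
    """Softmax-weighted sum of per-layer feature vectors.

    Assumption (not from the paper): one scalar logit per layer, which
    would be learned in practice; here it is passed in directly.
    """
    w = softmax(logits)
    return [
        sum(w[l] * layer_feats[l][c] for l in range(len(layer_feats)))
        for c in range(len(layer_feats[0]))
    ]


if __name__ == "__main__":
    # Three tokens with identical appearance features; only depth differs.
    feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    vals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    depths = [0.50, 0.55, 2.00]  # metres; token 2 is far behind tokens 0 and 1
    _, attn = depth_biased_attention(feats, feats, vals, depths, lam=2.0)
    print("attention row for token 0:", attn[0])
    print("fused:", fuse_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0]))
```

In this toy setup the three tokens are indistinguishable by appearance, so any preference in the attention row comes from the depth bias alone, which is the kind of cue that could help in low-texture or occluded regions. Because the bias is added to the attention scores, it introduces no new learned parameters beyond `lam`. The sketch does not demonstrate the paper's claim of zero additional computational overhead.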
