HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models

Vineet Bhat; Prashanth Krishnamurthy; Ramesh Karri; Farshad Khorrami

TL;DR

HiFi-CS introduces a lightweight, language-conditioned visual grounding model for robotic grasping that freezes CLIP and augments a small decoder with hierarchical FiLM fusion to handle complex referring expressions. It reaches an average IoU of about 87% in closed-vocabulary visual grounding with roughly 6M parameters, outperforming baselines, including open-set detectors, in cluttered robotic settings. The approach also enables open-vocabulary grounding by guiding open-set detectors like GroundedSAM, and demonstrates real-world efficacy on a 7-DOF robot with 90.33% visual grounding accuracy across 15 scenes. Together, these results advance robust, data-efficient grounding for autonomously locating referred objects and planning grasps in open and cluttered environments.
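The IoU figure quoted above can be read as the overlap between a predicted object mask and its ground-truth mask. The following is a minimal sketch of that metric for binary masks, assuming masks are given as nested lists of 0/1 values; the function name `mask_iou` and the convention for two empty masks are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection over Union of two same-sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# One pixel overlaps and three pixels are covered by either mask.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
print(mask_iou(pred, target))
```

An average IoU would then be the mean of this value over all evaluated image and query pairs.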

Abstract

Robots interacting with humans through natural language can unlock numerous applications such as Referring Grasp Synthesis (RGS). Given a text query, RGS determines a stable grasp pose to manipulate the referred object in the robot's workspace. RGS comprises two steps: visual grounding and grasp pose estimation. Recent studies leverage powerful Vision-Language Models (VLMs) for visually grounding free-flowing natural language in real-world robotic execution. However, comparisons in complex, cluttered environments with multiple instances of the same object are lacking. This paper introduces HiFi-CS, featuring hierarchical application of Feature-wise Linear Modulation (FiLM) to fuse image and text embeddings, enhancing visual grounding for the complex, attribute-rich text queries encountered in robotic grasping. Visual grounding associates an object in 2D/3D space with natural language input and is studied in two scenarios: closed vocabulary and open vocabulary. HiFi-CS features a lightweight decoder combined with a frozen VLM and outperforms competitive baselines in closed-vocabulary settings while being 100x smaller in size. Our model can effectively guide open-set object detectors like GroundedSAM to enhance open-vocabulary performance. We validate our approach through real-world RGS experiments using a 7-DOF robotic arm, achieving 90.33% visual grounding accuracy in 15 tabletop scenes. Our codebase is provided here: https://github.com/vineet2104/hifics
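To make the FiLM fusion mentioned in the abstract concrete, here is a minimal, dependency-free sketch of a single FiLM step: a text embedding is mapped to a per-channel scale (gamma) and shift (beta), which then modulate every image token. The helper names, the plain-list representation, and the toy weights are assumptions for illustration only; this is not the authors' implementation, and how HiFi-CS combines the modulated features across layers in its decoder is not modeled here.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map W @ x + b, written out for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: Matrix, text: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Matrix:
    """Scale channel c of every token by gamma[c] and shift it by beta[c]."""
    gamma = linear(w_gamma, b_gamma, text)  # per-channel scale from the text
    beta = linear(w_beta, b_beta, text)     # per-channel shift from the text
    return [[g * f + b for f, g, b in zip(token, gamma, beta)] for token in features]


# Toy example (hypothetical values): 2 image tokens with 2 channels each.
features = [[1.0, 2.0], [3.0, 4.0]]
text = [1.0, 0.0]
w_gamma, b_gamma = [[2.0, 0.0], [0.0, 1.0]], [0.0, 1.0]  # gamma = [2, 1]
w_beta, b_beta = [[1.0, 0.0], [1.0, 0.0]], [0.0, 0.0]    # beta  = [1, 1]
print(film(features, text, w_gamma, b_gamma, w_beta, b_beta))
```

A hierarchical variant would, under the same assumptions, apply `film` with separate weights to image features taken from several encoder layers rather than from a single one.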

Paper Structure (11 sections, 2 equations, 4 figures, 5 tables)

Introduction
Background and Related Work
Proposed Method: Hierarchical FiLM - ClipSeg (HiFi-CS)
Experimental Results
Datasets
Baselines
Experimental Setup
Closed Vocabulary
Open Vocabulary
Real World Experiments
Conclusion and Future Work

Figures (4)

Figure 1: Referring Grasp Synthesis converts a free-flowing language query to a robot grasp pose.
Figure 2: HiFi-CS for Robotic Visual Grounding. Left: Blue modules are frozen; we choose K = {1, 3, 5, 7, 9}. Right: Zoomed-in view of the trainable decoder. ViT: Vision Transformer.
Figure 3: Comparing Visual Grounding baselines across text queries. More attributes increase complexity, requiring the instance mask to be conditioned on properties like color, shape, and position.
Figure 4: Language Guided Object Manipulation. Left: Robot captures top view image. Right: Referred object grasp is executed.
