LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models

Pranav Saxena, Avigyan Bhattacharya, Ji Zhang, Wenshan Wang

TL;DR

The paper tackles referential grounding in outdoor driving scenes, where large, diverse, and dynamic environments make it hard to link language to objects. It introduces LLM-RG, a modular hybrid pipeline that uses vision-language models (VLMs) for fine-grained attribute extraction and large language models (LLMs) for symbolic reasoning, operating zero-shot with no task-specific fine-tuning. The pipeline first uses an LLM to filter object categories, detects candidates with an open-vocabulary detector, enriches each candidate with a VLM-generated description, and then has an LLM reason over a structured prompt with chain-of-thought to identify the target bounding box. On the Talk2Car benchmark, LLM-RG achieves substantial gains over purely LLM-based and purely VLM-based baselines, and ablations show that adding 3D spatial cues improves grounding further, demonstrating the practical potential of a modular, training-free grounding framework for outdoor scenes. The work highlights the complementary strengths of VLMs and LLMs and points toward future integration of richer modalities and temporal reasoning for robust real-world grounding.

Abstract

Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., "the black car on the right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models (VLMs) for fine-grained attribute extraction with large language models (LLMs) for symbolic reasoning. Given an image and a free-form referring expression, LLM-RG uses an LLM to extract the relevant object types and attributes, detects candidate regions, generates rich visual descriptors with a VLM, and then combines these descriptors with spatial metadata into natural-language prompts that an LLM processes with chain-of-thought reasoning to identify the referent's bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM-based and VLM-based baselines. Additionally, our ablations show that adding 3D spatial cues further improves grounding. Our results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
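
The prompt-building step described in the abstract, where per-object descriptors and spatial metadata are combined into a natural-language prompt, might look roughly like the sketch below. This is an illustration only, not the authors' implementation: the Candidate class, its field names, the build_grounding_prompt function, and the prompt wording are all hypothetical, and the paper's actual prompt format is not given here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One detected object. All field names are hypothetical."""
    object_id: int
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    description: str                # caption a VLM produced for the object crop


def build_grounding_prompt(expression: str, candidates: List[Candidate]) -> str:
    """Combine the expression, object IDs, boxes, and captions into one prompt.

    The wording is an assumption; only the overall idea (IDs + spatial
    metadata + descriptions, followed by a chain-of-thought request)
    comes from the abstract.
    """
    lines = [
        f'Referring expression: "{expression}"',
        "Candidate objects:",
    ]
    for c in candidates:
        x0, y0, x1, y1 = c.box
        lines.append(
            f"- id={c.object_id}, box=({x0}, {y0}, {x1}, {y1}): {c.description}"
        )
    # Ask for step-by-step reasoning before the final answer.
    lines.append("Think step by step, then answer with the id of the referred object.")
    return "\n".join(lines)
```

Keeping the object IDs explicit in the prompt means the LLM's answer can be mapped back to a bounding box with a simple lookup.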

Paper Structure

This paper contains 15 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An example output of LLM-RG on a scene from Talk2Car. The red box denotes the ground-truth bounding box from Talk2Car; the green box denotes the bounding box predicted by LLM-RG.
  • Figure 2: Architecture of LLM-RG: (A) A large language model (LLM) processes the referring expression to identify relevant object types and attributes, generating a shortlist of candidate objects. (B) MMDetection is used to detect objects and obtain 2D bounding boxes. (C) Object crops for each detection are extracted and passed to a vision-language model (VLM), which provides fine-grained descriptions of each candidate object, capturing properties such as color, type, orientation, and contextual details. (D) The LLM combines object IDs, spatial locations, and object descriptions to reason over the referring expression and identify the bounding box of the target object.
  • Figure 3: Example of an object caption from a VLM that includes fine-grained attributes to be used for further reasoning.
  • Figure 4: Qualitative results of LLM-RG on Talk2Car (first row) and on a mecanum robot (second row). The red box denotes the ground-truth bounding box from Talk2Car; the green box denotes the bounding box predicted by LLM-RG.
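
The four stages (A) to (D) in the Figure 2 caption can be read as a simple data flow. The sketch below wires them together with placeholder callables; it is a minimal illustration under stated assumptions, not the authors' code. The function ground_expression and all parameter names are hypothetical, and no real LLM, MMDetection, or VLM API is called: each stage is passed in as a plain Python function.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Described = Tuple[int, Box, str]          # (object id, box, VLM caption)


def ground_expression(
    image: Any,
    expression: str,
    extract_categories: Callable[[str], Sequence[str]],      # stage A: LLM
    detect: Callable[[Any, Sequence[str]], Sequence[Box]],   # stage B: detector
    describe_crop: Callable[[Any, Box], str],                # stage C: VLM
    choose_id: Callable[[str, Sequence[Described]], int],    # stage D: LLM
) -> Optional[Box]:
    """Return the box chosen for the expression, or None if nothing usable."""
    # (A) Shortlist the object categories relevant to the expression.
    categories = extract_categories(expression)
    # (B) Detect 2D boxes for those categories.
    boxes = list(detect(image, categories))
    if not boxes:
        return None
    # (C) Describe each candidate; the ID is its index in the detection list.
    candidates: List[Described] = [
        (i, box, describe_crop(image, box)) for i, box in enumerate(boxes)
    ]
    # (D) Let the reasoning stage pick an ID from IDs, boxes, and captions.
    chosen = choose_id(expression, candidates)
    if not 0 <= chosen < len(boxes):
        return None  # assumption: treat an out-of-range answer as a failure
    return boxes[chosen]
```

Passing the stages as callables keeps the sketch independent of any particular LLM, detector, or VLM, which mirrors the modular, training-free design the caption describes.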