LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
Pranav Saxena, Avigyan Bhattacharya, Ji Zhang, Wenshan Wang
TL;DR
The paper tackles referential grounding in outdoor driving scenes, where large, diverse, and dynamic environments make it hard to link language to objects. It introduces LLM-RG, a modular hybrid pipeline that uses vision-language models (VLMs) for fine-grained attribute extraction and large language models (LLMs) for symbolic reasoning, operating zero-shot with no task-specific fine-tuning. The pipeline first uses an LLM to filter the relevant object categories, detects candidates with an open-vocabulary detector, enriches each candidate's description with a VLM, and then has an LLM reason over a structured prompt with chain-of-thought to identify the target bounding box. On the Talk2Car benchmark, LLM-RG achieves substantial gains over LLM-based and VLM-based baselines, and ablations show that adding 3D spatial cues improves grounding further, demonstrating the practical potential of a modular, training-free grounding framework for outdoor scenes. The work highlights the complementary strengths of VLMs and LLMs and points toward future integration of richer modalities and temporal reasoning for robust real-world grounding.
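The four stages named above can be read as a simple composition of interchangeable components. The following is a minimal sketch of that control flow, not the authors' implementation: the `Candidate` type, the `ground` function, and the four callables (`filter_categories`, `detect`, `describe`, `reason`) are hypothetical names introduced here for illustration, and each callable stands in for whichever LLM, detector, or VLM is actually used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed pixel units


@dataclass
class Candidate:
    """One detected object (hypothetical container, not from the paper)."""
    box: Box
    category: str
    description: str = ""  # filled in by the VLM stage


def ground(
    image: object,
    expression: str,
    filter_categories: Callable[[str], List[str]],            # LLM stage (assumed interface)
    detect: Callable[[object, List[str]], List[Candidate]],   # open-vocabulary detector (assumed interface)
    describe: Callable[[object, Candidate], str],             # VLM stage (assumed interface)
    reason: Callable[[str, List[Candidate]], int],            # LLM reasoning stage (assumed interface)
) -> Optional[Box]:
    """Run the four stages in order and return the chosen box, or None."""
    # 1. Keep only the object categories relevant to the expression.
    categories = filter_categories(expression)
    # 2. Detect candidate objects of those categories.
    candidates = detect(image, categories)
    if not candidates:
        return None
    # 3. Enrich every candidate with a fine-grained description.
    for candidate in candidates:
        candidate.description = describe(image, candidate)
    # 4. Let the reasoning stage pick one candidate by index.
    index = reason(expression, candidates)
    if not 0 <= index < len(candidates):
        return None  # the reasoning stage returned an unusable index
    return candidates[index].box
```

Because every stage is passed in as a plain callable, any one of them can be swapped or stubbed without touching the others, which is the practical meaning of "modular" and "training-free" in this sketch.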
Abstract
Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., "the black car on the right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning. Given an image and a free-form referring expression, LLM-RG uses an LLM to extract relevant object types and attributes, detects candidate regions, generates rich visual descriptors with a VLM, and combines these descriptors with spatial metadata into a natural-language prompt, over which an LLM performs chain-of-thought reasoning to identify the referent's bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM-based and VLM-based baselines. Additionally, our ablations show that adding 3D spatial cues further improves grounding. Our results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
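To make the prompt-construction step concrete, here is a small sketch of how per-object descriptors and spatial metadata might be serialized into a natural-language prompt, and how a final choice might be read back from the model's reply. Everything in it is an assumption made for illustration: the prompt wording, the left/right label derived from the box center, the `Answer: <id>` convention, and the function names `build_prompt` and `parse_answer` do not come from the paper.

```python
import re
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed pixel units


def build_prompt(expression: str, objects: List[Tuple[Box, str]], image_width: int) -> str:
    """Serialize candidates into a text prompt (illustrative format only)."""
    lines = [f'Referring expression: "{expression}"', "Candidate objects:"]
    for i, (box, descriptor) in enumerate(objects):
        x_min, y_min, x_max, y_max = box
        # Coarse spatial metadata: which half of the image holds the box center.
        center_x = (x_min + x_max) / 2
        side = "left" if center_x < image_width / 2 else "right"
        lines.append(
            f"[{i}] {descriptor}; box=({x_min}, {y_min}, {x_max}, {y_max}); image side={side}"
        )
    lines.append("Think step by step, then finish with a line of the form 'Answer: <id>'.")
    return "\n".join(lines)


def parse_answer(reply: str, num_objects: int) -> Optional[int]:
    """Return the last 'Answer: <id>' in the reply if it names a valid candidate."""
    matches = re.findall(r"Answer:\s*\[?(\d+)\]?", reply)
    if not matches:
        return None
    index = int(matches[-1])
    return index if index < num_objects else None
```

Taking the last `Answer:` match rather than the first is a deliberate choice in this sketch: a chain-of-thought reply may mention tentative answers before settling on a final one.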
