LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

Haoyu Zhao, Wenhang Ge, Ying-cong Chen

TL;DR

This paper tackles universal visual grounding: locating arbitrary objects in images from open-ended text queries, without additional training. It introduces LLM-Optic, a three-module pipeline: an LLM-based Text Grounder that interprets the query, a Candidate Positioning module that uses Grounding DINO to propose candidate boxes and labels them with numeric marks to link text and image regions, and a Large Multimodal Model (LMM) Visual Grounder that selects the marked candidates that best match the original query. The method achieves state-of-the-art zero-shot performance across RefCOCO, RefCOCOg, and the Description Detection Dataset ($D^{3}$), including a notable 22% improvement on RefCOCOg, while maintaining a fully training-free, modular design that facilitates rapid integration of new models. This work broadens visual grounding from single-object or closed-set tasks to universal grounding, with practical implications for robust understanding of complex natural language queries in vision systems.

Abstract

Visual grounding is an essential tool that links user-provided text queries with query-specific regions within an image. Despite advancements in visual grounding models, their ability to comprehend complex queries remains limited. To overcome this limitation, we introduce LLM-Optic, an innovative method that utilizes Large Language Models (LLMs) as an optical lens to enhance existing visual grounding models in comprehending complex text queries involving intricate text structures, multiple objects, or object spatial relationships, situations that current models struggle with. LLM-Optic first employs an LLM as a Text Grounder to interpret complex text queries and accurately identify objects the user intends to locate. Then a pre-trained visual grounding model is used to generate candidate bounding boxes given the refined query by the Text Grounder. After that, LLM-Optic annotates the candidate bounding boxes with numerical marks to establish a connection between text and specific image regions, thereby linking two distinct modalities. Finally, it employs a Large Multimodal Model (LMM) as a Visual Grounder to select the marked candidate objects that best correspond to the original text query. Through LLM-Optic, we have achieved universal visual grounding, which allows for the detection of arbitrary objects specified by arbitrary human language input. Importantly, our method achieves this enhancement without requiring additional training or fine-tuning. Extensive experiments across various challenging benchmarks demonstrate that LLM-Optic achieves state-of-the-art zero-shot visual grounding capabilities. Project Page: https://haoyu-zhao.github.io/LLM-Optic.github.io/.
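The four steps described in the abstract can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `set_marks`, `ground`, and `MarkedCandidate` are hypothetical, and the three model components (the LLM Text Grounder, the pre-trained grounding model, and the LMM Visual Grounder) are passed in as plain callables and replaced by toy stubs so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class MarkedCandidate:
    mark: int  # numeric mark attached to the box, starting at 1
    box: Box


def set_marks(boxes: Sequence[Box]) -> List[MarkedCandidate]:
    """Assign consecutive numeric marks (1, 2, ...) to candidate boxes."""
    return [MarkedCandidate(mark=i, box=b) for i, b in enumerate(boxes, start=1)]


def ground(
    image: object,
    query: str,
    text_grounder: Callable[[str], str],
    propose_boxes: Callable[[object, str], Sequence[Box]],
    visual_grounder: Callable[[object, str, Sequence[MarkedCandidate]], Sequence[int]],
) -> List[Box]:
    """Run the pipeline; all three callables are assumed interfaces."""
    # 1. Text Grounder: reduce the complex query to the object to locate.
    target = text_grounder(query)
    # 2. Candidate positioning: boxes for the refined query, then numeric marks.
    candidates = set_marks(propose_boxes(image, target))
    # 3. Visual Grounder: pick marks that match the ORIGINAL query.
    #    It may return zero, one, or several marks.
    chosen = set(visual_grounder(image, query, candidates))
    return [c.box for c in candidates if c.mark in chosen]


# Toy stand-ins so the sketch runs without any model (assumptions, not real APIs).
def toy_text_grounder(query: str) -> str:
    return "dog"  # a real LLM would extract the target from the query


def toy_detector(image: object, target: str) -> Sequence[Box]:
    return [(10.0, 20.0, 110.0, 220.0), (150.0, 30.0, 260.0, 210.0)]


def toy_visual_grounder(
    image: object, query: str, candidates: Sequence[MarkedCandidate]
) -> Sequence[int]:
    return [2]  # a real LMM would read the marks drawn on the image


if __name__ == "__main__":
    boxes = ground(
        None,
        "the dog to the right of the bench",
        toy_text_grounder,
        toy_detector,
        toy_visual_grounder,
    )
    print(boxes)
```

Keeping each stage behind a callable mirrors the modular design the abstract describes: any of the three components can be swapped for a newer model without touching the others. Returning a list of boxes rather than a single box also lets the same interface express queries that match several objects or none.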

Paper Structure (36 sections, 12 figures, 3 tables)

Figures (12)

  • Figure 1: LLM-Optic enhances the capabilities of the leading visual grounding model, Grounding DINO, by integrating the reasoning abilities of Large Language Models (LLMs), thereby achieving superior visual grounding accuracy for any given query. Specifically, Grounding DINO exhibits limitations in the following areas: (1) it struggles with complex sentence structures, as demonstrated in Query (A); (2) it faces challenges with queries involving multiple objects and often fails to distinguish the primary object from its landmarks for precise localization (Query (B)); (3) it incorrectly interprets spatial relationships (Query (C)). Our framework effectively addresses each of these issues.
  • Figure 2: Overview of LLM-Optic. We propose using LLMs and LMMs as effective reasoning modules for handling complex user queries to achieve universal visual grounding. Our framework includes three key modules: an LLM-based Text Grounder, a Candidate Positioning and Setting Marks module, and an LMM-based Visual Grounder. It does not require any additional training and features a fully modular design, allowing for the seamless integration of rapid advancements in new technologies.
  • Figure 4: An example of Text Grounder and Visual Grounder output. We have increased the size of the marks to enhance visibility; however, the actual marks are smaller, as shown in the Additional Results in Appendix E, to avoid obscuring the target object.
  • Figure 5: LLM-Optic is capable of handling scenarios where the query does not correspond to any object in the image or corresponds to multiple objects within the image.
  • Figure 6: Failure cases of LLM-Optic.
  • ...and 7 more figures