FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis

Zhe Chen, Zijing Chen

TL;DR

FLORA tackles zero-shot Object Referring Analysis by introducing a Formal Language Model to regulate LLM outputs into structured object descriptions, coupled with a probabilistic parsing and interpretation framework. The method leverages off-the-shelf models (e.g., Grounding DINO (GDINO), CLIP, SAM) within a Bayesian inference scheme to compute P(x|O) from decomposed cues O = {O_T,O_L,O_V,O_R}, enabling training-free localization and segmentation. Across RefCOCO, RefCOCO+, RefCOCOg, Who’s Waldo, and PhraseCut, FLORA with GDINO achieves state-of-the-art zero-shot results and relative improvements of up to ~45% over strong baselines, demonstrating robust reasoning and interpretability in the presence of LLM hallucinations. The results suggest a practical, data-efficient path for robust visual-language grounding, with potential extensions to other multimodal tasks and real-world scenarios where labeled data are scarce.

Abstract

Object Referring Analysis (ORA), commonly known as referring expression comprehension, requires identifying and localizing specific objects in an image based on natural-language descriptions. Unlike generic object detection, ORA requires both accurate language understanding and precise visual localization, making it inherently more complex. Although recent pre-trained large visual grounding detectors have achieved significant progress, they rely heavily on extensively labeled data and time-consuming training. To address these limitations, we introduce a novel, training-free framework for zero-shot ORA, termed FLORA (Formal Language for Object Referring and Analysis). FLORA harnesses the inherent reasoning capabilities of large language models (LLMs) and integrates a formal language model (FLM), a logical framework that constrains language to structured, rule-based descriptions, to provide effective zero-shot ORA. More specifically, the FLM enables an effective, logic-driven interpretation of object descriptions without any training. Building on FLM-regulated LLM outputs, we further devise a Bayesian inference framework and employ appropriate off-the-shelf interpretive models to finalize the reasoning, delivering favorable robustness against LLM hallucinations and compelling ORA performance in a training-free manner. In practice, FLORA boosts the zero-shot performance of existing pretrained grounding detectors by up to around 45%. Our comprehensive evaluation across several challenging datasets also confirms that FLORA consistently surpasses current state-of-the-art zero-shot methods in both detection and segmentation tasks associated with zero-shot ORA. We believe our probabilistic parsing and reasoning over the LLM outputs improve the reliability and interpretability of zero-shot ORA. We will release the code upon publication.
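
As a rough illustration of the Bayesian inference step, the sketch below scores a handful of candidate objects x against the four decomposed cues named in the TL;DR and normalizes the result into P(x|O). It is not the authors' implementation. It assumes the cues are conditionally independent given x (a naive-Bayes factorization that this summary does not confirm), and the function name, the prior, and all likelihood values are hypothetical placeholders for scores that off-the-shelf models would supply.

```python
import math


def posterior_over_candidates(prior, cue_likelihoods, eps=1e-9):
    """Fuse per-cue likelihoods into a normalized posterior P(x | O).

    ASSUMPTION (not taken from the paper): cues are conditionally
    independent given the candidate x, so
        P(x | O) is proportional to P(x) * product over k of P(O_k | x).
    `prior` holds P(x) per candidate; `cue_likelihoods` maps a cue name
    to a list of P(O_k | x), one entry per candidate.
    """
    # Work in log space so products of small numbers do not underflow.
    log_post = [math.log(max(p, eps)) for p in prior]
    for name, likelihoods in cue_likelihoods.items():
        if len(likelihoods) != len(prior):
            raise ValueError(f"cue {name!r} has the wrong number of scores")
        for i, lik in enumerate(likelihoods):
            # Clamp at eps so a single zero score cannot produce log(0).
            log_post[i] += math.log(max(lik, eps))
    # Subtract the maximum before exponentiating, then normalize.
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores for three candidate boxes; in practice these would
# come from detectors or vision-language models, not hand-written numbers.
prior = [0.5, 0.3, 0.2]
cues = {
    "O_T": [0.9, 0.8, 0.1],
    "O_L": [0.2, 0.7, 0.5],
    "O_V": [0.6, 0.9, 0.4],
    "O_R": [0.5, 0.5, 0.5],  # an uninformative cue leaves the ranking unchanged
}
posterior = posterior_over_candidates(prior, cues)
best = max(range(len(posterior)), key=posterior.__getitem__)
print(f"best candidate: {best}, P(x|O) = {posterior[best]:.3f}")
```

In this toy setup, a cue whose scores are equal across all candidates (as O_R is here) cancels out under normalization and does not change the ranking, and clamping at `eps` keeps a single zero score from making the computation undefined. Whether FLORA treats unreliable cues this way is not stated in this summary.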

Paper Structure

This paper contains 29 sections, 13 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: The concept behind our Formal Language-based Object Referring Analysis (FLORA) framework, which enables outstanding accuracy in a training-free zero-shot setting.
  • Figure 2: Overall pipeline of our training-free zero-shot FLORA.
  • Figure 3: Some visualized results of ORA using the GDINO (Liu et al., 2023) baseline and our FLORA.