Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang

TL;DR

This work tackles the challenge of locating all objects described in free-form text at varying granularities using Large Vision Language Models (LVLMs). It introduces a Language-prompted Localization Dataset and Griffon, a purely LVLM-based localization baseline that uses a unified output format and a two-stage instruction-tuning pipeline, complemented by a training-free confidence scorer. Griffon achieves state-of-the-art results on the RefCOCO series and Flickr30K Entities and approaches Faster RCNN on MSCOCO object detection, demonstrating that open-ended LVLMs can perform fine-grained localization without external detectors or specialized heads. The paper provides a data-and-methodology blueprint for closing the localization gap in LVLMs and lays groundwork for broader integration of localization tasks into unified vision-language systems.

Abstract

Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Large Vision Language Models (LVLMs). Current LVLMs are predominantly constrained to locate a single, pre-existing object. This limitation leads to a compromise in model design, necessitating the introduction of visual expert models or customized head structures. Beyond these constraints, our research uncovers LVLMs' capability for basic object perception, allowing them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel Language-prompted Localization Dataset to fully unleash the capabilities of LVLMs in fine-grained object perception and precise location awareness. More importantly, we present Griffon, a purely LVLM-based baseline, which does not introduce any special tokens, expert models, or additional detection modules. It simply maintains a consistent structure with popular LVLMs by unifying data formats across various localization-related scenarios and is trained end-to-end through a well-designed pipeline. Comprehensive experiments demonstrate that Griffon not only achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities but also approaches the capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO. Data, codes, and models are released at https://github.com/jefferyZhan/Griffon.
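
The abstract notes that Griffon expresses locations as ordinary text, without special tokens or a detection head. As a rough illustration of what consuming such text output could look like downstream, the sketch below parses a reply into labeled boxes. The `label-[x1, y1, x2, y2]` pattern, the `&` separator, the `None` reply for absent objects, and the names `Box` and `parse_boxes` are all assumptions made for this example; they are not taken from the paper or its released code.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    """One labeled bounding box (hypothetical container for this sketch)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


# Assumed text pattern: "<label>-[x1, y1, x2, y2]", entries joined by "&".
# This format is an illustration only, not the one documented by the authors.
_BOX_RE = re.compile(r"(?P<label>[^\[\]&]+?)-\[\s*(?P<coords>[-\d.,\s]+?)\s*\]")


def parse_boxes(text: str) -> List[Box]:
    """Parse plain-text localization output into a list of boxes.

    A reply of "None" (assumed convention for a non-existing object)
    yields an empty list, as does text with no well-formed entry.
    """
    if text.strip().lower() == "none":
        return []
    boxes: List[Box] = []
    for match in _BOX_RE.finditer(text):
        parts = [p.strip() for p in match.group("coords").split(",")]
        if len(parts) != 4:
            continue  # skip entries that do not have exactly four coordinates
        try:
            x1, y1, x2, y2 = (float(p) for p in parts)
        except ValueError:
            continue  # skip entries with malformed numbers
        boxes.append(Box(match.group("label").strip(), x1, y1, x2, y2))
    return boxes


if __name__ == "__main__":
    reply = "person-[0.1, 0.2, 0.5, 0.9]&dog-[0.4, 0.5, 0.8, 1.0]"
    for box in parse_boxes(reply):
        print(box)
```

Because the output is plain text, a single parser of this kind can serve every scenario, from a single referent to many objects to a refusal, which is the practical appeal of a unified format.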

Paper Structure

This paper contains 25 sections, 4 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Four Types of Localization-Related Scenarios. The overall localization task is partitioned into four scenarios based on the number of labels and the number of objects involved. Current LVLMs fail both to reject non-existent objects and to detect multiple objects from single-target or multi-target descriptions (referents, categories, phrases, etc.).
  • Figure 2: Data Generation and Training Procedure. Griffon follows a progressive two-stage training pipeline using the dataset built in stage 0. Distinct modules of Griffon are trained at each stage. A red flame marks a module that is being trained at that stage, while a gray snowflake marks one that is not.
  • Figure 3: Samples of the Language-Prompted Localization Dataset. All images of the benchmark are collected and filtered from public datasets. Instructions are generated with the proposed method using GPT-4V [openai2023gpt4]. Red indicates that the object does not exist in the image, while green indicates that it does.
  • Figure 4: Visualization results of Qwen-VL [Qwen-VL], Grounding DINO [liu2023grounding], and Griffon across all four scenarios. Abbreviations are used for the four scenario names.