VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

Peng Liu, Haozhan Shen, Chunxin Fang, Zhicheng Sun, Jiajia Liao, Tiancheng Zhao

TL;DR

VLM-FO1 addresses the gap between high-level reasoning and fine-grained perception in Vision-Language Models by reframing object-centric perception as a feature retrieval task rather than a coordinate generation problem. It introduces a plug-and-play Hybrid Fine-grained Region Encoder (HFRE) with a dual vision encoder to produce rich region tokens, and a token-based grounding scheme that links language to specific visual regions via region indices. A two-stage training strategy (Region-Language Alignment followed by Perception Instruction Finetuning) enables strong fine-grained perception without eroding the VLM's general capabilities. Across COCO, REC, counting, OCR, and grounding tasks, VLM-FO1 achieves state-of-the-art performance while preserving base model performance, demonstrating a scalable path to perception-aware VLMs with practical impact on grounding, region understanding, and visual reasoning. The modular design and the integration of an external detector (OPN) offer a flexible framework for deploying perception-enhanced VLMs in real-world settings.

Abstract

Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generational understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model's general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.
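The token-based referencing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `<regionN>` token format, the prompt layout, and the helper names `build_region_prompt` and `resolve_regions` are assumptions made for illustration, and the model's answer is a hard-coded stand-in. It shows only the retrieval framing: candidate boxes come from outside the language model, the model refers to them by index, and the indices are mapped back to boxes, so no coordinates are ever generated as text.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Hypothetical token format; the actual tokens used by VLM-FO1 may differ.
REGION_TOKEN = re.compile(r"<region(\d+)>")


def build_region_prompt(question: str, boxes: List[Box]) -> str:
    """List one index token per candidate region ahead of the question."""
    region_list = " ".join(f"<region{i}>" for i in range(len(boxes)))
    return f"Regions: {region_list}\n{question}"


def resolve_regions(answer: str, boxes: List[Box]) -> List[Box]:
    """Map region tokens found in an answer back to boxes, skipping invalid indices."""
    resolved = []
    for match in REGION_TOKEN.finditer(answer):
        idx = int(match.group(1))
        if 0 <= idx < len(boxes):
            resolved.append(boxes[idx])
    return resolved


# Candidate boxes would come from an external proposal source (assumed here).
boxes = [
    (10.0, 20.0, 110.0, 220.0),
    (150.0, 30.0, 300.0, 200.0),
    (5.0, 5.0, 50.0, 50.0),
]
prompt = build_region_prompt("Which regions contain a dog?", boxes)
answer = "The dog is in <region1>."  # stand-in for a model's output
print(resolve_regions(answer, boxes))  # [(150.0, 30.0, 300.0, 200.0)]
```

In this framing the language model only has to pick an index from a fixed set of candidates, which is the sense in which localization becomes retrieval. In the framework described above, each index would additionally be backed by a region token embedding produced by the HFRE, which this sketch omits.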

Paper Structure

This paper contains 19 sections, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Visualization of VLM-FO1's perception abilities on diverse visual tasks.
  • Figure 2: An overview of our proposed model architecture. The components enclosed within the blue dotted line represent a standard pretrained VLM, which can be initialized with existing weights to preserve its original performance. Our method introduces external modules, including a dual-vision encoder system and a Hybrid Fine-grained Region Encoder (HFRE), to enrich the VLM's fine-grained perception. These modules process an image to extract and fuse multi-scale region features, which are then fed as new region tokens to the VLM. The inset on the right details the generation of the Hybrid Region Feature.
  • Figure 3: Visualization of VLM-FO1's perception abilities on the object detection (OD) task.
  • Figure 4: Visualization of VLM-FO1's perception abilities on the referring expression comprehension (REC) task.
  • Figure 5: Visualization of VLM-FO1's perception abilities on the object counting task.
  • ...and 7 more figures