CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks

Yu Qi, Yumeng Zhang, Chenting Gong, Xiao Tan, Weiming Zhang, Wei Zhang, Jingdong Wang

TL;DR

CoT4Det reframes perception-centric vision-language tasks as a three-stage chain of thought—classification, counting, and grounding—to better leverage LVLM reasoning. By freezing the vision encoder and training only the language model on a mixed corpus of detection-style and general vision-language data, it achieves large mAP gains on COCO and strong grounding performance without architectural changes. The method also preserves general VQA capabilities, evidenced by competitive MME and MMBench results, and ablations show that the structured reasoning, not just higher input resolution, drives improvements. This work suggests that explicit, interpretable reasoning chains can unlock dense perception capabilities in LVLMs with minimal architectural modification.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, struggling in particular with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by 2% on the RefCOCO series and 19% on Flickr30k Entities.
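
The three steps named in the abstract (classification, counting, grounding) can be pictured as a small control flow. The following is only an illustrative sketch, not the authors' implementation: the function `detect_with_chain`, the three callables, the box format, and the toy scene are all hypothetical stand-ins for LVLM queries, and the source does not state whether the steps are issued as separate queries or within a single response.

```python
from typing import Callable, Dict, List, Tuple

# Assumed box format for illustration: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]

# Hypothetical interfaces: each callable stands in for one LVLM query.
Classify = Callable[[List[str]], List[str]]   # candidates -> categories present
Count = Callable[[str], int]                  # category -> number of instances
Ground = Callable[[str, int], List[Box]]      # category, count -> boxes


def detect_with_chain(
    candidates: List[str],
    classify: Classify,
    count: Count,
    ground: Ground,
) -> Dict[str, List[Box]]:
    """Run classification, then counting, then grounding, in that order."""
    detections: Dict[str, List[Box]] = {}
    # Step 1: keep only the categories judged to be present in the image.
    for category in classify(candidates):
        # Step 2: ask how many instances of this category there are.
        n = count(category)
        if n <= 0:
            continue
        # Step 3: request boxes, capped at the count from step 2.
        detections[category] = ground(category, n)[:n]
    return detections


# Toy stand-in for a model: a fixed, made-up "scene" used only for the demo.
SCENE: Dict[str, List[Box]] = {
    "person": [(10, 20, 50, 120), (60, 25, 95, 118)],
    "umbrella": [(12, 5, 48, 30)],
}


def toy_classify(candidates: List[str]) -> List[str]:
    return [c for c in candidates if c in SCENE]


def toy_count(category: str) -> int:
    return len(SCENE[category])


def toy_ground(category: str, n: int) -> List[Box]:
    return SCENE[category][:n]


result = detect_with_chain(
    ["person", "dog", "umbrella"], toy_classify, toy_count, toy_ground
)
```

In this sketch, rejecting absent categories in the first step and capping boxes by the counted number are one plausible way such a chain could address the failure cases the paper lists for LVLMs (predictions for nonexistent objects and redundant boxes); whether CoT4Det enforces the count this way is an assumption.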

Paper Structure

This paper contains 17 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: An illustration of the proposed CoT4Det framework. The object detection task is reformulated into a multi-step reasoning process comprising (1) object category classification, (2) instance counting, and (3) spatial grounding. The proposed model demonstrates improved performance in challenging visual scenarios, including small objects, crowded scenes, and open-vocabulary settings.
  • Figure 2: Common failure cases of LVLMs in object localization tasks, including redundant predictions, inability to reject nonexistent objects, and low recall in dense scenes.
  • Figure 3: Qualitative comparison between Qwen2.5-VL-7B-Instruct and our CoT4Det-7B across several dense scenes with abundant small objects. CoT4Det demonstrates superior capability in detecting small and tightly clustered objects (such as people in crowds, umbrellas, or chairs in indoor environments) while significantly reducing redundant predictions and false positives. Ground-truth annotations are shown for reference.