CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks
Yu Qi, Yumeng Zhang, Chenting Gong, Xiao Tan, Weiming Zhang, Wei Zhang, Jingdong Wang
TL;DR
CoT4Det reframes perception-centric vision-language tasks as a three-stage chain of thought (classification, counting, and grounding) to better leverage LVLM reasoning. By freezing the vision encoder and training only the language model on a mixed corpus of detection-style and general vision-language data, it raises COCO2017 val mAP from 19.0% to 33.0% with Qwen2.5-VL-7B-Instruct and achieves strong grounding performance without architectural changes. The method also preserves general VQA capabilities, as evidenced by competitive MME and MMBench results, and ablations show that the structured reasoning, not just higher input resolution, drives the improvements. This work suggests that explicit, interpretable reasoning chains can unlock dense perception capabilities in LVLMs while leaving the architecture unchanged.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks, such as object detection, semantic segmentation, and depth estimation, remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19.0% mAP on COCO2017 val, struggling particularly with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding, each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by 2% on the RefCOCO series and 19% on Flickr30k Entities.
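
To make the classification, counting, and grounding steps concrete, the following is a minimal Python sketch of how such a chain might be driven at inference time. It is an illustration under stated assumptions, not the authors' implementation: the `ask` callable, the prompt wording, the JSON reply format, the `[x1, y1, x2, y2]` box layout, and the per-category loop are all hypothetical, and `fake_lvlm` is a canned stand-in for a real model call.

```python
import json
from typing import Callable, Dict, List

Box = List[float]  # assumed layout: [x1, y1, x2, y2] in pixels


def cot_detect(ask: Callable[[str], str], categories: List[str]) -> Dict[str, List[Box]]:
    """Run a classification -> counting -> grounding chain.

    `ask` is a hypothetical stand-in for an LVLM call that already has the
    image attached and returns the model's reply as a JSON string.
    """
    # Step 1, classification: which candidate categories appear at all?
    reply = ask(
        f"Which of these categories appear in the image? {json.dumps(categories)} "
        "Answer with a JSON list."
    )
    present = [c for c in json.loads(reply) if c in categories]

    detections: Dict[str, List[Box]] = {}
    for category in present:
        # Step 2, counting: how many instances of this category are there?
        count = int(json.loads(ask(
            f"How many instances of '{category}' are in the image? "
            "Answer with a JSON integer."
        )))
        if count <= 0:
            continue
        # Step 3, grounding: ask for that many boxes for this category.
        boxes = json.loads(ask(
            f"Give the bounding boxes of the {count} '{category}' instances "
            "as a JSON list of [x1, y1, x2, y2]."
        ))
        detections[category] = boxes[:count]  # never keep more boxes than counted
    return detections


def fake_lvlm(prompt: str) -> str:
    """Canned replies that mimic a model; purely illustrative."""
    if prompt.startswith("Which"):
        return '["cat", "dog"]'
    if prompt.startswith("How many"):
        return "2" if "'cat'" in prompt else "1"
    if "'cat'" in prompt:
        return "[[10, 20, 50, 80], [60, 25, 110, 90]]"
    return "[[5, 5, 40, 40]]"


result = cot_detect(fake_lvlm, ["cat", "dog", "car"])
```

In this sketch the count from step 2 conditions the grounding prompt and caps the number of boxes kept, which is one plausible way for the earlier steps to constrain the later one; the actual prompting scheme may differ.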
