Intelligent Image Search Algorithms Fusing Visual Large Models

Kehan Wang, Tingqiong Cui, Yang Zhang, Yu Chen, Shifeng Wu, Zhenzhang Li

TL;DR

This work addresses the challenge of fine-grained image retrieval requiring both component existence and state understanding, a gap between fast detectors and semantically rich but costly VLMs. It proposes DetVLM, a two-stage Detection–VLM framework that first uses a high-recall YOLO detector for candidate screening and then a Visual Large Model for recall-enhancement, state analysis, and zero-shot recognition guided by dynamic prompts. The approach achieves a state-of-the-art retrieval accuracy of 94.82% and a zero-shot driver-mask detection accuracy of 94.95%, with over 90% average accuracy in complex state judgments, validated on a purpose-built vehicle dataset of 3,400 images. These results demonstrate a practical, scalable solution that integrates basic component search with semantic state reasoning for real-world public security and industrial applications, while offering generalizability to other fine-grained domains.

Abstract

Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: handcrafted features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient, high-recall component-level screening to determine component presence; then, a VLM acts as a recall-enhancement unit, performing secondary verification for components missed by the detector. This architecture directly enables two advanced capabilities: 1) State Search: Guided by task-specific prompts, the VLM refines results by verifying component existence and executing sophisticated state judgments (e.g., "sun visor lowered"), allowing retrieval based on component state. 2) Zero-shot Search: The framework leverages the VLM's inherent zero-shot capability to recognize and retrieve images containing unseen components or attributes (e.g., "driver wearing a mask") without any task-specific training. Experiments on a vehicle component dataset show DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82%, significantly outperforming detection-only baselines. It also attains 94.95% accuracy in zero-shot search for driver mask-wearing and over 90% average accuracy in state search tasks.
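The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Detector` and `VLM` callables, the prompt wording, the `search` function, and the stubbed image names are all hypothetical stand-ins for a real YOLO model and a real VLM.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional, Set

# Hypothetical interfaces (assumptions, not from the paper):
#   Detector: image id -> set of component names the detector found
#   VLM:      (image id, yes/no question) -> True when the answer is "yes"
Detector = Callable[[str], Set[str]]
VLM = Callable[[str, str], bool]


def presence_prompt(component: str) -> str:
    # Assumed prompt wording for existence checks.
    return f"Is a {component} visible in this image? Answer yes or no."


def state_prompt(component: str, state: str) -> str:
    # Assumed prompt wording for state judgments.
    return f"Is the {component} {state}? Answer yes or no."


def search(
    images: Iterable[str],
    component: str,
    detector: Detector,
    vlm: VLM,
    detector_classes: FrozenSet[str],
    state: Optional[str] = None,
) -> List[str]:
    hits = []
    for image in images:
        # Stage 1: detector screening, only for classes the detector knows.
        present = component in detector_classes and component in detector(image)
        # Stage 2a: the VLM re-checks images the detector did not flag.
        # This covers detector misses and components the detector never saw.
        if not present:
            present = vlm(image, presence_prompt(component))
        if not present:
            continue
        # Stage 2b: optional state judgment by the VLM.
        if state is not None and not vlm(image, state_prompt(component, state)):
            continue
        hits.append(image)
    return hits


# Stub models for illustration only; real models would replace these.
IMAGES = ["a.jpg", "b.jpg", "c.jpg"]
KNOWN_CLASSES = frozenset({"sun visor"})


def fake_detector(image: str) -> Set[str]:
    return {"a.jpg": {"sun visor"}, "b.jpg": set(), "c.jpg": set()}[image]


def fake_vlm(image: str, prompt: str) -> bool:
    answers = {
        ("b.jpg", presence_prompt("sun visor")): True,  # detector miss
        ("a.jpg", state_prompt("sun visor", "lowered")): True,
        ("b.jpg", state_prompt("sun visor", "lowered")): False,
        ("c.jpg", presence_prompt("mask")): True,  # class unknown to detector
    }
    return answers.get((image, prompt), False)


presence_hits = search(IMAGES, "sun visor", fake_detector, fake_vlm, KNOWN_CLASSES)
state_hits = search(
    IMAGES, "sun visor", fake_detector, fake_vlm, KNOWN_CLASSES, state="lowered"
)
zero_shot_hits = search(IMAGES, "mask", fake_detector, fake_vlm, KNOWN_CLASSES)
```

In this sketch the VLM is called only when the detector has not confirmed the component or when a state judgment is requested, which reflects the cost argument made in the abstract: the cheap detector handles the easy cases and the expensive model handles the remainder.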

Paper Structure

This paper contains 32 sections, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Framework
  • Figure 2: Annotation Example
  • Figure 3: Wear a mask
  • Figure 4: Without a mask
  • Figure 5: Reflective
  • ...and 2 more figures