When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

Samer Al-Hamadani

TL;DR

This work compares supervised YOLO with zero-shot vision-language models (Gemini Flash 2.5 and GPT-4) through a Total Cost of Ownership lens, using 5,000 COCO validation images and 500 novel product images. YOLO attains higher accuracy on standard categories (91.2% vs 68.5–71.3%) but incurs substantial annotation and training costs, while the VLMs have near-zero upfront cost, ongoing API charges, and zero-shot accuracy that varies with how well an object is represented on the web. The authors derive a break-even point of about 55 million inferences for 100-category systems, which implies that VLMs are the more cost-effective choice at typical volumes: Gemini is often the best option for cost-sensitive deployments, and GPT-4 is useful only in small-scale, accuracy-critical contexts. The study offers decision frameworks and scenario-based guidance for choosing a detection architecture based on deployment volume, category stability, and budget, and shows that practical economics can favor zero-shot approaches even in settings traditionally dominated by supervised training.

Abstract

Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Evaluating on 5,000 stratified COCO images and 500 diverse product images, and combining the results with Total Cost of Ownership modeling, we derive break-even thresholds for architecture selection. Results show that supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4 on standard categories; the annotation expense for a 100-category system is $10,800, and the accuracy advantage only pays off beyond 55 million inferences (151,000 images/day for one year). On diverse product categories, Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. Cost per correct detection favors Gemini ($0.00050) and GPT-4 ($0.00067) over YOLO ($0.143) at 100,000 inferences. We provide decision frameworks showing that the optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.
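
The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not the paper's cost model: the function names are hypothetical, cost per correct detection is assumed to mean total spend divided by the expected number of correct detections, and the break-even helper assumes one option with an upfront cost and one without. Only the constants marked as quoted come from the abstract.

```python
def cost_per_correct_detection(fixed_cost, unit_cost, n_inferences, accuracy):
    """Total spend divided by the expected number of correct detections."""
    total_cost = fixed_cost + unit_cost * n_inferences
    return total_cost / (n_inferences * accuracy)


def break_even_volume(fixed_cost_a, unit_cost_a, unit_cost_b):
    """Inference count at which option A (upfront cost, cheaper per call)
    matches option B (no upfront cost). Returns None if A never catches up."""
    unit_gap = unit_cost_b - unit_cost_a
    if unit_gap <= 0:
        return None
    return fixed_cost_a / unit_gap


# Figures quoted in the abstract.
ANNOTATION_COST = 10_800          # dollars, 100-category system
YOLO_ACCURACY = 0.912
N_INFERENCES = 100_000
BREAK_EVEN_INFERENCES = 55_000_000

# Assumption: annotation is the only cost counted here (no training or serving).
annotation_only = cost_per_correct_detection(
    ANNOTATION_COST, 0.0, N_INFERENCES, YOLO_ACCURACY
)
print(f"annotation-only cost per correct detection: ${annotation_only:.4f}")

# Daily volume needed to reach the quoted break-even within one year.
print(f"images per day: {BREAK_EVEN_INFERENCES / 365:,.0f}")

# Hypothetical inputs, used only to exercise break_even_volume.
print(break_even_volume(fixed_cost_a=10_000, unit_cost_a=0.0, unit_cost_b=0.0002))
```

Under these assumptions, annotation alone accounts for roughly $0.118 of the quoted $0.143 per correct YOLO detection at 100,000 inferences, so the remainder would have to come from costs the abstract does not itemize. The daily-volume line reproduces the quoted figure of about 151,000 images per day.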

Paper Structure

This paper contains 13 sections, 10 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Architectural comparison of supervised YOLO versus zero-shot Gemini VLM and GPT-4.
  • Figure 2: Total Cost of Ownership evolution across inference volumes for a 100-category detection system.
  • Figure 3: Tier 1: Highly web-prevalent consumer products (2020-2023).
  • Figure 4: Tier 2: Moderately prevalent products with niche coverage.
  • Figure 5: Tier 3: Rare specialized equipment with minimal web presence.
  • ...and 7 more figures