When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
Samer Al-Hamadani
TL;DR
This work systematically compares supervised YOLO with zero-shot vision-language models (Gemini Flash 2.5 and GPT-4) through a Total Cost of Ownership lens, using 5,000 COCO validation images and 500 novel product images. YOLO attains higher accuracy (91.2% vs. 68.5% for Gemini and 71.3% for GPT-4) but incurs substantial annotation and training costs, whereas VLMs have near-zero upfront costs, ongoing API charges, and zero-shot accuracy that varies with how well an object is represented on the web. The study derives a break-even point of roughly 55 million inferences for a 100-category system; below that volume VLMs are the more cost-effective choice, with Gemini often the best option for cost-sensitive deployments and GPT-4 useful only in small-scale, accuracy-critical contexts. It also provides decision frameworks and scenario-based guidance for selecting a detection architecture based on deployment volume, category stability, and budget, showing that practical economics can favor zero-shot approaches even in settings traditionally dominated by supervised training.
Abstract
Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO with zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). We evaluate the models on 5,000 stratified COCO images and 500 diverse product images and combine the results with Total Cost of Ownership modeling to derive break-even thresholds for architecture selection. On standard categories, supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4. However, annotating a 100-category system costs $10,800, and the accuracy advantage pays off only beyond 55 million inferences (roughly 151,000 images per day for one year). On diverse product categories, Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. At 100,000 inferences, cost per correct detection favors Gemini ($0.00050) and GPT-4 ($0.00067) over YOLO ($0.143). We provide decision frameworks showing that the optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.
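As a rough illustration of the arithmetic behind these figures, the following minimal sketch computes cost per correct detection and the daily volume implied by a break-even threshold. It is not the paper's Total Cost of Ownership model. The function names are hypothetical, and the formula (total spend divided by expected correct detections) is an assumption about how such a metric is typically defined. The per-inference price used for Gemini is back-calculated from the abstract's $0.00050 and 68.5% figures on the assumption of zero fixed cost; it is not a published API price.

```python
def cost_per_correct_detection(fixed_cost, cost_per_inference, n_inferences, accuracy):
    """Total spend divided by the expected number of correct detections.

    ASSUMPTION: this definition is illustrative; the paper's exact
    cost model is not given in the abstract.
    """
    if n_inferences <= 0 or not 0 < accuracy <= 1:
        raise ValueError("need n_inferences > 0 and 0 < accuracy <= 1")
    total_cost = fixed_cost + cost_per_inference * n_inferences
    return total_cost / (n_inferences * accuracy)


def daily_volume(total_inferences, days=365):
    """Average images per day needed to reach a total within `days`."""
    return total_inferences / days


# Break-even threshold stated in the abstract: 55 million inferences.
per_day = daily_volume(55_000_000)  # about 150,685, i.e. roughly 151,000/day

# ASSUMPTION: zero fixed cost for the VLM, and a per-inference price of
# 0.00050 * 0.685 back-calculated from the abstract's reported figures.
gemini = cost_per_correct_detection(0.0, 0.00050 * 0.685, 100_000, 0.685)

# Annotation cost alone ($10,800) spread over 100,000 inferences at 91.2%.
# This is a lower bound; training and compute costs are not included here.
yolo_annotation_only = cost_per_correct_detection(10_800, 0.0, 100_000, 0.912)
```

Under these assumptions, annotation alone yields about $0.118 per correct detection for YOLO at 100,000 inferences. The abstract's $0.143 is higher, which suggests it includes cost components beyond annotation, such as training, that are not itemized in the abstract.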
