DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection

Gautam Rajendrakumar Gare, Neehar Peri, Matvei Popov, Shruti Jain, John Galeotti, Deva Ramanan

Abstract

Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO

Paper Structure (13 sections, 9 equations, 9 figures, 7 tables, 1 algorithm)

Figures (9)

  • Figure 1: Detection Prompt Optimization. We cast the problem of gradient-free few-shot object detection as multimodal in-context learning (ICL). Here, a frozen multi-modal LLM (MLLM) is presented with a class name, a textual description, and a few visual examples (left), similar to the instructions given to a human annotator tasked with annotating that class [robicheaux2025roboflow100]. Rather than presenting the visual examples directly to the MLLM, we find that it is far more effective to use them to optimize the prompt (right): we use another black-box MLLM to discover prompt instructions that perform better on the few-shot training dataset. These improved instructions are then fed into the target MLLM.
  • Figure 2: Contrastive Prompt Refinement Reduces Class Confusion. At each iteration, we use the current class description to query the MLLM and obtain a set of candidate detections on the training set. From these predictions, we identify true positive, false positive, and false negative detections. The prompt is then refined by asking the MLLM to adjust the class definition such that it explicitly excludes false positives and includes false negatives. This iterative procedure is repeated until convergence. We illustrate a single refinement step above, where the highlighted text indicates newly added details that encode the desired correction differentiating Serve from Attack.
  • Figure 3: Improvement from Contrastive Prompt Refinement. We compare the original baseline prompt against both the initial DetPO prompt and the final optimized DetPO prompt (left). These results demonstrate that the DetPO-optimized prompts consistently improve detection accuracy across nearly all categories. Further, we show that successive refinement iterations improve performance on the training set (right). Note that we plot the change in mAP relative to the initial DetPO prompt. Most domains show strong initial gains that begin to plateau around iteration 6, with the Flora & Fauna and Aerial categories showing the largest overall improvements (+2.8 and +2.5, respectively).
  • Figure 4: Detection Confusion Matrix. We compare Qwen3-VL (30B-A3B), Qwen3-VL with DetPO, and with VQA Score across the Actions, Wb-Prova, and Defect Detections datasets. We find that DetPO and VQA Score consistently resolve baseline class imbalances. Notably, our proposed approach improves true positive rates for underrepresented classes (Juvenile, Piglet) and nuanced actions (Defense, Serve), while mitigating aggressive false positive predictions in defect detection.
  • Figure 5: Detection Errors. We diagnose errors in the baseline Qwen3-VL (30B-A3B) model (left), the proposed DetPO method (center), and DetPO + VQA Score (right) with TIDE [bolya2020tide]. The top row shows the relative distribution of error types, while the bottom row describes the absolute error counts and the overall false positive (FP) versus false negative (FN) rates. DetPO notably reduces classification errors compared to the baseline. While adding VQA Score successfully reduces overall false positives, it shifts the primary error bottleneck to localization (Loc) and significantly increases missed detections (FN).
  • ...and 4 more figures
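
The refinement loop described in the Figure 2 caption (detect on the training set, sort predictions into true positives, false positives, and false negatives, then ask an MLLM to rewrite the class description) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The callables `detect` and `rewrite` are hypothetical stand-ins for the target MLLM and the optimizer MLLM. The greedy IoU matching at a 0.5 threshold, the use of F1 as a stand-in for detection accuracy, and the stopping rule are all assumptions made for the example.

```python
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds: List[Box], gts: List[Box], thr: float = 0.5):
    """Greedy one-to-one matching (assumed scheme); returns (tp, fp, fn) box lists."""
    unmatched = list(gts)
    tp, fp = [], []
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            tp.append(p)
            unmatched.remove(best)
        else:
            fp.append(p)
    return tp, fp, unmatched  # leftover ground-truth boxes are the false negatives


def refine_prompt(
    prompt: str,
    detect: Callable[[str, Any], List[Box]],            # hypothetical: target MLLM
    rewrite: Callable[[str, List[Box], List[Box]], str],  # hypothetical: optimizer MLLM
    train_set: List[Tuple[Any, List[Box]]],
    max_iters: int = 6,                                  # assumed iteration budget
) -> Tuple[str, float]:
    """Iteratively rewrite `prompt`, keeping the version with the best training F1."""
    best_prompt, best_f1 = prompt, -1.0
    for _ in range(max_iters):
        n_tp, fps, fns = 0, [], []
        for image, gts in train_set:
            tp, fp, fn = match_detections(detect(prompt, image), gts)
            n_tp += len(tp)
            fps.extend(fp)
            fns.extend(fn)
        denom = 2 * n_tp + len(fps) + len(fns)
        f1 = 2 * n_tp / denom if denom > 0 else 1.0  # nothing to find, nothing predicted
        if f1 > best_f1:
            best_prompt, best_f1 = prompt, f1
        if not fps and not fns:
            break  # assumed convergence test: no remaining errors on the training set
        # Ask the optimizer to exclude the false positives and include the false negatives.
        prompt = rewrite(prompt, fps, fns)
    return best_prompt, best_f1
```

In practice `rewrite` would also need the image crops or descriptions behind each error so that the optimizer MLLM can see what to exclude or include; passing only boxes keeps the sketch self-contained. Returning the best-scoring prompt rather than the last one guards against a rewrite that makes training performance worse.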