Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes
Nirmal Elamon, Rouzbeh Davoudi
TL;DR
This study compares traditional CNNs, zero-shot multi-modal LLMs, and fine-tuned multi-modal LLMs on artificial text overlay detection in a low-data regime. The authors implement and evaluate four variants, including a lightweight Phi-3.5 Vision model fine-tuned on fewer than 1,000 labeled images. The fine-tuned LLM improves accuracy by up to 36% and matches or surpasses the CNN baselines, achieving high precision and competitive recall with minimal supervision. The work offers practical guidance for adapting multi-modal transformers in low-resource settings, and the fine-tuning code is shared on GitHub to support reproducibility and reuse in related tasks.
Abstract
The field of object detection and understanding is rapidly evolving, driven by advances in both traditional CNN-based models and emerging multi-modal large language models (LLMs). While CNNs like ResNet and YOLO remain highly effective for image-based tasks, recent transformer-based LLMs introduce new capabilities such as dynamic context reasoning, language-guided prompts, and holistic scene understanding. However, when used out of the box, LLMs fall short of their full potential and often perform suboptimally on specialized visual tasks. In this work, we conduct a comprehensive comparison of fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs on the challenging task of artificial text overlay detection in images. A key contribution of our study is demonstrating that LLMs can be effectively fine-tuned on very limited data (fewer than 1,000 images) to achieve up to 36% accuracy improvement, matching or surpassing CNN-based baselines that typically require orders of magnitude more data. By exploring how language-guided models can be adapted for precise visual understanding with minimal supervision, our work contributes to the broader effort of bridging vision and language and offers insights into efficient cross-modal learning strategies. These findings highlight the adaptability and data efficiency of LLM-based approaches for real-world object detection tasks and provide actionable guidance for applying multi-modal transformers in low-resource visual settings. To support continued progress in this area, we have made the code used to fine-tune the models available on GitHub, enabling future improvements and reuse in related applications.
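The comparison described above amounts to scoring each model variant's predictions on a binary task (overlay present or absent). As a minimal sketch, assuming labels are encoded as 0/1 with 1 meaning an artificial text overlay is present, accuracy, precision, and recall could be computed as follows. The function name, the dataclass, and the example labels are illustrative assumptions and are not taken from the authors' code or results.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float


def binary_metrics(y_true, y_pred):
    """Score 0/1 predictions against 0/1 ground truth (1 = overlay present)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # overlays correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean images wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # overlays missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean images correctly passed

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return BinaryMetrics(accuracy, precision, recall)


# Hypothetical labels for illustration only; not data from the study.
example_true = [1, 1, 1, 0, 0, 0, 1, 0]
example_pred = [1, 1, 0, 0, 0, 1, 1, 0]
example_scores = binary_metrics(example_true, example_pred)
```

Running the same function over the predictions of each variant (CNN, zero-shot LLM, fine-tuned LLM) on a shared held-out set would give directly comparable numbers.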
