When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection
Rabin Dulal, Lihong Zheng, Muhammad Ashad Kabir
TL;DR
This work tackles automated cattle muzzle detection with a zero-shot framework that uses Grounding DINO and natural language prompts to localize muzzles without task-specific training. Evaluating seven zero-shot detection (ZSD) models across three diverse datasets, the authors show that Grounding DINO achieves the strongest performance (mAP@0.5 ≈ 0.768 and mAP@[0.50:0.95] ≈ 0.340) while remaining annotation-free, which points to strong cross-domain generalization and practical deployability. The study also contrasts zero-shot guidance with supervised YOLO methods, showing that the labeled-data requirements of supervised models are substantial, whereas the proposed approach enables scalable, low-cost muzzle localization across breeds and environments. Overall, the results indicate a promising direction for livestock monitoring in which prompt-driven vision-language models reduce labeling burdens and support rapid adaptation to new contexts, with caveats around prompt design and potential language-model-related errors.
Abstract
Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models such as YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to depend heavily on their training data, which limits their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. The approach uses natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.
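The reported figures, mAP@0.5 and mAP@[0.50:0.95], both rest on the intersection over union (IoU) between predicted and ground-truth muzzle boxes. The sketch below is a minimal, simplified illustration of how such scores can be computed for a single class. It is not the authors' evaluation code: the function names (`iou`, `average_precision`, `ap_50_95`), the greedy matching rule, and the toy boxes are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Detection = Tuple[str, float, Box]        # (image_id, confidence, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds: List[Detection],
                      gts: Dict[str, List[Box]],
                      iou_thr: float) -> float:
    """Single-class AP at one IoU threshold (area under the precision envelope)."""
    total_gt = sum(len(boxes) for boxes in gts.values())
    if total_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    tp_flags = []
    # Visit detections from most to least confident (assumed greedy matching).
    for img, _score, box in sorted(preds, key=lambda d: d[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts.get(img, [])):
            if matched[img][j]:
                continue  # each ground-truth box can be claimed only once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched[img][best_j] = True
            tp_flags.append(1)
        else:
            tp_flags.append(0)
    # Cumulative precision and recall after each detection.
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / total_gt)
    # Make precision non-increasing from left to right, then integrate over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def ap_50_95(preds: List[Detection], gts: Dict[str, List[Box]]) -> float:
    """Mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)


# Toy data (hypothetical): one muzzle per image, three predicted boxes.
gts = {"img1": [(10, 10, 50, 50)], "img2": [(20, 20, 60, 60)]}
preds = [
    ("img1", 0.9, (10, 10, 50, 50)),      # perfect overlap
    ("img2", 0.8, (20, 20, 60, 49)),      # IoU 0.725 with the ground truth
    ("img2", 0.3, (100, 100, 120, 120)),  # false positive, no overlap
]

print("AP@0.5        :", average_precision(preds, gts, 0.5))
print("AP@[0.50:0.95]:", ap_50_95(preds, gts))
```

On this toy data the looser 0.5 threshold accepts both overlapping boxes, while thresholds of 0.75 and above reject the second one, so the averaged score is lower than AP@0.5. The same effect would explain the gap between the paper's two reported values: a detector can localize the muzzle roughly in most images yet lose credit as the required overlap tightens.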
