AnimalFormer: Multimodal Vision Framework for Behavior-based Precision Livestock Farming
Ahmed Qazi, Taha Razzaq, Asim Iqbal
TL;DR
AnimalFormer is a non-invasive framework for livestock behavior analytics that chains GroundingDINO for detection, HQ-SAM for segmentation, and ViTPose for pose estimation into an end-to-end multimodal pipeline. It applies UMAP and clustering to derive gait, grazing, and resting patterns from a sheep dataset, revealing a relationship between gait diversity and speed and an effect of social context on grazing. The authors report that the approach applies across species and video resolutions, making it a practical tool for welfare monitoring, productivity optimization, and data-driven farm management. Because it works from video alone, it provides fine-grained behavioral analytics without tagging animals, which supports scalable precision livestock farming.
Abstract
We introduce a multimodal vision framework for precision livestock farming that combines the GroundingDINO, HQ-SAM, and ViTPose models. This integrated suite enables comprehensive behavioral analytics from video data without invasive animal tagging. GroundingDINO generates bounding boxes around livestock, HQ-SAM segments individual animals within these boxes, and ViTPose estimates key body points, which supports posture and movement analysis. Demonstrated on a sheep dataset with grazing, running, sitting, standing, and walking activities, our framework extracts activity and grazing patterns, interaction dynamics, and detailed postural evaluations. Because it is applicable across species and video resolutions, the framework supports non-invasive livestock monitoring for activity detection, counting, health assessment, and posture analysis. It enables data-driven farm management aimed at improving animal welfare and productivity through AI-based behavioral understanding.
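The three-stage flow described in the abstract (detect, then segment and estimate pose per detected animal) can be sketched as a small composition of callables. This is only an illustrative sketch, not the authors' implementation: the names `AnimalObservation`, `analyze_frame`, `detect`, `segment`, and `estimate_pose` are hypothetical, the model wrappers are passed in as plain functions rather than real GroundingDINO, HQ-SAM, or ViTPose calls, and the choice to feed the bounding box to the pose stage is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Illustrative types (assumptions, not taken from the paper).
Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
Mask = List[List[bool]]                    # per-pixel foreground mask for one animal
Keypoints = List[Tuple[float, float]]      # (x, y) body keypoints for one animal


@dataclass
class AnimalObservation:
    """Everything extracted for one animal in one frame."""
    box: Box
    mask: Mask
    keypoints: Keypoints


def analyze_frame(
    frame: object,
    detect: Callable[[object, str], Sequence[Box]],       # stand-in for a GroundingDINO wrapper
    segment: Callable[[object, Box], Mask],               # stand-in for an HQ-SAM wrapper
    estimate_pose: Callable[[object, Box], Keypoints],    # stand-in for a ViTPose wrapper
    prompt: str = "sheep",
) -> List[AnimalObservation]:
    """Detect animals matching the text prompt, then segment and pose each one."""
    observations = []
    for box in detect(frame, prompt):
        observations.append(
            AnimalObservation(
                box=box,
                mask=segment(frame, box),
                keypoints=estimate_pose(frame, box),
            )
        )
    return observations
```

Under these assumptions, the number of returned observations gives a per-frame animal count, and the per-animal keypoints are the kind of input that downstream posture and activity analysis would consume.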
