AnimalFormer: Multimodal Vision Framework for Behavior-based Precision Livestock Farming

Ahmed Qazi; Taha Razzaq; Asim Iqbal

TL;DR

AnimalFormer addresses non-invasive, comprehensive livestock behavior analytics by fusing GroundingDINO for detection, HQ-SAM for segmentation, and ViTPose for pose estimation into an end-to-end multimodal pipeline. It applies UMAP and clustering to derive gait, grazing, and resting patterns from a sheep dataset, revealing relationships between gait diversity and speed, and the impact of social context on grazing. The approach demonstrates broad applicability across species and video resolutions, offering a practical tool for welfare monitoring, productivity optimization, and data-driven farm management. By integrating state-of-the-art vision transformers, AnimalFormer enables fine-grained behavioral analytics without tagging, supporting scalable precision livestock farming.

Abstract

We introduce a multimodal vision framework for precision livestock farming built on the GroundingDINO, HQ-SAM, and ViTPose models. This integrated suite enables comprehensive behavioral analytics from video data without invasive animal tagging. GroundingDINO generates bounding boxes around livestock, while HQ-SAM segments individual animals within these boxes. ViTPose estimates key body points, facilitating posture and movement analysis. Demonstrated on a sheep dataset with grazing, running, sitting, standing, and walking activities, our framework extracts activity and grazing patterns, interaction dynamics, and detailed postural evaluations. Applicable across species and video resolutions, the framework supports non-invasive livestock monitoring for activity detection, counting, health assessments, and posture analyses. It enables data-driven farm management aimed at improving animal welfare and productivity through AI-based behavioral understanding.
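The data flow described in the abstract (detect, then segment within each box, then estimate keypoints) can be sketched as a small composition of three callables. This is a minimal illustration, not the authors' implementation: `analyze_frame`, `AnimalObservation`, the `min_score` threshold, and the signatures of `detect`, `segment`, and `estimate_pose` are all hypothetical placeholders, and passing the detection box to the pose estimator is an assumption. In practice each callable would wrap the corresponding model (GroundingDINO, HQ-SAM, ViTPose).

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
Keypoint = Tuple[float, float, float]     # (x, y, confidence)


@dataclass
class AnimalObservation:
    """Per-animal output for one frame (hypothetical container)."""
    box: Box
    mask: Any                  # whatever the segmenter returns, e.g. a binary array
    keypoints: List[Keypoint]


def analyze_frame(
    frame: Any,
    detect: Callable[[Any], List[Tuple[Box, float]]],
    segment: Callable[[Any, Box], Any],
    estimate_pose: Callable[[Any, Box], List[Keypoint]],
    min_score: float = 0.3,    # assumed threshold, not taken from the paper
) -> List[AnimalObservation]:
    """Run detection, then segmentation and pose estimation per detected box."""
    observations = []
    for box, score in detect(frame):
        if score < min_score:
            continue           # skip low-confidence detections
        mask = segment(frame, box)
        keypoints = estimate_pose(frame, box)
        observations.append(AnimalObservation(box, mask, keypoints))
    return observations
```

Keeping the three stages behind plain callables means any stage can be swapped or stubbed out for testing without touching the loop, and the number of returned observations per frame doubles as an animal count.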

Paper Structure (17 sections, 7 equations, 3 figures)


Introduction
Related work
Methods
Dataset
AnimalFormer
Pose estimation
Animal detection
Animal segmentation
Behavior analytics
Running videos
Grazing videos
Resting videos
Results & Discussion
Indirect relationship between animals' gait diversity and running speed
Enhanced grazing activity in isolation
...and 2 more sections

Figures (3)

Figure 1: Our integrated analysis framework for behavioral understanding of sheep. The framework combines ViTPose for pose estimation and GroundingDINO for contextual understanding with the high-quality instance segmentation of HQ-SAM. By fusing these components, our pipeline provides precise keypoints and segmentation masks, essential for in-depth ethological studies. The block diagram gives an overview of the data flow and processing steps within our end-to-end solution.
Figure 2: Qualitative outputs of our framework depicting various behaviors of sheep. Top row: Running frames with keypoints and bounding boxes illustrating movement dynamics. Middle row: Grazing frames showcasing sheep engaged in feeding with extracted poses and segmentation masks highlighting the focus areas. Bottom row: Resting frames with segmentation masks delineating individual sheep in a state of repose. Each behavior is analyzed through a combination of visual features extracted from images.
Figure 3: Behavior Analytics. A. UMAP representation of the unique gait patterns of the sheep extracted from their running videos. B. Speed profile of the animals extracted at the start, midpoint, and end of their running videos. C. The unique clusters of sheep gait patterns. D. Spread of the patterns of a single animal across different clusters. E. Grazing activity of sheep in herds versus alone. F. UMAP representation of the unique resting patterns of animals in herds versus alone.
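A speed profile like the one in panel B could be approximated from per-frame bounding boxes alone. The sketch below is one possible approach under stated assumptions, not the method used in the paper: it tracks the box center of a single animal, measures speed in pixels per second, and averages over a short window at the start, midpoint, and end of the clip. The function names, the `window` size, and the use of box centers rather than keypoints are all hypothetical choices.

```python
import math
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


def box_center(box: Box) -> Tuple[float, float]:
    """Center point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def mean_speed(centers: Sequence[Tuple[float, float]], fps: float) -> float:
    """Average speed in pixels per second over consecutive center points."""
    if len(centers) < 2:
        return 0.0
    distance = sum(math.dist(a, b) for a, b in zip(centers, centers[1:]))
    return distance * fps / (len(centers) - 1)


def speed_profile(boxes: List[Box], fps: float, window: int = 5) -> Dict[str, float]:
    """Speed at the start, midpoint, and end of a clip for one tracked animal."""
    if window < 2 or len(boxes) < window:
        raise ValueError("need window >= 2 and at least `window` frames")
    centers = [box_center(b) for b in boxes]
    n = len(centers)
    mid = (n - window) // 2   # first frame of the centered window
    return {
        "start": mean_speed(centers[:window], fps),
        "mid": mean_speed(centers[mid:mid + window], fps),
        "end": mean_speed(centers[n - window:], fps),
    }
```

Speeds here are in image space; converting to real-world units would need camera calibration or a known reference length, which this sketch does not attempt.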
