Automated Re-Identification of Holstein-Friesian Cattle in Dense Crowds

Phoenix Yu, Tilo Burghardt, Andrew W Dowsey, Neill W Campbell

TL;DR

This work tackles the challenge of re-identifying Holstein-Friesian cattle in densely packed groups, where traditional detectors fail due to outline-breaking (dazzle) coat patterns. It introduces a detect-segment-identify pipeline that combines OWLv2 for open-vocabulary localisation with SAM2 for instance segmentation, followed by unsupervised contrastive learning to enable Re-ID without manual labeling. On nine days of farm CCTV data, the approach achieves up to 98.93% localisation accuracy and 94.82% Re-ID accuracy, substantially outperforming oriented bounding-box and SAM-based detection baselines. The detection and segmentation stages use pre-trained models without fine-tuning, which supports transfer across cameras and farms. Code and data are released for reproducibility, highlighting the practical potential of automated, labeling-free livestock surveillance in real-world settings.

Abstract

Holstein-Friesian detection and re-identification (Re-ID) methods capture individuals well when targets are spatially separate. However, existing approaches, including YOLO-based species detection, break down when cows group closely together. This is particularly prevalent for species that have outline-breaking coat patterns. To boost both effectiveness and transferability in this setting, we propose a new detect-segment-identify pipeline that leverages the Open-Vocabulary Weight-free Localisation and the Segment Anything models as pre-processing stages alongside Re-ID networks. To evaluate our approach, we publish a collection of nine days of CCTV data filmed on a working dairy farm. Our methodology overcomes detection breakdown in dense animal groupings, achieving 98.93% accuracy. This significantly outperforms current oriented bounding box-driven and SAM species detection baselines, with accuracy improvements of 47.52% and 27.13%, respectively. We show that unsupervised contrastive learning can build on this to yield 94.82% Re-ID accuracy on our test data. Our work demonstrates that Re-ID in crowded scenarios is both practical and reliable in working farm settings with no manual intervention. Code and dataset are provided for reproducibility.

Paper Structure (20 sections, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Existing Detector Performance on Crowded Cows. Detection results from YOLO-v11x [khanam2024yolov11] (left) and RetinaNet [yu2025holstein] (right) on crowded cows without fine-tuning. The RetinaNet species detector module from MultiCamCows2024, pretrained on sparsely grouped cows, underperforms when detecting individuals in crowds. The YOLO-v11x model pretrained on the COCO dataset fails to detect or localise any cow.
  • Figure 2: Bounding Box Results from Text-Prompted Detection Models. Inference results of OWLv2 [minderer2023scaling] and GroundingDINO [liu2024grounding] on our cow herds with no prior fine-tuning. GroundingDINO, although much faster at inference than OWLv2, fails to identify individuals when given the same text prompt.
  • Figure 3: Instance Segmentation on Crowded Holstein-Friesians. Instance segmentations from our pipeline (left) and a segmentation failure of GroundedSAM [ren2024grounded] used without prior fine-tuning (right). GroundedSAM and our pipeline are, in principle, both capable of producing polygon masks from text inputs. Yet GroundedSAM segments one single region and fails to separate and outline the torso of each cow, whether it is given the same text prompt 'cow' as our pipeline or variants such as 'a cow', 'individual cow', or 'cows'. Other prompts, such as 'all instances of cows among a group' or '30 individual cows in the densely packed group', yield no detections from OWLv2 and no segmentations from GroundedSAM.
  • Figure 4: Processing Pipeline Overview. Our framework consists of two main components. For data acquisition (left) of both mask segmentations and IDs, we initiate tracking and relocate targets at 1 s intervals with the OWLv2-enhanced and baseline SAM2 models. For ground truth image set generation, we manually label bounding boxes and use them as the initial input for tracking. We then run two evaluations. First, we apply IoU matching between the SAM2 binary masks and the ground truth to evaluate localisation performance (upper-right). Second, after combining the binary masks with their original frames to obtain RGB masks, we apply an unsupervised contrastive learning (UCL) module to evaluate Re-ID performance (lower-right).
  • Figure 5: Automated Data Acquisition. Our prompt-based approach to generating segmentation masks has two stages. First (red), a text prompt and the video frames are passed to a pre-trained OWLv2 [minderer2023scaling] model to obtain axis-aligned bounding boxes. Then (purple), the same video frames and the bounding box outputs are passed to a pre-trained SAM2 [kirillov2023segment] model to yield segmentations.
  • ...and 8 more figures
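
The Figure 4 caption describes two steps: IoU matching between predicted binary masks and ground truth, and combining binary masks with their frames to obtain RGB masks. The following is a minimal sketch of both, not the authors' implementation. The function names, the greedy one-to-one matching strategy, and the 0.5 IoU threshold are all assumptions made for illustration. Masks are plain nested lists of 0/1 values to keep the sketch dependency-free; in practice a NumPy boolean array would be the usual choice.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]              # binary mask: rows of 0/1 values
Pixel = Tuple[int, int, int]        # one RGB pixel
Frame = List[List[Pixel]]           # RGB frame: rows of pixels


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of the same size."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    return inter / union if union else 0.0


def match_masks(pred: List[Mask], truth: List[Mask],
                threshold: float = 0.5) -> Dict[int, int]:
    """Greedy one-to-one matching (assumed strategy, hypothetical threshold).

    Returns a dict mapping predicted-mask index -> ground-truth index,
    taking the highest-IoU pairs first and ignoring pairs below threshold.
    """
    pairs = sorted(
        ((mask_iou(p, t), i, j)
         for i, p in enumerate(pred)
         for j, t in enumerate(truth)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_truth = set()
    for iou, i, j in pairs:
        if iou < threshold:
            break  # all remaining pairs are below the threshold
        if i in matches or j in used_truth:
            continue  # each mask may be matched at most once
        matches[i] = j
        used_truth.add(j)
    return matches


def apply_mask(frame: Frame, mask: Mask) -> Frame:
    """Keep RGB pixels inside the binary mask and zero out the rest."""
    return [
        [px if m else (0, 0, 0) for px, m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


if __name__ == "__main__":
    truth = [[[1, 1, 0, 0], [1, 1, 0, 0]],
             [[0, 0, 1, 1], [0, 0, 1, 1]]]
    pred = [[[0, 0, 1, 1], [0, 0, 0, 1]],
            [[1, 1, 0, 0], [1, 0, 0, 0]]]
    print(match_masks(pred, truth))
    print(apply_mask([[(10, 20, 30), (40, 50, 60)]], [[1, 0]]))
```

Under this sketch, a localisation score could be derived as the fraction of ground-truth masks that receive a match, although the paper's exact accuracy definition is not given in the text above.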