Cross-Camera Cow Identification via Disentangled Representation Learning

Runcheng Wang; Yaru Chen; Guiguo Zhang; Honghua Jiang; Yongliang Qiao

TL;DR

The paper tackles cross-camera cow identification in uncontrolled farm environments by introducing a disentangled representation framework grounded in Subspace Identification Guarantee (SIG) theory. It models image formation with four latent subspaces, $z=\{z_1,z_2,z_3,z_4\}$: $z_3$ captures intrinsic identity features, $z_1$ and $z_2$ encode camera and viewpoint interference, and $z_4$ covers universal appearance. A conditional decision mechanism combines $z_2$, $z_3$, and the camera index $u$ for robust identity inference. The approach uses trunk-focused extraction via YOLOv11n, a variational encoder trained with an evidence lower bound (ELBO) objective and a camera-predictor constraint to enforce subspace disentanglement, and a class-aware centroid alignment to mitigate label distribution shifts across cameras. On the multi-view CCCI60 dataset, the method achieves an average accuracy of 86.0%, compared with 51.9% for a Source-only baseline and 79.8% for a strong domain-adaptation baseline (iMSDA), demonstrating improved cross-camera generalization in realistic farming settings. These results suggest a principled, physics-informed route to reliable non-contact livestock monitoring, with practical implications for scalable, multi-node smart farming deployments.
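
The conditional decision mechanism and the class-aware centroid alignment can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The subspace widths, the one-hot camera encoding, the concatenation step, the squared-distance loss, and every function name are assumptions made for illustration; the summary only says that $z_2$, $z_3$, and $u$ are combined and that class centroids are aligned across cameras.

```python
from typing import Dict, Hashable, List, Sequence

# Hypothetical subspace widths; the source does not state the dimensions.
SUBSPACE_DIMS = {"z1": 2, "z2": 2, "z3": 4, "z4": 2}


def split_latent(z: Sequence[float]) -> Dict[str, List[float]]:
    """Slice a flat latent vector into named, non-overlapping subspaces."""
    if len(z) != sum(SUBSPACE_DIMS.values()):
        raise ValueError("latent length does not match subspace widths")
    parts, start = {}, 0
    for name, width in SUBSPACE_DIMS.items():
        parts[name] = list(z[start:start + width])
        start += width
    return parts


def one_hot(index: int, size: int) -> List[float]:
    """Encode the camera index u as a one-hot vector (an assumed encoding)."""
    if not 0 <= index < size:
        raise ValueError("camera index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def classifier_input(z: Sequence[float], u: int, num_cameras: int) -> List[float]:
    """Concatenate z2, z3 and the camera code; z1 and z4 are left out."""
    parts = split_latent(z)
    return parts["z2"] + parts["z3"] + one_hot(u, num_cameras)


def class_centroids(features: Sequence[Sequence[float]],
                    labels: Sequence[Hashable]) -> Dict[Hashable, List[float]]:
    """Mean feature vector per identity label."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for feat, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def centroid_alignment_loss(src_feats, src_labels, tgt_feats, tgt_labels) -> float:
    """Mean squared distance between same-class centroids of two cameras.

    Target labels would typically be pseudo-labels when the target camera is
    unlabeled; that choice is an assumption, not taken from the source.
    """
    src = class_centroids(src_feats, src_labels)
    tgt = class_centroids(tgt_feats, tgt_labels)
    shared = set(src) & set(tgt)
    if not shared:
        return 0.0
    total = sum(
        sum((a - b) ** 2 for a, b in zip(src[y], tgt[y])) for y in shared
    )
    return total / len(shared)
```

For a 10-dimensional latent vector and three cameras, `classifier_input(z, u=1, num_cameras=3)` returns a 9-element vector: two values from $z_2$, four from $z_3$, and a three-element camera code. The alignment loss is zero when each identity has the same centroid under both cameras and grows as the per-class centroids drift apart.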

Abstract

Precise identification of individual cows is a fundamental prerequisite for comprehensive digital management in smart livestock farming. While existing animal identification methods excel in controlled, single-camera settings, they face severe challenges regarding cross-camera generalization. When models trained on source cameras are deployed to new monitoring nodes characterized by divergent illumination, backgrounds, viewpoints, and heterogeneous imaging properties, recognition performance often degrades dramatically. This limits the large-scale application of non-contact technologies in dynamic, real-world farming environments. To address this challenge, this study proposes a cross-camera cow identification framework based on disentangled representation learning. This framework leverages the Subspace Identifiability Guarantee (SIG) theory in the context of bovine visual recognition. By modeling the underlying physical data generation process, we designed a principle-driven feature disentanglement module that decomposes observed images into multiple orthogonal latent subspaces. This mechanism effectively isolates stable, identity-related biometric features that remain invariant across cameras, thereby substantially improving generalization to unseen cameras. We constructed a high-quality dataset spanning five distinct camera nodes, covering heterogeneous acquisition devices and complex variations in lighting and angles. Extensive experiments across seven cross-camera tasks demonstrate that the proposed method achieves an average accuracy of 86.0%, significantly outperforming the Source-only Baseline (51.9%) and the strongest cross-camera baseline method (79.8%). This work establishes a subspace-theoretic feature disentanglement framework for collaborative cross-camera cow identification, offering a new paradigm for precise animal monitoring in uncontrolled smart farming environments.

Paper Structure (26 sections, 15 equations, 6 figures, 3 tables)

Introduction
Materials and methods
Cross-Camera Observation Nodes
Datasets: CowId60
Problem formulation and physical generative logic
Methodology overview
Automated cattle trunk extraction
Implementation of visual feature disentanglement and conditional identification architecture
Deep feature mapping and probabilistic modeling
Logic of disentanglement constraints on latent subspaces
Construction of the conditional joint decision-making mechanism
Optimization objectives
Identity classification module ($\mathcal{L}_{\text{identity}}$)
Feature Disentanglement Module ($\mathcal{L}_{\text{disentangle}}$)
Feature Alignment Module ($\mathcal{L}_{\text{align}}$)
...and 11 more sections

Figures (6)

Figure 1: Layout of the cross-camera data acquisition system and scene examples. The central schematic illustrates the complete route of cows traveling from Barn No. 6 to the milking parlor and returning, where numbered circles ① through ⑥ indicate the positions of the six cameras. The surrounding panels ① through ⑥ display real-world scene images captured by the cameras at the corresponding positions. The specific layout is as follows: ① Barn Exit; ② and ③ Walking Aisle; ④ and ⑤ Milking Parlor Entrance; and ⑥ Milking Parlor Exit.
Figure 2: Visualization of samples from the cross-camera cow identification dataset. The figure displays representative images of four different cows (ID: 001, 024, 048, 060).
Figure 3: Physical generative graph. Observed variables: camera index $u$, individual identity label $y$, and cow image $x$. Latent variables: $z = \{z_1, z_2, z_3, z_4\}$.
Figure 4: Schematic overview of the proposed framework
Figure 5: Architecture of the proposed model
...and 1 more figure
