CarcassFormer: An End-to-end Transformer-based Framework for Simultaneous Localization, Segmentation and Classification of Poultry Carcass Defect

Minh Tran; Sang Truong; Arthur F. A. Fernandes; Michael T. Kidd; Ngan Le

TL;DR

CarcassFormer addresses automated quality assessment of poultry carcasses by unifying localization, segmentation, and defect classification in an end-to-end Transformer framework. The method employs a four-component design (Backbone, Pixel Decoder, Mask-Attention Transformer Decoder, and Instance Mask/Classification Predictor) with multi-scale features and deformable attention to produce high-fidelity masks and defect labels. On the CarcassDefect dataset, CarcassFormer consistently surpasses CNN-based and Transformer-based baselines across detection, segmentation, and defect classification metrics (AP, AP@50, AP@75, AP@95) for both single- and multi-carcass frames, while maintaining competitive computational efficiency. The work demonstrates practical impact for real-world poultry processing by enabling accurate, scalable carcass quality assessment and highlights avenues for future enhancements such as video-based tracking and finer-grained defect taxonomy.

Abstract

In the food industry, assessing the quality of poultry carcasses during processing is a crucial step. This study proposes an effective approach for automating the assessment of carcass quality without requiring skilled labor or inspector involvement. The proposed system is based on machine learning (ML) and computer vision (CV) techniques, enabling automated defect detection and carcass quality assessment. To this end, an end-to-end framework called CarcassFormer is introduced. It is built upon a Transformer-based architecture designed to effectively extract visual representations while simultaneously detecting, segmenting, and classifying poultry carcass defects. Our proposed framework is capable of analyzing imperfections resulting from production and transport welfare issues, as well as processing plant stunner, scalder, picker, and other equipment malfunctions. To benchmark the framework, a dataset of 7,321 images was initially acquired, which contained both single and multiple carcasses per image. In this study, the performance of the CarcassFormer system is compared with other state-of-the-art (SOTA) approaches for classification, detection, and segmentation tasks. Through extensive quantitative experiments, our framework consistently outperforms existing methods, demonstrating remarkable improvements across various evaluation metrics such as AP, AP@50, and AP@75. Furthermore, the qualitative results highlight the strengths of CarcassFormer in capturing fine details, including feathers, and accurately localizing and segmenting carcasses with high precision. To facilitate further research and collaboration, the pre-trained model and source code of CarcassFormer are available for research purposes at: https://github.com/UARK-AICV/CarcassFormer.
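The metrics AP@50 and AP@75 named above can be made concrete with a small sketch. It assumes the common COCO-style convention, in which a predicted mask counts as a true positive when its intersection-over-union (IoU) with a ground-truth mask reaches the stated threshold (0.50 or 0.75). The helper names `mask_iou` and `is_true_positive` and the toy masks are illustrative only and are not taken from the CarcassFormer code base.

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks have no overlap to measure; treat that case as IoU 0.
    return inter / union if union else 0.0


def is_true_positive(pred, gt, threshold):
    """Assumed matching rule: a prediction matches when IoU >= threshold."""
    return mask_iou(pred, gt) >= threshold


# Toy ground-truth mask: a 3x2 block of foreground pixels.
gt = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
# Toy prediction that misses the bottom row of the block.
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

print(mask_iou(pred, gt))                # 4 shared pixels / 6 pixels in the union
print(is_true_positive(pred, gt, 0.50))  # matches at the looser threshold
print(is_true_positive(pred, gt, 0.75))  # rejected at the stricter threshold
```

Under this convention the same prediction can count toward AP@50 but not AP@75, which is why the stricter threshold rewards tighter masks.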

Paper Structure (29 sections, 13 equations, 10 figures, 9 tables)

Introduction
Related Work
Image Segmentation
CNN-based instance segmentation
Transformer in Computer Vision
Materials and Methods
Data Collection
Data Annotation
Proposed method
Backbone
Pixel Decoder
Multi Scale Transformer Encoder
Per-pixel Embeddings Module
Mask-attention Transformer Decoder
Mask Predictor
...and 14 more sections

Figures (10)

Figure 1: Top: Overall flowchart of our proposed CarcassFormer consisting of four components: 1. Network Backbone; 2. Pixel Decoder; 3. Mask-Attention Transformer Decoder; 4. Instance Mask Class Prediction. Bottom: Details of third component Mask-Attention Transformer Decoder.
Figure 2: Camera setup for data collection. A black curtain is hung behind the shackle to provide contrast with the carcasses. A camera is placed to capture the carcasses against the black curtain.
Figure 3: An overview image of the shooting location. The black curtain is hung on the wall behind the shackle.
Figure 4: Illustrations of the collected data, which comprise (a) a single carcass/instance per image/frame; (b) multiple carcasses/instances per image/frame; (c) carcasses/instances at different scales/resolutions. The carcasses show various defects such as tearing of skin, feathers, and broken/disjointed bones.
Figure 5: An illustration of the data annotation process. Each frame from the recorded video is annotated with bounding boxes for detection, masks for segmentation, and defect labels for classification.
...and 5 more figures
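The three annotation types named in the Figure 5 caption (bounding box, mask, defect label) can be pictured as one record per carcass instance. The sketch below is a hypothetical data structure, not the CarcassDefect file format: the class name `CarcassAnnotation`, the field names, and the label string are assumptions made for illustration. It also shows one possible way to keep the box consistent with the mask, by deriving the tight box from the mask.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CarcassAnnotation:
    """Hypothetical per-instance record: a binary mask plus a defect label."""
    mask: List[List[int]]   # rows of 0/1 pixels, 1 = carcass
    defect_label: str       # illustrative label text, not the dataset's taxonomy

    def bbox(self) -> Tuple[int, int, int, int]:
        """Tight box (x_min, y_min, x_max, y_max) in inclusive pixel coordinates."""
        ys = [y for y, row in enumerate(self.mask) if any(row)]
        xs = [x for row in self.mask for x, value in enumerate(row) if value]
        if not xs:
            raise ValueError("mask contains no foreground pixels")
        return (min(xs), min(ys), max(xs), max(ys))


# One frame may hold several instances; here a single toy instance.
frame_annotations = [
    CarcassAnnotation(
        mask=[
            [0, 1, 1, 0],
            [0, 1, 1, 0],
            [0, 0, 0, 0],
        ],
        defect_label="feathers",
    )
]

for annotation in frame_annotations:
    print(annotation.defect_label, annotation.bbox())
```

Deriving the box from the mask is a design choice of this sketch; an annotation tool may instead store boxes and masks independently.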
