From Semantic To Instance: A Semi-Self-Supervised Learning Approach

Keyhan Najafian, Farhad Maleki, Lingling Jin, Ian Stavness

TL;DR

This work designs GLMask, an image-mask representation for the model to focus on shape, texture, and pattern while minimizing its dependence on color features, and develops a pipeline to generate semantic segmentation and then transform it into instance-level segmentation.

Abstract

Instance segmentation is essential for applications such as automated monitoring of plant health, growth, and yield. However, creating large-scale datasets with pixel-level annotations of each object instance requires extensive effort, which restricts the use of deep learning in these areas. This challenge is more significant in images with densely packed, self-occluded objects, which are common in agriculture. To address this challenge, we propose a semi-self-supervised learning approach that requires minimal manual annotation to develop a high-performing instance segmentation model. We design GLMask, an image-mask representation that lets the model focus on shape, texture, and pattern while minimizing its dependence on color features. We develop a pipeline to generate semantic segmentation and then transform it into instance-level segmentation. The proposed approach substantially outperforms conventional instance segmentation models, establishing a state-of-the-art wheat head instance segmentation model with mAP@50 of 98.5%. Additionally, we assessed the proposed methodology on the general-purpose Microsoft COCO dataset, achieving a significant performance improvement of over 12.6% mAP@50. This shows that the utility of our proposed approach extends beyond precision agriculture to other domains, specifically those with similar data characteristics.
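To make the GLMask idea concrete, the following is a minimal sketch of how such a three-channel input could be assembled, based on the Figure 1 caption (grayscale, the L channel of CIELAB, and a semantic segmentation mask, concatenated). The function names `srgb_to_lightness` and `build_glmask`, the Rec. 601 grayscale weights, the channel order, and the scaling to [0, 1] are all assumptions made for illustration; the source does not specify them.

```python
import numpy as np


def srgb_to_lightness(rgb):
    """Return CIELAB L* (0..100) for an sRGB image with values in [0, 1]."""
    # Undo the sRGB transfer curve to obtain linear-light values.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Relative luminance Y under a D65 white point (Yn = 1).
    y = linear @ np.array([0.2126, 0.7152, 0.0722])
    delta = 6.0 / 29.0
    f = np.where(y > delta ** 3, np.cbrt(y), y / (3 * delta ** 2) + 4.0 / 29.0)
    return 116.0 * f - 16.0


def build_glmask(rgb_uint8, semantic_mask):
    """Stack grayscale, L channel, and a binary mask into an (H, W, 3) array.

    Assumption: channel order (gray, L, mask) and [0, 1] scaling are
    illustrative choices, not taken from the paper.
    """
    rgb = rgb_uint8.astype(np.float64) / 255.0
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # Rec. 601 luma (assumed)
    lightness = srgb_to_lightness(rgb) / 100.0     # rescale L* to [0, 1]
    mask = (semantic_mask > 0).astype(np.float64)  # foreground = 1
    return np.stack([gray, lightness, mask], axis=-1)
```

Because none of the three channels carries hue or saturation, a model fed this array can only rely on intensity structure and the mask prior, which is the color-independence the abstract describes.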

Paper Structure

This paper contains 12 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Schematic overview of the proposed Semi-Self-Supervised Learning Framework. The methodology proceeds in three logical stages: (Left) GLMask Representation: To reduce color dependency and enforce structural learning, the input RGB image is decomposed into Grayscale and L-channel (of CIELAB) components, then concatenated with a Semantic Segmentation Mask prior. (Right) Two-Stage Training Strategy: The framework proceeds with (1) Synthetic Pre-training, where a YOLOv9 model is trained on a large-scale synthetic dataset generated via a cut-and-paste pipeline (Sec. \ref{subsec:data_synthesis}); followed by (2) Domain Adaptation, where the model weights are transferred (horizontal dotted arrow) and fine-tuned on a rotation-augmented real dataset to bridge the domain gap (Sec. \ref{sebsec:data_generation}).
  • Figure 2: Examples of video frames from the Harvest-ready growth stage ($\textit{W}_{tr}$, left) and Heading-complete stage ($\textit{W}_{va}$, right) wheat field domains, along with their corresponding human-annotated instance-specific masks (bottom row).
  • Figure 3: Examples illustrating the diversity within our human-annotated test sets. The top row shows samples from our single-domain test set $\textit{LateStage}_{te}$. The bottom row shows samples from our $18$-domain test set $\textit{GHD}_{te}$.
  • Figure 4: Detailed workflow of the data synthesis pipeline for instance segmentation, adapted from \cite{najafian2023semi}. The process is strictly partitioned into Train Split (left, yellow) and Valid Split (right, green) streams to ensure distinct data distributions, utilizing wheat heads categorized into Harvest-ready ($\textit{W}_{tr}$) and Heading-complete ($\textit{W}_{va}$) stages along with their corresponding diversified background frames ($\textit{B}_{tr}$ and $\textit{B}_{va}$). The pipeline integrates three source components: background frames ($B$), synthetic fake foregrounds ($F$), and extracted real foregrounds ($R$). The synthesis proceeds sequentially: (1) Fake Object Overlay: Backgrounds are first populated with synthetic wheat templates to create a dense underlying texture; (2) Real Object Overlay & Mask Generation: Real wheat instances are superimposed using density-dependent selection logic (using smaller heads in blue dashed boxes for $\ge 50$ objects and larger heads in red dashed boxes for fewer objects) to simulate varying altitudes, while simultaneously generating pixel-perfect instance masks; and (3) Data Augmentation: The composited images undergo photometric and geometric transformations to produce the final synthetic training ($\textit{SYN}_{tr}$) and validation ($\textit{SYN}_{va}$) datasets.
  • Figure 5: Prediction performance of RoAModel across the $18$ domains of the $\textit{GHD}_{te}$ test set.
  • ...and 5 more figures
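
The real-object overlay step in the Figure 4 caption (density-dependent selection of head sizes, plus instance masks produced as a by-product of pasting) can be sketched as follows. This is a simplified illustration, not the authors' pipeline: `choose_pool` and `paste_instances` are hypothetical names, each head is assumed to be a `(patch, alpha)` pair with a boolean `alpha`, every patch is assumed to fit inside the background, and the fake-object overlay and augmentation stages are omitted. Only the threshold of 50 objects comes from the caption.

```python
import numpy as np


def choose_pool(n_objects, small_heads, large_heads, threshold=50):
    """Pick smaller heads for dense scenes (>= threshold), larger ones otherwise."""
    return small_heads if n_objects >= threshold else large_heads


def paste_instances(background, heads, positions):
    """Paste heads onto a copy of the background and record instance ids.

    heads:     list of (patch, alpha); patch is (h, w, 3), alpha is (h, w) bool.
    positions: list of (top, left) corners, one per head.
    Returns the composite image and an int mask (0 = background, k = k-th head).
    """
    image = background.copy()
    instance_mask = np.zeros(background.shape[:2], dtype=np.int32)
    pairs = zip(heads, positions)
    for instance_id, ((patch, alpha), (top, left)) in enumerate(pairs, start=1):
        h, w = alpha.shape
        region = (slice(top, top + h), slice(left, left + w))
        # Later pastes overwrite earlier ones, so occluded pixels belong
        # to the head that ends up on top.
        image[region][alpha] = patch[alpha]
        instance_mask[region][alpha] = instance_id
    return image, instance_mask
```

Because the mask is written at paste time, each synthetic image comes with exact instance labels at no annotation cost, and the corresponding semantic mask is simply `instance_mask > 0`.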