A Semi-Self-Supervised Approach for Dense-Pattern Video Object Segmentation

Keyhan Najafian, Farhad Maleki, Lingling Jin, Ian Stavness

TL;DR

This work tackles dense-pattern video object segmentation in agriculture, where many small, occluded wheat heads make pixel-level VOS difficult. The authors present a semi-self-supervised DVOS framework that uses a diffusion-augmented UNet with a two-stage data strategy: synthetic data for pretraining and pseudo-labeled real videos for fine-tuning. The approach generalizes well, reaching a Dice score of 0.79 on a drone-captured external test set for wheat-head segmentation, and is more robust to noisy initial masks than state-of-the-art QFRM-VOS methods like XMem. The method reduces annotation costs and can extend to other crops or dense-pattern domains such as crowd analysis or microscopy.

Abstract

Video object segmentation (VOS) -- predicting pixel-level regions for objects within each frame of a video -- is particularly challenging in agricultural scenarios, where videos of crops include hundreds of small, dense, and occluded objects (stems, leaves, flowers, pods) that sway and move unpredictably in the wind. Supervised training is the state of the art for VOS, but it requires large sets of pixel-accurate, human-annotated videos, which are costly to produce when each frame contains many densely packed objects. To address these challenges, we propose a semi-self-supervised spatiotemporal approach for dense VOS (DVOS) using a diffusion-based method with multi-task (reconstruction and segmentation) learning. We train the model first with synthetic data that mimics the camera and object motion of real videos and then with pseudo-labeled videos. We evaluate our DVOS method for wheat head segmentation on a diverse set of videos (handheld, drone-captured, different field locations, and different growth stages -- spanning from Boot-stage to Wheat-mature and Harvest-ready). Despite using only a few manually annotated video frames, the proposed approach yielded a high-performing model, achieving a Dice score of 0.79 when tested on a drone-captured external test set. While our method was evaluated on wheat head segmentation, it can be extended to other crops and domains, such as crowd analysis or microscopic image analysis.
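
For readers unfamiliar with the metric reported above, the following is a minimal sketch of the standard Dice coefficient between a predicted and a ground-truth binary mask, Dice = 2|P ∩ T| / (|P| + |T|). It is not taken from the paper: the function name `dice_score`, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, and the authors' exact evaluation protocol (thresholding, per-frame or per-video averaging) is not described in this summary.

```python
from typing import Sequence


def dice_score(pred: Sequence[Sequence[int]],
               target: Sequence[Sequence[int]]) -> float:
    """Standard Dice coefficient for two binary masks of equal shape.

    Masks are given as rows of 0/1 values (any truthy value counts as 1).
    Returning 1.0 for two empty masks is an assumed convention.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")

    intersection = 0  # pixels that are foreground in both masks
    pred_sum = 0      # foreground pixels in the prediction
    target_sum = 0    # foreground pixels in the ground truth

    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, target_row):
            p_bit, t_bit = int(bool(p)), int(bool(t))
            intersection += p_bit & t_bit
            pred_sum += p_bit
            target_sum += t_bit

    denominator = pred_sum + target_sum
    if denominator == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / denominator


# Toy 2x3 example: 2 overlapping pixels, 3 predicted, 3 in the ground truth.
example_pred = [[1, 1, 0],
                [0, 1, 0]]
example_target = [[1, 0, 0],
                  [0, 1, 1]]
example_dice = dice_score(example_pred, example_target)  # 2*2 / (3+3)
```

In the toy example the two masks share 2 foreground pixels out of 3 each, so the score is 4/6, roughly 0.67. A score of 0.79 therefore indicates substantial but imperfect overlap between predicted and annotated wheat-head pixels.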

Paper Structure (18 sections, 4 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Architectural choices for segmentation: (A) Conventional image segmentation, segmenting the query image [ronneberger2015u]; (B) Multi-task learning, jointly learning image reconstruction and image segmentation [ghanbari2024semi]; (C) Conventional VOS, using reference frames, the first reference frame mask, and the query frame as model inputs to segment the query image [Wang2021SwiftNetRV]; (D) Our approach, which uses reference frames within a multi-task framework to predict both the subsequent query frame and its corresponding mask, eliminating the need for reference frame annotations or the query frame itself as model input.
  • Figure 2: Overview of the proposed UNet-style [ronneberger2015u] architecture for DVOS.
  • Figure 3: Synthetic videos show color-augmented fake wheat heads and masked real heads, isolated from the canopy and stem, overlaid on uniform background frames with random head-level movements. Real videos depict actual wheat fields, capturing normal motion in dense wheat spikes within the field.
  • Figure 4: Representative examples of the test sets: the dashed orange box highlights the diversity of the pseudo-labeled dataset, and the blue boxes show manually annotated test set examples with overlaid annotations.
  • Figure 5: Performance visualization of the $\mathbb{VM}_{\text{Pseu}}$ model across different test sets. The first two columns depict masks overlaid on the corresponding images. Each block contains four different samples, which are consistently arranged within the same grid cell across the Ground Truth, Mask Prediction, and Image Prediction columns.
  • ...and 7 more figures