A Semi-Self-Supervised Approach for Dense-Pattern Video Object Segmentation
Keyhan Najafian, Farhad Maleki, Lingling Jin, Ian Stavness
TL;DR
This work tackles dense-pattern video object segmentation (VOS) in agriculture, where many small, occluded wheat heads make pixel-level segmentation difficult. The authors present a semi-self-supervised dense-VOS (DVOS) framework that uses a diffusion-augmented UNet with a two-stage data strategy: pretraining on synthetic data, then fine-tuning on pseudo-labeled real videos. The approach generalizes well, reaching a Dice score of 0.79 on a drone-captured external test set for wheat-head segmentation, and is more robust to noisy initial masks than state-of-the-art QFRM-VOS methods such as XMem. The method reduces annotation costs and can be extended to other crops or to dense-pattern domains such as crowd analysis or microscopy.
Abstract
Video object segmentation (VOS) -- predicting pixel-level regions for objects within each frame of a video -- is particularly challenging in agricultural scenarios, where videos of crops include hundreds of small, dense, and occluded objects (stems, leaves, flowers, pods) that sway and move unpredictably in the wind. Supervised training is the state of the art for VOS, but it requires large sets of pixel-accurate, human-annotated videos, which are costly to produce when each frame contains many densely packed objects. To address these challenges, we propose a semi-self-supervised spatiotemporal approach for dense-VOS (DVOS) that uses a diffusion-based method with multi-task (reconstruction and segmentation) learning. We train the model first with synthetic data that mimics the camera and object motion of real videos and then with pseudo-labeled videos. We evaluate our DVOS method on wheat head segmentation using a diverse set of videos (handheld, drone-captured, different field locations, and different growth stages -- spanning from Boot-stage to Wheat-mature and Harvest-ready). Despite using only a few manually annotated video frames, the proposed approach yielded a high-performing model, achieving a Dice score of 0.79 when tested on a drone-captured external test set. While our method was evaluated on wheat head segmentation, it can be extended to other crops and domains, such as crowd analysis or microscopic image analysis.
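For readers unfamiliar with the reported metric, the Dice score between a predicted binary mask P and a ground-truth mask G is 2|P ∩ G| / (|P| + |G|). The snippet below is a minimal illustrative sketch of that standard definition, not the authors' evaluation code; the function name `dice_score`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient between two binary masks of the same shape.

    Hypothetical helper for illustration only; the paper's own
    evaluation code may differ (e.g. in thresholding or averaging).
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")

    # Pixels marked as foreground in both masks.
    intersection = np.logical_and(pred, target).sum()
    # Sum of the foreground pixel counts of the two masks.
    total = pred.sum() + target.sum()

    # Assumed convention: two empty masks agree perfectly.
    if total == 0:
        return 1.0
    return float(2.0 * intersection / total)


# Toy 2x3 masks: 3 predicted pixels, 3 true pixels, 2 of them shared.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixels, so the reported 0.79 indicates substantial but imperfect overlap with the manual annotations.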
