Improving Apple Object Detection with Occlusion-Enhanced Distillation

Liang Geng

TL;DR

This work designs an occlusion-enhanced dataset and proposes a multi-scale knowledge distillation strategy, which aids the student network in learning more generalized feature expressions that are less affected by the noise of individual image occlusions.

Abstract

Apples growing in natural environments are often heavily occluded by leaves and branches, which significantly increases the risk of false detections and makes object detection more challenging. To address this issue, we introduce a technique called "Occlusion-Enhanced Distillation" (OED). This approach uses occlusion information to regularize the learning of semantically aligned features on occluded datasets and employs Exponential Moving Average (EMA) to enhance training stability. Specifically, we first design an occlusion-enhanced dataset that combines Grounding DINO and SAM to extract occluding elements such as leaves and branches from each sample, creating occlusion examples that reflect the natural growth state of fruits. We then propose a multi-scale knowledge distillation strategy in which the student network takes images with added occlusions as input, while the teacher network takes images without natural occlusions. With this setup, the student learns from the teacher by aligning semantic and local features across scales, which narrows the feature distance between occluded and non-occluded targets and improves the robustness of object detection. Lastly, to improve the stability of the student network, we introduce the EMA strategy, which helps the student network learn more generalized feature representations that are less affected by the noise of individual image occlusions. Extensive comparative experiments show that our method significantly outperforms current state-of-the-art techniques.
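
The EMA step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which parameters are averaged or what decay is used, so the function name `ema_update`, the dictionary-of-lists parameter layout, and the decay values below are all assumptions chosen for illustration, not the authors' implementation.

```python
def ema_update(ema_params, new_params, decay=0.999):
    """Blend new parameter values into a running exponential moving average.

    Both arguments map a parameter name to a flat list of floats
    (a hypothetical layout used only for this sketch).
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {
        name: [
            decay * old + (1.0 - decay) * new
            for old, new in zip(ema_params[name], new_params[name])
        ]
        for name in ema_params
    }


# Toy usage: two updates toward the same target with a large step (decay=0.5).
ema = {"w": [0.0, 0.0]}
for step_params in ({"w": [1.0, 2.0]}, {"w": [1.0, 2.0]}):
    ema = ema_update(ema, step_params, decay=0.5)
print(ema)  # {'w': [0.75, 1.5]}
```

A decay close to 1 makes the averaged parameters change slowly, which is the usual reason EMA smooths out the effect of any single noisy training sample.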

Paper Structure (15 sections, 13 equations, 5 figures, 2 tables)

This paper contains 15 sections, 13 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Method Overview: We employ Multi-scale Feature Distillation to address the training challenges posed by substantial morphological differences in targets under severe occlusion. Specifically, Multi-scale Feature Distillation is divided into two parts: Candidate Distillation and Occlusion-Aware Distillation. Through weighted processes, relevant information is distilled from multi-scale features for knowledge transfer. Additionally, to enhance training stability, we utilize exponential moving averages.
  • Figure 2: Data Occlusion Augmentation: Using annotated masks from the dataset, occluders are extracted with Grounding DINO [52] and SAM [43] and then used to occlude the targets.
  • Figure 3: Model Architecture: We employ Deformable DETR [72] as the backbone model to extract multi-scale features. Based on these multi-scale features, we implement two levels of knowledge distillation: Candidate Distillation and Occlusion-Aware Distillation. The feature weights used in these distillation processes are derived from the detector's query responses and the fine-grained matching of occlusions.
  • Figure 4: The adapter employs a Vision Transformer (ViT) [86] structure consisting of two self-attention layers, designed for further feature embedding.
  • Figure 5: Qualitative Analysis: Our designed multi-scale distillation framework effectively enhances the model's object detection capabilities under various occlusion conditions, while also demonstrating good robustness to different lighting conditions. The first row of the figure shows the original images, and the second row displays the corresponding detection results.
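
The weighted multi-scale feature distillation described in the captions for Figures 1 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's loss: the weighted mean-squared error, the function name `weighted_feature_distill_loss`, and the nested-list feature layout are all hypothetical. In the paper the weights come from the detector's query responses and from fine-grained occlusion matching; here they are simply passed in.

```python
def weighted_feature_distill_loss(student_feats, teacher_feats, weights):
    """Weighted MSE between student and teacher features, averaged over scales.

    student_feats / teacher_feats: one entry per scale; each scale is a list
    of positions, and each position is a list of channel values.
    weights: one entry per scale; each is a list of per-position weights
    (assumed non-negative, supplied by the caller in this sketch).
    """
    total = 0.0
    for s_scale, t_scale, w_scale in zip(student_feats, teacher_feats, weights):
        weight_sum = sum(w_scale)
        if weight_sum == 0:
            continue  # no position at this scale contributes to the loss
        scale_loss = 0.0
        for s_vec, t_vec, w in zip(s_scale, t_scale, w_scale):
            # Mean squared error over channels at one spatial position.
            sq_err = sum((s - t) ** 2 for s, t in zip(s_vec, t_vec)) / len(s_vec)
            scale_loss += w * sq_err
        total += scale_loss / weight_sum
    return total / len(student_feats)


# Toy usage: two scales with different resolutions and channel counts.
student = [[[1.0, 1.0], [0.0, 0.0]], [[2.0]]]
teacher = [[[0.0, 0.0], [0.0, 0.0]], [[0.0]]]
weights = [[1.0, 1.0], [0.5]]
loss = weighted_feature_distill_loss(student, teacher, weights)
print(loss)  # 2.25
```

Normalizing by the sum of weights at each scale keeps scales with many positions from dominating the total, which is one plausible design choice among several.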