Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object Detection

Martin Aubard, László Antal, Ana Madureira, Erika Ábrahám

TL;DR

This work tackles efficient underwater object detection for side-scan sonar imagery by integrating a Visual Transformer into YOLOX (forming YOLOX-ViT) and applying knowledge distillation to train compact models. It introduces an offline KD pipeline using a larger YOLOX-L teacher to improve YOLOX-Nano variants and presents a dedicated SWDD dataset for wall detection. Empirical results demonstrate that the ViT layer boosts detection accuracy in underwater environments, while KD substantially reduces false positives in wall detection, enabling smaller, deployment-friendly models. The study advances onboard AUV perception by combining architectural enhancements with distillation while offering a public dataset and code, with future work focusing on dataset expansion, online augmentation in KD, and safety verification for real-world use.

Abstract

In this paper, we present YOLOX-ViT, a novel object detection model, and investigate the efficacy of knowledge distillation for model size reduction without sacrificing performance. Focused on underwater robotics, our research addresses key questions about the viability of smaller models and the impact of the visual transformer layer in YOLOX. Furthermore, we introduce a new side-scan sonar image dataset and use it to evaluate our object detector's performance. Results show that knowledge distillation effectively reduces false positives in wall detection. Additionally, the introduced visual transformer layer significantly improves object detection accuracy in the underwater environment. The source code for knowledge distillation in YOLOX-ViT is available at https://github.com/remaro-network/KD-YOLOX-ViT.

Paper Structure

This paper contains 13 sections, 22 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Knowledge distillation: blue nodes indicate the teacher, while yellow ones show the student. The combined hard and soft loss is used to train the student network.
  • Figure 2: The YOLOX-ViT model with its three main parts: (i) the backbone for feature extraction, (ii) the neck, a feature pyramid network (FPN) connecting the backbone and the head, and (iii) the (decoupled) head for the bounding box regression and classification tasks. The Visual Transformer (ViT) layer sits between the backbone and the neck and is represented by the red arrow; the basic YOLOX architecture, in contrast, is represented by the blue dotted line. For further explanation of the individual blocks, see Figure 3 in the appendix.
  • Figure 3: Additional description of the YOLOX-ViT architecture. Each module of Figure 2 is explained at a lower level.
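
The Figure 1 caption mentions a combined hard and soft loss for training the student. As a rough illustration only, the sketch below shows the generic classification form of such a loss: cross-entropy against the ground-truth label (hard) plus a temperature-softened KL divergence against the teacher's outputs (soft). This is not the paper's exact formulation, which targets YOLOX detection outputs; the function names, the weighting factor `alpha`, and the `temperature` value are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_loss(student_logits, label):
    """Cross-entropy between the student prediction and the true label."""
    probs = softmax(student_logits)
    return -math.log(probs[label])


def soft_loss(student_logits, teacher_logits, temperature):
    """KL divergence from the softened teacher to the softened student."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of hard and soft terms (alpha and temperature are assumed).

    The soft term is scaled by temperature**2, a common convention that keeps
    its gradient magnitude comparable to the hard term.
    """
    hard = hard_loss(student_logits, label)
    soft = soft_loss(student_logits, teacher_logits, temperature)
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


if __name__ == "__main__":
    # Hypothetical two-class logits; class 1 is the ground truth.
    student = [0.2, 1.1]
    teacher = [-1.0, 3.0]
    print(distillation_loss(student, teacher, label=1))
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the hard term remains, so the teacher acts purely as an additional training signal and never replaces the ground-truth labels.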