Intelligent Fish Detection System with Similarity-Aware Transformer

Shengchen Li, Haobo Zuo, Changhong Fu, Zhiyong Wang, Zhiqiang Xu

TL;DR

This work tackles the challenge of real-time, accurate fish detection in water-land transfer, where dense and visually similar fish impede conventional detectors. It introduces FishViT, a lightweight Transformer-based system with a ResNet18 backbone, a similarity-aware multi-level encoder (parallel branches with pooling positional encoding), and a soft-threshold attention mechanism that denoises backgrounds and sharpens fish edges. Key contributions include the STAttention module, a parallel multi-level encoder for robust multi-scale features, and an IoU-aware decoder for end-to-end detection, validated on a newly collected 85-sequence benchmark achieving up to 82 FPS and AP$_{50}$ of 94.7, with strong performance under challenging attributes. The proposed system demonstrates practical viability for on-device, high-speed fish detection, enabling downstream tasks such as tracking, sizing, and counting in real-world water-land transfer setups.

Abstract

Fish detection in water-land transfer contributes significantly to the fishery industry. However, manual fish detection through crowd collaboration is inefficient, expensive, and insufficiently accurate. To further enhance water-land transfer efficiency, improve detection accuracy, and reduce labor costs, this work designs a new type of lightweight, plug-and-play edge intelligent vision system that automatically conducts fast fish detection with a high-speed camera. Moreover, a novel similarity-aware vision Transformer for fast fish detection (FishViT) is proposed to identify, onboard, every single fish in a dense group of similar fish. Specifically, a novel similarity-aware multi-level encoder is developed to enhance multi-scale features in parallel, thereby yielding discriminative representations for fish of varying sizes. Additionally, a new soft-threshold attention mechanism is introduced, which not only effectively eliminates background noise from images but also accurately captures both the edge details and the overall features of similar-looking fish. To establish a benchmark, 85 challenging high-framerate, high-resolution video sequences are collected from real fish water-land transfer scenarios. Exhaustive evaluation on this challenging benchmark demonstrates the robustness and effectiveness of FishViT, which runs at over 80 FPS. Tests in real working scenarios validate the practicality of the proposed method. The code and demo video are available at https://github.com/vision4robotics/FishViT.

Paper Structure (18 sections, 9 equations, 7 figures, 2 tables)


Figures (7)

  • Figure 1: Fish detection in water-land transfer and the proposed fish detection method (FishViT). The picture in the dotted box shows the traditional manual detection mode, which is inefficient and costly. The proposed intelligent detection system significantly improves production efficiency, enhances accuracy, and reduces costs. Fresh fish slide quickly from the pipeline into vehicles for sale, and the intelligent detection device, mounted on a stand, detects them from above the pipeline. FishViT is embedded into the intelligent detection system to realize high-speed fish detection. FishViT mainly consists of three components: Backbone, Similarity-aware multi-level encoder, and Decoder & Head. The Similarity-aware multi-level encoder is composed of three parallel soft-threshold attention (STAttention) modules.
  • Figure 2: Overview of the proposed FishViT. The components from left to right are the Backbone, the Similarity-aware multi-level encoder, and the Decoder & Head. The last three feature maps extracted by the backbone are fed to the similarity-aware multi-level encoder, which has a parallel structure; each encoder branch contains pooling positional encoding (PPE) and soft-threshold attention (STAttention). Finally, the feature maps of the branches, after multi-level aggregation, are fed into the decoder & head for detection. Best viewed in color.
  • Figure 3: Detailed workflow of STAttention. The input feature maps are processed by STAttention after pooling positional encoding. The result of $\mathrm{Softmax}(\mathbf{K})^{T}\mathbf{V}$ is used to generate a soft threshold, and the linear attention result is filtered by this soft threshold. Specifically, the soft-threshold mechanism consists of two main modules: ABS & GAP and FS. ABS stands for Absolute Value and GAP for Global Average Pooling, while the FS module refers to the operations within the dashed box. Finally, $\mathbf{V}$ is added to the filtered result through a shortcut connection, and the sum is used as the final output feature map. This soft-threshold mechanism effectively suppresses background noise, enabling precise identification of the edges of each individual fish within dense and visually similar groups. Best viewed in color.
  • Figure 4: Visualization of the confidence maps of the Baseline and the proposed FishViT. FishViT effectively reduces background interference and focuses on the detailed representation of the fish, coping with extremely complex situations such as high-speed tumbling, flowing water, and high density. Best viewed in color.
  • Figure 5: Comparison of FishViT results with other SOTA detectors. YOLOv8-M produces duplicate detection boxes due to post-processing, while other Transformer-based detectors exhibit missed detections. FishViT achieves the best results. Best viewed in color.
  • ...and 2 more figures
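
The STAttention data flow described in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the caption does not spell out the FS module, so a single sigmoid gate (`scale_logit`, a hypothetical parameter) stands in for it; normalizing $\mathbf{Q}$ with a softmax over channels is likewise an assumption borrowed from common linear-attention formulations; and the function names are illustrative only.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_threshold(x, tau):
    """Shrink x toward zero by tau; entries with |x| <= tau become zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def st_attention_sketch(q, k, v, scale_logit=0.0):
    """Hypothetical sketch of soft-threshold linear attention.

    q, k, v: arrays of shape (n_tokens, d).
    scale_logit: stand-in for the FS module (assumption, not from the paper).
    """
    # Global context Softmax(K)^T V, shape (d, d); softmax runs over tokens.
    context = softmax(k, axis=0).T @ v
    # Linear attention result, shape (n_tokens, d). The softmax on Q is assumed.
    attn = softmax(q, axis=1) @ context
    # ABS & GAP: absolute value followed by global average pooling per channel.
    pooled = np.abs(context).mean(axis=0)
    # FS stand-in: a sigmoid gate in (0, 1) scales the pooled magnitude.
    gate = 1.0 / (1.0 + np.exp(-scale_logit))
    tau = gate * pooled
    # Filter the attention result, then add V through the shortcut.
    return soft_threshold(attn, tau) + v


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
out = st_attention_sketch(q, k, v)
```

The point of the sketch is the ordering of operations: the threshold is derived from the same $\mathrm{Softmax}(\mathbf{K})^{T}\mathbf{V}$ context that drives the attention output, so small-magnitude responses are zeroed while larger ones are only shrunk, and the shortcut keeps the original $\mathbf{V}$ features available.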