
ASF-YOLO: A Novel YOLO Model with Attentional Scale Sequence Fusion for Cell Instance Segmentation

Ming Kang, Chee-Ming Ting, Fung Fung Ting, Raphaël C. -W. Phan

TL;DR

ASF-YOLO is an Attentional Scale Sequence Fusion based You Only Look Once (YOLO) framework that combines spatial and scale features for accurate and fast cell instance segmentation. It introduces a Channel and Position Attention Mechanism (CPAM) to integrate the Scale Sequence Feature Fusion (SSFF) and Triple Feature Encoder (TFE) modules.

Abstract

We propose a novel Attentional Scale Sequence Fusion based You Only Look Once (YOLO) framework (ASF-YOLO) which combines spatial and scale features for accurate and fast cell instance segmentation. Built on the YOLO segmentation framework, we employ the Scale Sequence Feature Fusion (SSFF) module to enhance the multi-scale information extraction capability of the network, and the Triple Feature Encoder (TFE) module to fuse feature maps of different scales to increase detailed information. We further introduce a Channel and Position Attention Mechanism (CPAM) to integrate both the SSFF and TFE modules, which focuses on informative channels and spatial position-related small objects for improved detection and segmentation performance. Experimental validations on two cell datasets show remarkable segmentation accuracy and speed of the proposed ASF-YOLO model. It achieves a box mAP of 0.91, a mask mAP of 0.887, and an inference speed of 47.3 FPS on the 2018 Data Science Bowl dataset, outperforming the state-of-the-art methods. The source code is available at https://github.com/mkang315/ASF-YOLO.
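The abstract says the TFE module fuses feature maps of different scales. A minimal NumPy sketch of that idea follows; it is not the authors' implementation. It assumes three (C, H, W) maps whose sizes differ by a factor of two, brings them to the middle resolution, and concatenates them along the channel axis. The function names, the pooling mix used for downsampling, and nearest-neighbour upsampling are all illustrative assumptions.

```python
import numpy as np


def downsample_2x(x: np.ndarray) -> np.ndarray:
    """Halve H and W of a (C, H, W) map.

    Averages 2x2 max pooling and 2x2 average pooling (an assumed choice).
    H and W must be even.
    """
    c, h, w = x.shape
    blocks = x.reshape(c, h // 2, 2, w // 2, 2)  # split into 2x2 tiles
    return 0.5 * (blocks.max(axis=(2, 4)) + blocks.mean(axis=(2, 4)))


def upsample_2x(x: np.ndarray) -> np.ndarray:
    """Double H and W of a (C, H, W) map by nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def fuse_three_scales(large: np.ndarray, medium: np.ndarray, small: np.ndarray) -> np.ndarray:
    """Hypothetical TFE-style fusion: align to the medium size, then concatenate channels."""
    aligned = [downsample_2x(large), medium, upsample_2x(small)]
    return np.concatenate(aligned, axis=0)


# Toy feature maps: 4 channels each at 8x8, 4x4, and 2x2.
large = np.random.rand(4, 8, 8)
medium = np.random.rand(4, 4, 4)
small = np.random.rand(4, 2, 2)
fused = fuse_three_scales(large, medium, small)
print(fused.shape)
```

The fused map keeps the spatial size of the middle input and stacks the channels of all three, so a following convolution can mix information across scales.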

Paper Structure (20 sections, 10 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Different cell images (left) and their feature maps (right).
  • Figure 2: An abridged general view of the framework of YOLOv5 v7.0 and YOLOv8 for the segmentation task. P1, P2, P3, P4, P5 represent different levels of features output by the backbone. The head clips the segmentation masks to the detected bounding boxes, which ensures that no mask extends beyond its box. The neck is intentionally abridged because its structure differs between YOLOv5 and YOLOv8.
  • Figure 3: The overview of the proposed ASF-YOLO model. The framework mainly comprises the Scale Sequence Feature Fusion (SSFF) module, the Triple Feature Encoder (TFE) module, and the Channel and Position Attention Mechanism (CPAM), built on the CSPDarkNet backbone and the YOLO head. The CSP and Concat modules come from YOLOv5.
  • Figure 4: The structure of the TFE module. $C$ represents the number of channels, and $S$ represents the feature map size. Each triple feature encoder module uses three feature maps of different sizes as input.
  • Figure 5: The structure of the CPAM module. It contains channel and position attention networks. $w$ and $h$ represent width and height, respectively. $\otimes$ denotes the Hadamard product.
  • ...and 2 more figures
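
The Figure 2 caption notes that the head clips each segmentation mask to its detected bounding box. A minimal NumPy sketch of that clipping step follows; it is not the repository's code. The function name `crop_masks_to_boxes` and the pixel-coordinate box format `(x1, y1, x2, y2)`, with the right and bottom edges exclusive, are assumptions made for illustration.

```python
import numpy as np


def crop_masks_to_boxes(masks: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Zero out every mask value that lies outside its own bounding box.

    masks: (N, H, W) array of mask values.
    boxes: (N, 4) array of (x1, y1, x2, y2) in pixels; x2 and y2 are exclusive (assumed).
    """
    _, h, w = masks.shape
    # Reshape each box coordinate to (N, 1, 1) so it broadcasts over H and W.
    x1, y1, x2, y2 = (boxes[:, i][:, None, None] for i in range(4))
    cols = np.arange(w)[None, None, :]  # column index, shape (1, 1, W)
    rows = np.arange(h)[None, :, None]  # row index, shape (1, H, 1)
    inside = (cols >= x1) & (cols < x2) & (rows >= y1) & (rows < y2)
    return masks * inside


# One 4x4 mask of ones and a box covering the central 2x2 region.
masks = np.ones((1, 4, 4))
boxes = np.array([[1, 1, 3, 3]])
cropped = crop_masks_to_boxes(masks, boxes)
print(cropped[0])
```

Broadcasting builds the inside-the-box test for all N masks at once, which avoids a Python loop over detections.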