Octave-YOLO: Cross frequency detection network with octave convolution

Sangjune Shin, Dongkun Shin

TL;DR

Octave-YOLO targets real-time object detection on embedded devices by processing high-resolution inputs without prohibitive compute. It introduces CFPNet, which splits feature maps into high-frequency and low-frequency components and concentrates heavy operations on the low-frequency branch, while preserving detail through frequency-domain fusion. The framework integrates a Frequency Separable Block (FSB), Frequency Separable Self-Attention (FSSA), and depthwise separable downsampling, yielding substantial reductions in parameters and FLOPs with minimal accuracy loss, and up to 1.56x faster high-resolution latency. Evaluated on COCO, Octave-YOLO matches YOLOv8 performance with far lower computational cost, demonstrating strong practicality for embedded real-time detection and complex scenes.

Abstract

Despite the rapid advancement of object detection algorithms, processing high-resolution images on embedded devices remains a significant challenge. In theory, the fully convolutional architecture used in current real-time object detectors can handle any input resolution. In practice, the computation required to process high-resolution images makes these detectors impractical for real-time applications. To address this issue, real-time object detection models typically downsample the input image for inference, leading to a loss of detail and decreased accuracy. In response, we developed Octave-YOLO, designed to process high-resolution images in real time within the constraints of embedded systems. We achieved this by introducing the cross frequency partial network (CFPNet), which divides the input feature map into a low-resolution, low-frequency section and a high-resolution, high-frequency section. This configuration allows complex operations such as convolution bottlenecks and self-attention to be conducted exclusively on the low-resolution feature maps while preserving the details in the high-resolution maps. Notably, this approach not only dramatically reduces the computational cost of convolution but also allows attention modules, which are typically challenging to use in real-time applications, to be integrated at minimal additional cost. Additionally, we have incorporated depthwise separable convolution into the core building blocks and downsampling layers to further decrease latency. Experimental results show that Octave-YOLO matches the performance of YOLOv8 while significantly reducing computational demands. For example, at 1080x1080 resolution, Octave-YOLO-N is 1.56 times faster than YOLOv8, achieving nearly the same accuracy on the COCO dataset with approximately 40 percent fewer parameters and FLOPs.
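
The cost argument in the abstract can be illustrated with a back-of-the-envelope count of multiply-accumulates (MACs). The sketch below is not taken from the paper: the layer sizes (64 channels, 3x3 kernel, 136x136 feature map) are illustrative assumptions, and the low-frequency branch is assumed to run at half the spatial resolution, as in the standard octave convolution setting.

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k


def dw_separable_macs(h, w, c_in, c_out, k):
    """MACs of a depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Illustrative sizes (assumptions, not values from the paper).
H, W, C, K = 136, 136, 64, 3

full_res = conv_macs(H, W, C, C, K)
half_res = conv_macs(H // 2, W // 2, C, C, K)  # assumed half-resolution low-frequency branch
separable = dw_separable_macs(H, W, C, C, K)

print(f"standard conv, full resolution: {full_res:,} MACs")
print(f"standard conv, half resolution: {half_res:,} MACs ({half_res / full_res:.2f}x)")
print(f"depthwise separable, full res:  {separable:,} MACs ({separable / full_res:.3f}x)")
```

Halving the spatial resolution cuts the cost of the same convolution to one quarter, and the depthwise separable form costs a fraction 1/c_out + 1/k^2 of the standard convolution. These ratios describe single layers only; they do not by themselves reproduce the whole-model savings reported above.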

Paper Structure (21 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Comparisons with others in terms of FLOPs vs AP (left) and model size vs AP (right) trade-offs.
  • Figure 2: The detailed structure of the octave convolution. The two green paths represent the updating of information for the high and low frequency feature maps, respectively, while the two red paths represent the mutual exchange of information between the two different frequencies.
  • Figure 3: Comparison between the original cross stage partial network (CSPNet) and our proposed cross frequency partial network (CFPNet).
  • Figure 4: (a) The original C2f building block used in YOLOv8. (b) The frequency separable block (FSB). (c) The frequency separable self-attention module (FSSA).
  • Figure 5: Comparing image inference and visualization between YOLOv8-S and Octave-YOLO-S.
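
The four paths described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration of generic octave convolution, not the authors' implementation: it assumes 1x1 (pointwise) kernels for brevity, average pooling for the high-to-low path, nearest-neighbour upsampling for the low-to-high path, and arbitrary channel counts and map sizes. All function and variable names are hypothetical.

```python
import numpy as np


def pointwise_conv(x, weight):
    """1x1 convolution: x has shape (C_in, H, W), weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def avg_pool2(x):
    """2x2 average pooling with stride 2 on a (C, H, W) array (H and W must be even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2 on a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def octave_conv(x_high, x_low, w_hh, w_ll, w_hl, w_lh):
    """Return (y_high, y_low) for one octave convolution step."""
    # Intra-frequency updates (the two green paths in Figure 2).
    y_hh = pointwise_conv(x_high, w_hh)
    y_ll = pointwise_conv(x_low, w_ll)
    # Inter-frequency exchange (the two red paths in Figure 2).
    y_hl = pointwise_conv(avg_pool2(x_high), w_hl)  # high -> low: pool, then convolve
    y_lh = upsample2(pointwise_conv(x_low, w_lh))   # low -> high: convolve, then upsample
    return y_hh + y_lh, y_ll + y_hl


# Illustrative shapes (assumptions): 6 high-frequency channels at 8x8,
# 2 low-frequency channels at half that resolution.
rng = np.random.default_rng(0)
c_high, c_low = 6, 2
x_high = rng.standard_normal((c_high, 8, 8))
x_low = rng.standard_normal((c_low, 4, 4))
w_hh = rng.standard_normal((c_high, c_high))
w_ll = rng.standard_normal((c_low, c_low))
w_hl = rng.standard_normal((c_low, c_high))
w_lh = rng.standard_normal((c_high, c_low))

y_high, y_low = octave_conv(x_high, x_low, w_hh, w_ll, w_hl, w_lh)
print(y_high.shape, y_low.shape)
```

Each output keeps the resolution of its own branch, so any heavy operation applied only to the low-frequency output works on a map with one quarter of the spatial positions.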