3A-YOLO: New Real-Time Object Detectors with Triple Discriminative Awareness and Coordinated Representations

Xuecheng Wu, Junxiao Xue, Liangyu Fu, Jiayu Nie, Danlei Huang, Xinyi Yin

TL;DR

The paper addresses the gap in real-time object detection where YOLO heads lack a unified hierarchical attention framework. It introduces 3A-YOLO, featuring a TDA-YOLO Module that provides triple discriminative awareness (scale, spatial, and task) and leverages Coordinate Attention to learn coordinated inter-channel and positional representations, complemented by neck improvements and training tricks. Three scaled variants (X, Tiny, Nano) are proposed to fit diverse deployments, and extensive experiments on COCO and VOC demonstrate improved speed-accuracy trade-offs and robust gains from ablations. The work offers practical implications for deploying faster, more accurate real-time detectors and suggests extending the approach to newer backbones such as YOLOv7 in future work.

Abstract

Recent research on real-time object detectors (e.g., the YOLO series) has demonstrated the effectiveness of attention mechanisms for improving model performance. Nevertheless, existing methods do not deploy hierarchical attention mechanisms in a unified way to construct a more discriminative YOLO head enriched with more useful intermediate features. To close this gap, this work leverages multiple attention mechanisms to hierarchically enhance the triple discriminative awareness of the YOLO detection head and to complementarily learn coordinated intermediate representations, resulting in a new series of detectors denoted 3A-YOLO. Specifically, we first propose a new head, the TDA-YOLO Module, which jointly enhances representation learning for scale-awareness, spatial-awareness, and task-awareness. Second, we steer the intermediate features to learn inter-channel relationships and precise positional information in a coordinated manner. Finally, we improve the neck network and introduce various tricks to boost the adaptability of 3A-YOLO. Extensive experiments on the COCO and VOC benchmarks indicate the effectiveness of our detectors.
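The idea of learning inter-channel relationships together with positional information can be illustrated with a minimal pure-Python sketch in the spirit of Coordinate Attention. This is not the authors' implementation. It assumes a single feature map stored as nested lists and a hypothetical channel-mixing matrix `weights` standing in for the learned 1x1 convolutions, and it omits the normalization, non-linearity, and channel reduction that a full Coordinate Attention block would normally include.

```python
import math


def sigmoid(v):
    """Logistic function used to turn pooled responses into gates in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def mix_channels(pooled, weights):
    """Mix channels at every position (the effect of a 1x1 convolution).

    pooled:  [C_in][L] directional descriptors
    weights: [C_out][C_in] hypothetical mixing matrix (learned in practice)
    """
    length = len(pooled[0])
    return [
        [sum(w * pooled[c][p] for c, w in enumerate(row)) for p in range(length)]
        for row in weights
    ]


def coordinate_attention(x, weights):
    """Reweight a [C][H][W] feature map with row-wise and column-wise gates."""
    C, H, W = len(x), len(x[0]), len(x[0][0])

    # Directional pooling: averaging over the width keeps the row position,
    # averaging over the height keeps the column position.
    pool_h = [[sum(x[c][i]) / W for i in range(H)] for c in range(C)]
    pool_w = [
        [sum(x[c][i][j] for i in range(H)) / H for j in range(W)]
        for c in range(C)
    ]

    # Channel mixing captures inter-channel relationships; the sigmoid
    # converts each mixed response into an attention gate.
    gate_h = [[sigmoid(v) for v in row] for row in mix_channels(pool_h, weights)]
    gate_w = [[sigmoid(v) for v in row] for row in mix_channels(pool_w, weights)]

    # Each value is scaled by the gate of its row and the gate of its column.
    return [
        [[x[c][i][j] * gate_h[c][i] * gate_w[c][j] for j in range(W)]
         for i in range(H)]
        for c in range(C)
    ]
```

With an all-zero mixing matrix every gate equals 0.5, so each value is simply scaled by 0.25. With non-trivial weights, rows and columns with stronger pooled responses receive larger gates, which is how positional information enters the reweighting.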

Paper Structure

This paper contains 13 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Speed-accuracy trade-off of our 3A-YOLO series detectors and representative CNN-based real-time detectors on the COCO dataset. For example, our 3A-YOLO at 608 $\times$ 608 resolution improves on YOLOv4 b1 by 6.2% AP while maintaining a competitive speed of 60.1 FPS.
  • Figure 2: Illustration of our 3A-YOLO. SPP is the Spatial Pyramid Pooling layer b41. $C_i$ and $P_i$ ($i \in \{3,4,5\}$) denote the intermediate features.
  • Figure 3: Illustration of our proposed TDA-YOLO Module.
  • Figure 4: The overall structure of our 3A-YOLO-Tiny.
  • Figure 5: Feature map visualizations coupled with the predictions of 3A-YOLO.