CA-YOLO: Cross Attention Empowered YOLO for Biomimetic Localization

Zhen Zhang, Qing Zhao, Xiuhe Li, Cheng Wang, Guoqiang Zhu, Yu Zhang, Yining Huo, Hongyi Yu, Yi Zhang

TL;DR

This work tackles the challenge of accurate and robust localization of small, dynamic targets in complex environments by coupling CA-YOLO, a recognition backbone augmented with Multi-Head Self-Attention, a dedicated small-target head, and the CFAM fusion module, with a biomimetic pan-tilt tracking system inspired by the vestibulo-ocular reflex. The CA-YOLO module delivers improved multi-scale detection and small-target performance, while the Bio-Pan-Tilt module provides center-focused, stable tracking through center positioning, stability optimization via a decision boundary, an adaptive control coefficient, and an intelligent recapture strategy. Experimental results on COCO, VisDrone, and custom AGV/UAV datasets show CA-YOLO achieves higher accuracy and remains feasible for real-time deployment, with notable gains in small-target detection and robust tracking under variable speeds. The integrated system demonstrates practical potential for time-sensitive localization tasks in robotics and surveillance, with room for extending to mobile carriers and unknown targets in future work.

Abstract

In modern complex environments, achieving accurate and efficient target localization is essential in numerous fields. However, existing systems often face limitations in both accuracy and the ability to recognize small targets. In this study, we propose a bionic stabilized localization system based on CA-YOLO, designed to enhance both target localization accuracy and small target recognition capabilities. Acting as the "brain" of the system, the target detection algorithm emulates the visual focusing mechanism of animals by integrating bionic modules into the YOLO backbone network. These modules include the introduction of a small target detection head and the development of a Characteristic Fusion Attention Mechanism (CFAM). Furthermore, drawing inspiration from the human Vestibulo-Ocular Reflex (VOR), a bionic pan-tilt tracking control strategy is developed, which incorporates central positioning, stability optimization, adaptive control coefficient adjustment, and an intelligent recapture function. The experimental results show that CA-YOLO outperforms the original model on standard datasets (COCO and VisDrone), with average accuracy metrics improved by 3.94% and 4.90%, respectively. Further time-sensitive target localization experiments validate the effectiveness and practicality of this bionic stabilized localization system.

Paper Structure (28 sections, 19 equations, 14 figures, 6 tables)

Figures (14)

  • Figure 1: (a) shows the structure of the human Vestibulo-Ocular Reflex (VOR), which inspired the design of the bionic pan-tilt; (b) presents the overall framework of the bionic pan-tilt system, which includes a high-precision camera, a computer terminal, an STM32 single-chip microcomputer, and a servo pan-tilt. Multi-scale target detection and stable tracking are achieved through the CA-YOLO module.
  • Figure 2: On the basis of YOLO, the CA-YOLO framework improves the performance of multi-scale object detection by placing MHSA after SPPF, adding a small object detection head (xSmall), and replacing some Concat modules with CFAM modules.
  • Figure 3: Schematic diagram of the small-target detection head newly added to the CA-YOLO network. The higher-resolution detection layer retains more details of small targets, improving small-target detection accuracy and reducing the missed-detection rate.
  • Figure 4: Decomposition of the computational process of the multi-head self-attention mechanism (MHSA). (a) Illustration of single-head attention computation: the Q, K and V vectors are combined by matrix multiplication to compute similarity; (b) Multi-head attention integration: Q, K and V are linearly transformed to generate representations for each head, each head computes single-head attention, the head outputs are concatenated, and a final linear layer integrates them to produce the MHSA output.
  • Figure 5: A hybrid feature fusion architecture that combines convolution with the MHSA mechanism, integrating information and producing its output through weighted summation and feature concatenation.
  • ...and 9 more figures
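
The MHSA computation summarized in the Figure 4 caption can be sketched in a few lines of NumPy. This is a minimal illustration of the generic mechanism, not the authors' implementation: the function names, the weight matrices `w_q`, `w_k`, `w_v`, `w_o`, the tensor sizes, and the 1/sqrt(d) scaling of the similarity scores (the standard scaled dot-product form) are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(q, k, v):
    """Single-head attention (panel a); q, k, v have shape (tokens, d_head)."""
    # Similarity via matrix multiplication; the scaling factor is an assumption.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head integration (panel b); x has shape (tokens, d_model)."""
    d_model = x.shape[1]
    d_head = d_model // num_heads
    # Linear transforms generate Q, K and V, which are then split per head.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        out, _ = single_head_attention(q[:, cols], k[:, cols], v[:, cols])
        heads.append(out)
    # Concatenate the head outputs and integrate them with a final linear layer.
    return np.concatenate(heads, axis=-1) @ w_o

# Hypothetical sizes: 16 flattened feature-map positions, 32 channels, 4 heads.
rng = np.random.default_rng(0)
tokens, d_model, num_heads = 16, 32, 4
x = rng.standard_normal((tokens, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(y.shape)  # (16, 32)
```

In a detector, the (tokens, d_model) input would typically come from flattening a feature map of shape (H, W, C) into H*W tokens, so the output can be reshaped back to the same spatial layout.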