UniHead: Unifying Multi-Perception for Detection Heads

Hantao Zhou, Rui Yang, Yachao Zhang, Haoran Duan, Yawen Huang, Runze Hu, Xiu Li, Yefeng Zheng

TL;DR

This work targets a core limitation of standard detection heads: the lack of unified perception across deformation, global context, and cross-task alignment. It introduces UniHead, a plug-and-play head composed of three modules: Deformation Perception (via deformable convolution), a Dual-axial Aggregation Transformer (DAT) for global perception, and a Cross-task Interaction Transformer (CIT) for cross-task perception. Empirical results on MS-COCO and VOC show consistent AP/mAP gains across diverse detectors and backbones, including strong improvements with Swin backbones, while maintaining efficiency. The approach offers a practical path to more accurate and consistent object detection and segmentation by tightly coupling key perceptual capabilities within a single head.

Abstract

The detection head constitutes a pivotal component within object detectors, tasked with executing both classification and localization functions. Regrettably, the commonly used parallel head often lacks omni-perceptual capabilities, such as deformation perception, global perception, and cross-task perception. Despite numerous methods attempting to enhance these abilities from a single aspect, achieving a comprehensive and unified solution remains a significant challenge. In response to this challenge, we develop an innovative detection head, termed UniHead, to unify three perceptual abilities simultaneously. More precisely, our approach (1) introduces deformation perception, enabling the model to adaptively sample object features; (2) proposes a Dual-axial Aggregation Transformer (DAT) to adeptly model long-range dependencies, thereby achieving global perception; and (3) devises a Cross-task Interaction Transformer (CIT) that facilitates interaction between the classification and localization branches, thus aligning the two tasks. As a plug-and-play method, the proposed UniHead can be conveniently integrated with existing detectors. Extensive experiments on the COCO dataset demonstrate that our UniHead can bring significant improvements to many detectors. For instance, UniHead yields gains of +2.7 AP with RetinaNet, +2.9 AP with FreeAnchor, and +2.1 AP with GFL. The code is available at https://github.com/zht8506/UniHead.
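
To make the global-perception idea more concrete, the following is a minimal sketch of generic axial attention, under the assumption that "dual-axial" refers to attending along the rows and along the columns of a feature map. It is not the authors' DAT: it has no learned projections, no multiple heads, no positional encoding, and no aggregation block, and the names `attend`, `axial_attention`, and `pair_counts` are hypothetical helpers introduced only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Scaled dot-product self-attention over a 1-D sequence of vectors.

    Assumption: queries, keys, and values are the raw tokens
    (no learned projections), to keep the sketch short.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def axial_attention(grid):
    """Row-wise attention followed by column-wise attention.

    `grid` is an H x W nested list of feature vectors. After both
    passes, every output position depends on every input position.
    """
    rows = [attend(row) for row in grid]
    h, w = len(rows), len(rows[0])
    cols = [attend([rows[i][j] for i in range(h)]) for j in range(w)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [[cols[j][i] for j in range(w)] for i in range(h)]


def pair_counts(h, w):
    """Query-key pairs for full attention vs. the two axial passes."""
    full = (h * w) ** 2
    axial = h * w * (h + w)
    return full, axial


if __name__ == "__main__":
    print(pair_counts(4, 6))  # (576, 240)
```

On a 4 x 6 feature map, full self-attention compares 576 query-key pairs, while the two axial passes compare 240. This is one plausible reading of how an axial design can model long-range dependencies at lower cost than full attention.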

Paper Structure (17 sections, 7 equations, 6 figures, 12 tables)


Figures (6)

  • Figure 1: Overview of the parallel head (a) and our UniHead (b). UniHead integrates deformation perception (DP), global perception (GP), and cross-task perception (CTP) in the detection head.
  • Figure 2: Our UniHead significantly improves the performance of different detectors, including anchor-based (RetinaNet), anchor-free center-based (FCOS), anchor-free keypoint-based (RepPoints), and strong baselines (FreeAnchor, ATSS, GFL).
  • Figure 3: Illustration of our Dual-axial Aggregation Transformer (DAT). ATT and CAB represent attention and Cross-axis Aggregation Block, respectively.
  • Figure 4: Illustration of the proposed Cross-task Interaction Transformer. PE denotes the positional encoding process.
  • Figure 5: Visualization of detection results of RetinaNet with parallel head and RetinaNet with UniHead. With our UniHead, the model is capable of more effectively detecting objects with diverse deformations and scales, and it can produce high-confidence precise bounding boxes. The major difference is marked by the orange circle. Zoom in for a better view.
  • ...and 1 more figure