PKINet-v2: Towards Powerful and Efficient Poly-Kernel Remote Sensing Object Detection

Xinhao Cai, Liulei Li, Gensheng Pei, Zeren Sun, Yazhou Yao, Wenguan Wang

Abstract

Object detection in remote sensing images (RSIs) is challenged by the coexistence of geometric and spatial complexity: targets may appear with diverse aspect ratios while spanning a wide range of object sizes under varied contexts. Existing RSI backbones address the two challenges separately, either by adopting anisotropic strip kernels to model slender targets or by using isotropic large kernels to capture broader context. However, such isolated treatments lead to complementary drawbacks: the strip-only design can disrupt spatial coherence for regular-shaped objects and weaken tiny details, whereas isotropic large kernels often introduce severe background noise and geometric mismatch for slender structures. In this paper, we extend PKINet and present a powerful and efficient backbone that jointly handles both challenges within a unified paradigm named Poly Kernel Inception Network v2 (PKINet-v2). PKINet-v2 synergizes anisotropic axial-strip convolutions with isotropic square kernels and builds a multi-scope receptive field, preserving fine-grained local textures while progressively aggregating long-range context across scales. To enable efficient deployment, we further introduce a Heterogeneous Kernel Re-parameterization (HKR) Strategy that fuses all heterogeneous branches into a single depth-wise convolution for inference, eliminating fragmented kernel launches without accuracy loss. Extensive experiments on four widely used benchmarks, namely DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R, demonstrate that PKINet-v2 achieves state-of-the-art accuracy while delivering a $\textbf{3.9}\times$ FPS acceleration compared to PKINet-v1, surpassing previous remote sensing backbones in both effectiveness and efficiency.

Paper Structure (19 sections, 10 equations, 7 figures, 10 tables, 3 algorithms)

Figures (7)

  • Figure 1: Performance of PKINet-v2 on the DOTA-v1 dataset \cite{xia2018dota}. PKINet-v2 consistently improves mAP while substantially boosting inference efficiency over PKINet-v1 \cite{cai2024poly} across representative oriented detectors.
  • Figure 2: PKINet-v2 overview. PKINet-v2 consists of four stages. Each (a) Stage $l$ contains a Patch Embedding layer followed by a sequence of $N_l$ PKINet-v2 Blocks. Each (b) PKINet-v2 Block is composed of a PKS sub-block and an FFN sub-block. The (c) PKS Block mainly comprises a PKS Module and two Fully Connected (FC) layers. The (d) PKS Module (§\ref{sec:PKS}) builds a multi-scope spatial attention map by aggregating heterogeneous branches with wide-range receptive fields and fusing them with a $1\times1$ convolution. Here, $n\!=\!0,\dots,N_l\!-\!1$ indicates that the PKS Module/Block is located in the $n$-th PKINet-v2 Block of the $l$-th stage. Please refer to §\ref{sec:method} for more details.
  • Figure 3: Receptive field construction and HKR Strategy. (Left) The multi-branch PKS Module (§\ref{sec:PKS}) aggregates heterogeneous depth-wise branches to form a hierarchically densified receptive field with a broad receptive-field range, where global full-span coverage is progressively enriched toward the center. (Right) Heterogeneous Kernel Re-parameterization (HKR, §\ref{sec:HKR}) algebraically fuses Conv-BN and merges all branches into one equivalent $K_{\max}\!\times\!K_{\max}$ depth-wise convolution, preserving identical outputs while significantly improving inference efficiency. Please refer to §\ref{sec:method} for details.
  • Figure 4: Visual results on the DOTA-v1.0 dataset \cite{xia2018dota}. (Top) Geometric complexity. (Bottom) Spatial complexity. PKINet-v2 delivers more robust detection than other methods \cite{cai2024poly, yuan2025strip, Li_2023_ICCV} under both challenges. Please refer to §\ref{sec:qua_result} for more details.
  • Figure S1: More qualitative comparisons with PKINet-v1 \cite{cai2024poly} on DOTA \cite{xia2018dota}, using Oriented R-CNN \cite{xie2021oriented}. See §\ref{sec:more_qua_results} for details.
  • ...and 2 more figures
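
The Conv-BN fusion and branch merging described in the Figure 3 caption can be illustrated with a small single-channel sketch. This is not the authors' implementation: it assumes the branches are summed in parallel, use stride 1, odd kernel sizes, no dilation, zero "same" padding, and BatchNorm with fixed inference statistics. The helper names (`conv2d_same`, `fuse_bn`, `pad_to`, `merge_branches`, `multi_branch_forward`) and the example branch set (a $3\times3$ square kernel plus $1\times5$ and $5\times1$ strips) are hypothetical and chosen only for illustration.

```python
import math

def conv2d_same(x, k, bias=0.0):
    """Single-channel cross-correlation with zero 'same' padding (odd kernel dims)."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = bias
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - ph, j + b - pw
                    if 0 <= ii < H and 0 <= jj < W:
                        acc += k[a][b] * x[ii][jj]
            out[i][j] = acc
    return out

def fuse_bn(k, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-mode BN into a bias-free conv; returns (kernel, bias)."""
    s = gamma / math.sqrt(var + eps)
    return [[w * s for w in row] for row in k], beta - mean * s

def pad_to(k, K):
    """Embed an odd-sized kernel at the center of a K x K zero kernel."""
    kh, kw = len(k), len(k[0])
    top, left = (K - kh) // 2, (K - kw) // 2
    out = [[0.0] * K for _ in range(K)]
    for a in range(kh):
        for b in range(kw):
            out[top + a][left + b] = k[a][b]
    return out

def merge_branches(branches, K):
    """Merge (kernel, (gamma, beta, mean, var)) branches into one K x K kernel and bias."""
    merged = [[0.0] * K for _ in range(K)]
    bias = 0.0
    for k, bn in branches:
        fused_k, fused_b = fuse_bn(k, *bn)
        padded = pad_to(fused_k, K)
        for a in range(K):
            for b in range(K):
                merged[a][b] += padded[a][b]
        bias += fused_b
    return merged, bias

def multi_branch_forward(x, branches, eps=1e-5):
    """Reference path (assumed): run every Conv-BN branch separately, then sum."""
    H, W = len(x), len(x[0])
    total = [[0.0] * W for _ in range(H)]
    for k, (gamma, beta, mean, var) in branches:
        y = conv2d_same(x, k)
        s = gamma / math.sqrt(var + eps)
        for i in range(H):
            for j in range(W):
                total[i][j] += (y[i][j] - mean) * s + beta
    return total

if __name__ == "__main__":
    x = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(6)]
    branches = [
        ([[0.5, -1.0, 0.25], [1.0, 2.0, -0.5], [0.0, 0.75, -0.25]], (1.2, 0.1, 0.05, 0.9)),
        ([[0.1, -0.2, 0.3, 0.4, -0.5]], (0.8, -0.2, -0.1, 1.5)),        # 1 x 5 strip
        ([[0.2], [0.1], [-0.3], [0.6], [0.05]], (1.0, 0.0, 0.3, 0.5)),  # 5 x 1 strip
    ]
    kernel, bias = merge_branches(branches, K=5)
    ref = multi_branch_forward(x, branches)
    out = conv2d_same(x, kernel, bias)
    print(max(abs(o - r) for ro, rr in zip(out, ref) for o, r in zip(ro, rr)))
```

Under these assumptions the merged path and the multi-branch path agree up to floating-point rounding, because convolution is linear in the kernel and zero-padding a small kernel to $K_{\max}\times K_{\max}$ leaves its response unchanged. A depth-wise layer would apply the same merge independently to each channel.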