Towards Efficient Multi-Scale Deformable Attention on NPU
Chenghuan Huang, Zhigeng Xu, Chong Sun, Chen Li, Ziyang Ma
TL;DR
This work tackles the efficiency bottlenecks of Multi-Scale Deformable Attention (MSDA) on Ascend NPUs by introducing xMSDA, a co-design that adapts the MSDA algorithm to the NPU architecture. Guided by detailed hardware profiling, it applies targeted optimizations, including type-unaligned gather handling, adaptive vectorization, padding-based alignment, and contention-aware scheduling. Ablation studies validate the contribution of each optimization. Results show speedups of up to $5.9\times$ (forward), $8.9\times$ (backward), and $7.3\times$ (end-to-end training) over a PyTorch grid-sample baseline, and $1.9\times$, $2.4\times$, and $2.0\times$, respectively, over the latest vendor library. The methodology offers a practical blueprint for deploying flexible attention operators efficiently on domain-specific accelerators.
Abstract
Multi-scale deformable attention (MSDA) is a flexible and powerful feature extraction mechanism for visual tasks, but its random-access grid sampling strategy poses significant optimization challenges, especially on domain-specific accelerators such as NPUs. In this work, we present a co-design approach that systematically rethinks memory access and computation strategies for MSDA on the Ascend NPU architecture. With this approach, our implementation supports efficient forward and backward computation, is fully adapted to training workloads, and incorporates a suite of hardware-aware optimizations. Extensive experiments show that our solution achieves up to $5.9\times$ (forward), $8.9\times$ (backward), and $7.3\times$ (end-to-end training) speedup over the grid-sample-based baseline, and $1.9\times$, $2.4\times$, and $2.0\times$ speedup, respectively, over the latest vendor library.
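
To make the random-access pattern concrete, the sketch below computes the MSDA output for a single query in pure Python. It is an illustration only and not the xMSDA kernel or the baseline used in the experiments: the function names (bilinear_sample, msda_query) are hypothetical, the feature maps are single-channel and single-head, and the coordinate mapping is assumed to follow the pixel-center, zero-padding convention of PyTorch's grid_sample with align_corners=False.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a 2D map (list of rows) at pixel coords (x, y).

    Neighbours that fall outside the map contribute zero (zero padding).
    """
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    out = 0.0
    # Gather the four neighbouring pixels; their addresses depend on the
    # data-dependent sampling location, hence the irregular memory access.
    for dy, wy in ((0, 1.0 - fy), (1, fy)):
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            xi, yi = x0 + dx, y0 + dy
            if 0 <= xi < w and 0 <= yi < h:
                out += wx * wy * feat[yi][xi]
    return out


def msda_query(levels, locations, weights):
    """Weighted sum of bilinear samples over all levels and points.

    levels:    list of 2D feature maps, one per scale (sizes may differ)
    locations: locations[l][p] = (u, v), normalized to [0, 1] per level
    weights:   weights[l][p], attention weights (assumed to sum to 1)
    """
    out = 0.0
    for feat, locs, ws in zip(levels, locations, weights):
        h, w = len(feat), len(feat[0])
        for (u, v), a in zip(locs, ws):
            # Assumed convention: normalized -> pixel-center coordinates.
            x = u * w - 0.5
            y = v * h - 0.5
            out += a * bilinear_sample(feat, x, y)
    return out


if __name__ == "__main__":
    levels = [[[1.0, 2.0], [3.0, 4.0]], [[10.0]]]
    locations = [[(0.5, 0.5)], [(0.5, 0.5)]]
    weights = [[0.5], [0.5]]
    print(msda_query(levels, locations, weights))
```

Each sampling point touches four pixels at addresses known only at run time, and the per-level map sizes differ, which is why this operator maps poorly onto hardware that favors contiguous, aligned vector loads.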
