Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Yuwen Xiong; Zhiqi Li; Yuntao Chen; Feng Wang; Xizhou Zhu; Jiapeng Luo; Wenhai Wang; Tong Lu; Hongsheng Li; Yu Qiao; Lewei Lu; Jie Zhou; Jifeng Dai

TL;DR

DCNv4 is an efficient deformable convolution operator that removes softmax normalization from spatial aggregation and optimizes memory access, achieving more than 3× the forward speed of DCNv3 and significantly faster convergence. The operator combines the inductive bias of ConvNets with dynamic, input-dependent sampling, and its optimizations target memory-bound GPU execution to deliver practical speedups. Empirical results show strong performance across image classification, instance and semantic segmentation, 3D detection, and diffusion-model image generation; the FlashInternImage backbone, built by replacing DCNv3 with DCNv4 in InternImage, runs up to 80% faster. The work argues for DCNv4 as a foundational, universal vision operator, demonstrates its applicability across backbones and generative models, and releases implementation details for the community.

Abstract

We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: (1) removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power, and (2) optimizing memory access to minimize redundant operations for speedup. These improvements result in significantly faster convergence than DCNv3 and a substantial increase in processing speed, with DCNv4 achieving more than three times the forward speed. DCNv4 demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and, notably, image generation. When integrated into generative models such as the U-Net in a latent diffusion model, DCNv4 outperforms its baseline, underscoring its potential to enhance generative models. In practical applications, replacing DCNv3 with DCNv4 in the InternImage model to create FlashInternImage yields up to an 80% speed increase and a further performance improvement without any other modifications. The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.
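
The first enhancement, dropping softmax from spatial aggregation, can be illustrated with a small sketch. This is not the authors' implementation: the function names `softmax` and `aggregate` and all numbers are hypothetical, chosen only to show how softmax-normalized weights restrict the output to a convex combination of the sampled values, while unnormalized weights do not.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(values, weights):
    """Weighted sum of sampled feature values at one query location."""
    return sum(v * w for v, w in zip(values, weights))

# Hypothetical feature values at K = 3 sampling points for one query pixel.
sampled = [1.0, 2.0, 4.0]
# Hypothetical input-dependent aggregation weights before any normalization.
raw_weights = [2.0, -1.0, 1.5]

# Softmax-normalized weights (the DCNv3-style choice described above):
# each weight lies in [0, 1] and they sum to 1, so the result stays
# between min(sampled) and max(sampled).
bounded = aggregate(sampled, softmax(raw_weights))

# Unnormalized weights (the DCNv4-style choice): weights may be negative
# or larger than 1, as in an ordinary convolution kernel.
unbounded = aggregate(sampled, raw_weights)

print(bounded, unbounded)
```

With these illustrative numbers the unnormalized result falls outside the range of the sampled values, which the softmax-bounded result cannot do. That is one concrete way to read the abstract's claim about expressive power.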

Paper Structure (40 sections, 2 equations, 4 figures, 15 tables)

Introduction
Related Work
Core operators in vision models:
Memory access cost (MAC) in vision backbones:
Method
Rethinking the Dynamic Property in Deformable Convolution
Revisiting DCNv3:
Softmax normalization:
Enhancing dynamic property:
Speeding up DCN
Theoretical analysis of GPU efficiency
Eliminating redundant workload:
Eliminating redundant memory instructions:
Micro design in DCN module:
Experiments
...and 25 more sections

Figures (4)

Figure 1: (a) We show relative runtime with DCNv3 as the baseline. DCNv4 shows significant speedup over DCNv3, and surpasses other common vision operators. (b) With the same network architecture, DCNv4 converges faster than other operators, while DCNv3 falls behind in the initial training phase.
Figure 1: ImageNet $256\times 256$ generation results of U-Net + DCNv4 latent diffusion model.
Figure 2: Comparisons of core operators in spatial aggregation for query pixels on different locations within the same channel. (a) Attention and (b) DCNv3 use bounded (range from $0\sim 1$) dynamic weights to aggregate spatial features, while the window (sampling point set) for attention is the same, and DCNv3 uses a dedicated window for each location. (c) Convolution has a more flexible unbounded value range for aggregation weights and uses a dedicated sliding window for each location, but the window shape and aggregation weights are input-independent. (d) DCNv4 combines their advantages, using an adaptive aggregation window and dynamic aggregation weights with an unbounded value range.
Figure 3: Illustration of our optimization. In DCNv4, we use one thread to process multiple channels in the same group that shares sampling offset and aggregation weights. Workloads like memory reading and bilinear interpolation coefficient computation can be reduced, and multiple memory access instructions can be merged.
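
The idea in the Figure 3 caption can be sketched on the CPU as follows. This is only an illustration under stated assumptions, not the CUDA kernel: the helpers `bilinear_coeffs` and `sample_group`, the `feature[row][col][channel]` layout, and the sample values are all hypothetical. The point it shows is that channels in one group share a sampling location, so the bilinear coefficients are computed once and each neighbouring pixel is read once for all channels.

```python
import math

def bilinear_coeffs(x, y):
    """Return (col, row, weight) for the four neighbours of fractional (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    dx, dy = x - x0, y - y0
    return [
        (x0, y0, (1 - dx) * (1 - dy)),
        (x0 + 1, y0, dx * (1 - dy)),
        (x0, y0 + 1, (1 - dx) * dy),
        (x0 + 1, y0 + 1, dx * dy),
    ]

def sample_group(feature, x, y):
    """Bilinearly sample every channel of one group at a shared location.

    feature[row][col] holds the list of channel values for the group
    (an assumed layout). Coefficients are computed once and reused for
    every channel, instead of once per channel.
    """
    coeffs = bilinear_coeffs(x, y)          # computed once per sampling point
    channels = len(feature[0][0])
    out = [0.0] * channels
    for col, row, w in coeffs:
        pixel = feature[row][col]           # one read covers all channels in the group
        for c in range(channels):
            out[c] += w * pixel[c]
    return out

# Hypothetical 2x2 feature map with 2 channels per location.
feature = [
    [[0.0, 10.0], [1.0, 20.0]],
    [[2.0, 30.0], [3.0, 40.0]],
]
result = sample_group(feature, 0.5, 0.5)
print(result)
```

In a per-channel design, the coefficient computation and the neighbour lookup would be repeated for each channel. Grouping them as above is the kind of redundancy removal the caption describes, though the real gains depend on GPU memory behaviour that this sketch does not model.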
