CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Abstract

Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities, a category that increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. CPUs, in contrast, cannot parallelize operations to the same degree, so models benefit from a design philosophy that balances the number of operations (MACs) against hardware-efficient execution, i.e., a high rate of MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency requires preserving hardware efficiency. Our experiments across diverse CPU devices confirm that these adaptations retain high hardware efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU-based inference. CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at https://github.com/altair199797/CPUBone.
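
To make the computational effect of the two adaptations concrete, the following minimal Python sketch counts the MACs of a standard convolution and of its grouped and reduced-kernel variants. The feature-map and channel sizes are illustrative assumptions, not values from the paper.

```python
def conv_macs(h, w, c_in, c_out, k, groups=1):
    """MACs of a 2D convolution over an h x w feature map.

    Each of the h*w*c_out outputs accumulates (c_in/groups)*k*k products,
    so grouping divides the cost by `groups` and the kernel size
    contributes quadratically.
    """
    return h * w * c_out * (c_in // groups) * k * k

# Illustrative layer: 56x56 feature map, 128 -> 128 channels.
baseline = conv_macs(56, 56, 128, 128, k=5)            # standard 5x5 conv
grouped  = conv_macs(56, 56, 128, 128, k=5, groups=2)  # grouping halves the MACs
small_k  = conv_macs(56, 56, 128, 128, k=3)            # 3x3 costs (3/5)^2 of 5x5

print(f"standard 5x5: {baseline / 1e6:.1f} MMACs")
print(f"grouped  5x5: {grouped / 1e6:.1f} MMACs")
print(f"standard 3x3: {small_k / 1e6:.1f} MMACs")
```

Dividing a model's total MACs by its measured latency yields the MACpS rate the abstract refers to; the paper's point is that grouping and kernel-size reduction only pay off on CPUs if this rate stays high after the modification.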

Figures (4)

  • Figure 1: Comparison of ARM CPU latency and ImageNet top-1 accuracy of recent image classification architectures, including our proposed CPUBone models. Markers refer to different model sizes within each architecture family. CPUBone consistently achieves lower latency at comparable accuracy, highlighting its CPU-optimized design.
  • Figure 2: Design of MBConv variants, including the two proposed Grouped MBConv (GrMBConv) and Grouped Fused MBConv (GrFuMBConv). In both grouped variants, the first convolution is configured with $groups=2$. All variants expand the channel dimension in the first convolution by the expansion factor (set to four in this figure) and reduce it again by the same factor in the last convolution (see the code sketch after this list).
  • Figure 3: CPUBone macro architecture design. Attention refers to LowFormer Attention [nottebaum2025lowformer].
  • Figure 4: Results on semantic segmentation using Semantic FPN [semanticfpn], trained and evaluated on ADE20K [ade20k]. Backbone latency is measured at a resolution of $512\times512$. Models are grouped by mIoU; the best value in each group is shown in bold.
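
The caption of Figure 2 pins down the grouped variants at block level. Below is a minimal PyTorch sketch of a GrMBConv block under those constraints; the normalization and activation choices (BatchNorm and ReLU) and the residual connection are assumptions typical of MBConv-style blocks, not details confirmed by the figure.

```python
import torch
import torch.nn as nn

class GrMBConv(nn.Module):
    """Grouped MBConv sketch following the Figure 2 caption: the first
    (expansion) convolution uses groups=2, channels are expanded by
    `expansion` and reduced again by the last convolution."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # 1x1 expansion convolution with groups=2 (the grouped modification)
            nn.Conv2d(channels, hidden, kernel_size=1, groups=2, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            # 3x3 depthwise convolution, as in a standard MBConv
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            # 1x1 projection back down by the expansion factor
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # assumption: residual connection, standard for stride-1 MBConv blocks
        return x + self.block(x)

x = torch.randn(1, 64, 56, 56)
print(GrMBConv(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

For the fused variant (GrFuMBConv), the usual Fused-MBConv construction would merge the grouped 1x1 expansion and the depthwise convolution into a single grouped 3x3 convolution; the exact fused layout is not specified in the caption, so this reading is an assumption.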