Low-Rank Compression for IMC Arrays

Kang Eun Jeon, Johnny Rhe, Jong Hwan Ko

TL;DR

This work tackles the inefficiencies of pruning-based model compression on in-memory computing (IMC) arrays by proposing low-rank weight compression combined with shift and duplicate kernel (SDK) mapping and group-based decomposition. Two theorems formalize the core ideas. Theorem 1 shows that the group low-rank reconstruction error $\varepsilon_g$ is bounded above by the traditional low-rank error $\varepsilon$. Theorem 2 expresses the SDK-aligned low-rank form as $\mathcal{D}(\operatorname{SDK}(\mathbf{W}))=(\mathbf{I}_N \otimes \mathbf{L})\operatorname{SDK}(\mathbf{R})$, which enables parallelism and efficient utilization of IMC arrays. Empirically, on ResNet-20 and Wide-ResNet-16-4 with CIFAR-10/100, the proposed method achieves up to a 2.5x speedup or a +20.9% accuracy gain over existing pruning techniques, with 71–80% energy savings relative to pruning and im2col baselines, while outperforming quantized approaches in several configurations. Overall, the approach offers a hardware-friendly, high-accuracy alternative to pruning for IMC architectures that reduces reliance on peripheral circuitry.
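The inequality $\varepsilon_g \le \varepsilon$ stated in Theorem 1 can be checked numerically with a minimal NumPy sketch. The details here are assumptions made for illustration and are not taken from the paper: the weight matrix is split into `g` equal column groups, each group is approximated by a truncated SVD with the same rank `k` as the ungrouped baseline, and error is measured as the squared Frobenius norm. The matrix shape and the helper names (`truncated_svd`, `low_rank_error`, `group_low_rank_error`) are hypothetical.

```python
import numpy as np


def truncated_svd(w: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of w in the Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]


def low_rank_error(w: np.ndarray, k: int) -> float:
    """Squared Frobenius reconstruction error of a single rank-k factorization."""
    return float(np.linalg.norm(w - truncated_svd(w, k)) ** 2)


def group_low_rank_error(w: np.ndarray, k: int, g: int) -> float:
    """Total squared error when each of g column groups gets its own rank-k factorization.

    Assumption: groups are contiguous column blocks and all share the same rank k.
    """
    blocks = np.array_split(w, g, axis=1)
    return sum(low_rank_error(block, k) for block in blocks)


# Hypothetical unrolled convolution weight: 16 output channels, 3*3*16 = 144 inputs.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 144))

eps = low_rank_error(w, k=4)               # traditional low-rank error
eps_g = group_low_rank_error(w, k=4, g=4)  # group low-rank error

# Restricting the global rank-k approximation to one group's columns is itself a
# rank-<=k approximation of that group, so the per-group optimum cannot be worse.
assert eps_g <= eps + 1e-9
```

Under these assumptions the grouped factorization stores `g` left factors instead of one, so the lower error comes at the cost of additional parameters; how the paper balances rank and group count is not specified in this summary.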

Abstract

In this study, we address the challenge of low-rank model compression in the context of in-memory computing (IMC) architectures. Traditional pruning approaches, while effective in reducing model size, require additional peripheral circuitry to manage complex dataflows and mitigate dislocation issues, which increases area and energy overheads. To circumvent these drawbacks, we propose leveraging low-rank compression techniques, which, unlike pruning, streamline the dataflow and integrate seamlessly with IMC architectures. However, low-rank compression presents its own challenges, namely i) suboptimal IMC array utilization and ii) compromised accuracy. To address these issues, we introduce a novel approach that employs i) the shift and duplicate kernel (SDK) mapping technique, which exploits idle IMC columns for parallel processing, and ii) group low-rank convolution, which mitigates the information imbalance in the decomposed matrices. Our experimental results demonstrate that the proposed method achieves up to 2.5x speedup or +20.9% accuracy boost over existing pruning techniques.

Paper Structure

This paper contains 6 sections, 10 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Conventional model compression methods for IMC arrays and the proposed low-rank compression method.
  • Figure 2: Convolutional weight mapping methods.
  • Figure 3: Low-rank matrix decomposition.
  • Figure 4: Motivation of our research.
  • Figure 5: Overview of the proposed techniques for low-rank compression on IMC arrays.
  • ...and 4 more figures