Yet another Improvement of Plantard Arithmetic for Faster Kyber on Low-end 32-bit IoT Devices

Junhao Huang, Haosong Zhao, Jipeng Zhang, Wangchen Dai, Lu Zhou, Ray C. C. Cheung, Cetin Kaya Koc, Donglong Chen

TL;DR

The paper addresses the challenge of efficiently implementing Kyber on memory-constrained 32-bit IoT devices. It advances Plantard arithmetic by enlarging its input range and tailoring it to Kyber's modulus, then couples these arithmetic improvements with ISA-specific optimizations for Cortex-M3 and RISC-V to minimize modular reductions and memory usage in NTT/INTT. The results show considerable NTT/INTT speedups over the state of the art and stack-usage reductions of 23.50% to 28.31% relative to the Cortex-M4 counterpart, enabling speed-version Kyber deployments on devices such as Cortex-M3 and platforms with 16 KiB of RAM, and setting new speed records for Kyber on these platforms. These contributions extend the practicality of PQC in edge environments and offer a pathway to broader adoption of Kyber in IoT contexts, with potential applicability to other LBC schemes with 16-bit moduli.

Abstract

This paper presents another improved version of Plantard arithmetic that speeds up Kyber implementations on two low-end 32-bit IoT platforms (ARM Cortex-M3 and RISC-V) without SIMD extensions. Specifically, we further enlarge the input range of the Plantard arithmetic without modifying its computation steps. After tailoring the Plantard arithmetic for Kyber's modulus, we show that the input range of the Plantard multiplication by a constant is at least 2.14 times larger than that of the original design in TCHES 2022. Then, two optimization techniques for efficient Plantard arithmetic on Cortex-M3 and RISC-V are presented. We show that the Plantard arithmetic supersedes both Montgomery and Barrett arithmetic on low-end 32-bit platforms. With the enlarged input range and the efficient implementation of the Plantard arithmetic on these platforms, we propose various optimization strategies for NTT/INTT. We minimize or entirely eliminate the modular reduction of coefficients in NTT/INTT by taking advantage of the larger input range of the proposed Plantard arithmetic on low-end 32-bit platforms. Furthermore, we propose two memory optimization strategies that reduce the stack usage of the speed-version Kyber implementation by 23.50% to 28.31% compared to its counterpart on Cortex-M4. The proposed optimizations make the speed-version implementation more feasible on low-end IoT devices. Thanks to the aforementioned optimizations, our NTT/INTT implementation shows considerable speedups compared to the state-of-the-art work. Overall, we demonstrate the applicability of the speed-version Kyber implementation on memory-constrained IoT platforms and set new speed records for Kyber on these platforms.
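
To make "Plantard multiplication by a constant" concrete, the following is a minimal Python sketch for Kyber's modulus q = 3329. It follows the general shape of the earlier Plantard design (a precomputed constant, one multiplication modulo 2^32, a shift, an offset, and a second multiply-and-shift). The parameters l = 16 and alpha = 3, and the names `precompute_constant` and `plantard_mul_const`, are assumptions made for illustration; the sketch does not model the enlarged input range or the Cortex-M3 and RISC-V instruction sequences proposed in the paper. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
Q = 3329                       # Kyber modulus
L = 16                         # assumed word parameter l (q fits in 16 bits)
ALPHA = 3                      # assumed alpha, chosen so that q < 2^(l - alpha - 1)
MASK32 = (1 << 2 * L) - 1
QINV = pow(Q, -1, 1 << 2 * L)  # q^{-1} mod 2^{2l}


def to_signed32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= MASK32
    return x - (1 << 32) if x >= (1 << 31) else x


def precompute_constant(b):
    """Return b' = ((b * -2^{2l}) mod q) * q^{-1} mod 2^{2l}, done once per constant."""
    return (((b * -(1 << 2 * L)) % Q) * QINV) & MASK32


def plantard_mul_const(a, b_prime):
    """Return a small signed representative of a * b mod q."""
    t = to_signed32(a * b_prime)   # product kept modulo 2^32, signed
    t = (t >> L) + (1 << ALPHA)    # keep the high half, add 2^alpha
    return (t * Q) >> L            # multiply by q, keep the high half


# Hypothetical usage: multiply by the constant 1.
print(plantard_mul_const(1, precompute_constant(1)))  # 1
```

Only the precomputed `b_prime` depends on the constant, which is why this form suits NTT twiddle factors: the per-coefficient work is two multiplications, two shifts, and one addition.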

Paper Structure

This paper contains 30 sections, 5 equations, 2 figures, 2 tables, 7 algorithms.

Figures (2)

  • Figure 1: Example of the first 16 coefficients in the first three layers of a length-128 INTT using the GS algorithm on RISC-V. $a_i$ and $a_j$ represent coefficients of the polynomial $a$. The red number $x$ in the upper left corner is the maximum absolute value of the INTT input coefficients. Dashed rectangles represent GS butterflies of various step sizes; the computation of a GS butterfly is detailed in the topmost dashed rectangle. The blue number to the left of each rectangle indicates the step size of the butterfly unit, while the red numbers give the maximum absolute value of the corresponding coefficients after each layer.
  • Figure 2: Example of the first 8 coefficients in the first three layers of a length-128 INTT using the CT algorithm on Cortex-M3. $a_i$ and $a_j$ represent coefficients of the polynomial $a$. The red number $x$ in the upper left corner is the maximum absolute value of the INTT input coefficients. Dashed rectangles represent light or CT butterflies of various step sizes; the computations of the light butterfly and the CT butterfly are detailed in the topmost two dashed rectangles. Butterflies drawn with black crossing lines are light butterflies, while those drawn with red crossing lines are CT butterflies. The red numbers give the maximum absolute value of the corresponding coefficients after each layer.
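
The bound tracking that Figure 1 describes can be approximated with a short Python sketch. It propagates worst-case absolute values through GS layers, where each butterfly maps (a_i, a_j) to (a_i + a_j, (a_i - a_j) * zeta). Several things here are assumptions rather than details taken from the paper: the step sizes 2, 4, 8 for the first three layers, the bound (q + 1)/2 on the output of a modular multiplication, and the helper names `gs_layer_bounds` and `track_bounds`. The numbers it produces are therefore illustrative and may differ from the red numbers in the figure.

```python
Q = 3329
HALF_Q = (Q + 1) // 2  # assumed magnitude bound on a modular multiplication result


def gs_layer_bounds(bounds, step):
    """Propagate worst-case |coefficient| bounds through one GS layer."""
    out = list(bounds)
    for start in range(0, len(bounds), 2 * step):
        for i in range(start, start + step):
            j = i + step
            out[i] = bounds[i] + bounds[j]  # a_i + a_j: no reduction, bounds add up
            out[j] = HALF_Q                 # (a_i - a_j) * zeta: reduced by the multiplication
    return out


def track_bounds(x, n=16, steps=(2, 4, 8)):
    """Return the bound vector before the first layer and after each layer."""
    bounds = [x] * n
    history = [bounds]
    for step in steps:
        bounds = gs_layer_bounds(bounds, step)
        history.append(bounds)
    return history


# Hypothetical usage: inputs bounded by q in absolute value.
for layer, bounds in enumerate(track_bounds(Q)):
    print(layer, max(bounds))
```

Under these assumptions the largest bound doubles with every layer on the coefficients that only take the additive branch, so a wider multiplication input range directly translates into more layers that can run without an explicit reduction.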