Yet another Improvement of Plantard Arithmetic for Faster Kyber on Low-end 32-bit IoT Devices
Junhao Huang, Haosong Zhao, Jipeng Zhang, Wangchen Dai, Lu Zhou, Ray C. C. Cheung, Cetin Kaya Koc, Donglong Chen
TL;DR
The paper addresses the challenge of implementing Kyber efficiently on memory-constrained 32-bit IoT devices. It enlarges the input range of Plantard arithmetic without changing its computation steps, tailors the arithmetic to Kyber's modulus, and pairs it with ISA-specific optimizations for Cortex-M3 and RISC-V that minimize or eliminate modular reductions in NTT/INTT. Two memory optimizations cut stack usage of the speed-version implementation by 23.50% to 28.31% relative to its Cortex-M4 counterpart, making fast Kyber feasible on Cortex-M3 and on platforms with 16KiB of RAM, and the NTT/INTT speedups set new speed records for Kyber on these platforms. These contributions make PQC more practical in edge environments and may carry over to other lattice-based schemes with 16-bit moduli.
Abstract
This paper presents another improved version of Plantard arithmetic that speeds up Kyber implementations on two low-end 32-bit IoT platforms (ARM Cortex-M3 and RISC-V) without SIMD extensions. Specifically, we further enlarge the input range of the Plantard arithmetic without modifying its computation steps. After tailoring the Plantard arithmetic for Kyber's modulus, we show that the input range of the Plantard multiplication by a constant is at least 2.14 times larger than that of the original design in TCHES2022. Then, two optimization techniques for efficient Plantard arithmetic on Cortex-M3 and RISC-V are presented. We show that the Plantard arithmetic outperforms both Montgomery and Barrett arithmetic on low-end 32-bit platforms. With the enlarged input range and the efficient implementation of the Plantard arithmetic on these platforms, we propose various optimization strategies for NTT/INTT. We minimize or entirely eliminate the modular reduction of coefficients in NTT/INTT by taking advantage of the larger input range of the proposed Plantard arithmetic on low-end 32-bit platforms. Furthermore, we propose two memory optimization strategies that reduce stack usage by 23.50% to 28.31% for the speed-version Kyber implementation when compared to its counterpart on Cortex-M4. The proposed optimizations make the speed-version implementation more feasible on low-end IoT devices. Thanks to the aforementioned optimizations, our NTT/INTT implementation shows considerable speedups compared to the state-of-the-art work. Overall, we demonstrate the applicability of the speed-version Kyber implementation on memory-constrained IoT platforms and set new speed records for Kyber on these platforms.
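To make the "Plantard multiplication by a constant" concrete, the following is a minimal Python sketch of the signed word-size formulation, assuming the usual Kyber parameters q = 3329, l = 16 and alpha = 3. The function names (`plantard_const`, `plantard_mul_const`, `to_signed32`) are hypothetical and not taken from the paper. The sketch only checks the arithmetic identity on 16-bit signed inputs; it does not reproduce the enlarged input range or the Cortex-M3 and RISC-V instruction sequences described in the abstract.

```python
Q = 3329                      # Kyber modulus
L = 16                        # word half-size l (assumed)
ALPHA = 3                     # slack parameter alpha, q < 2^(l - alpha - 1) (assumed)
MASK = (1 << (2 * L)) - 1     # 2^32 - 1
QINV = pow(Q, -1, 1 << (2 * L))   # q^{-1} mod 2^32


def to_signed32(x):
    """Interpret x mod 2^32 as a signed 32-bit integer."""
    x &= MASK
    return x - (1 << 32) if x >= (1 << 31) else x


def plantard_const(b):
    """Precompute b' = (b * (-2^32) mod q) * q^{-1} mod 2^32 for a constant b."""
    b_t = (b * -(1 << (2 * L))) % Q
    return (b_t * QINV) & MASK


def plantard_mul_const(a, b_prime):
    """Return r with r == a * b (mod q), using one wrapped 32-bit product,
    two shifts, one addition and one multiplication by q."""
    p = to_signed32(a * b_prime)          # low 32 bits of a * b', signed
    p_hi = p >> L                         # top half (arithmetic shift)
    return ((p_hi + (1 << ALPHA)) * Q) >> L


if __name__ == "__main__":
    zeta = 17                             # example constant, e.g. a twiddle factor
    zeta_p = plantard_const(zeta)
    for a in (-3000, -1, 0, 1, 1234, 30000):
        r = plantard_mul_const(a, zeta_p)
        assert (r - a * zeta) % Q == 0
        print(a, r)
```

Because the constant is folded into `b_prime` ahead of time, each multiplication by a twiddle factor needs no separate reduction step, which is the property the NTT/INTT optimizations rely on.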
