Benchmarking Post-Quantum Cryptography on Resource-Constrained IoT Devices: ML-KEM and ML-DSA on ARM Cortex-M0+

Rojin Chhetri

Abstract

The migration to post-quantum cryptography is urgent for Internet of Things devices with 10-20 year lifespans, yet no systematic benchmarks exist for the finalised NIST standards on the most constrained 32-bit processor class. This paper presents the first isolated algorithm-level benchmarks of ML-KEM (FIPS 203) and ML-DSA (FIPS 204) on ARM Cortex-M0+, measured on the RP2040 (Raspberry Pi Pico) at 133 MHz with 264 KB SRAM. Using PQClean reference C implementations, we measure all three security levels of ML-KEM (512/768/1024) and ML-DSA (44/65/87) across key generation, encapsulation/signing, and decapsulation/verification. ML-KEM-512 completes a full key exchange in 36.3 ms while consuming 2.87 mJ, which is 17x faster and uses 94% less energy than ECDH P-256 on the same hardware. ML-DSA signing exhibits high latency variance due to rejection sampling (coefficient of variation 61-71%, 99th percentile up to 1,115 ms for ML-DSA-87). The M0+ incurs only a 1.8-1.9x slowdown relative to published Cortex-M4 results, despite lacking 64-bit multiply (UMULL), DSP, and SIMD instructions. All code, data, and scripts are released as an open-source benchmark suite for reproducibility.

Paper Structure (31 sections, 5 figures, 8 tables)

Figures (5)

  • Figure 1: ML-DSA-44 signing time distribution over 100 runs. The geometric-like shape reflects FIPS 204 rejection sampling: most signatures succeed within 1-2 iterations, but tail events exceed 500 ms.
  • Figure 2: ML-DSA signing latency variance across security levels. Box plots show the interquartile range; outliers represent high-iteration rejection sampling events.
  • Figure 3: Key exchange latency comparison on ARM Cortex-M0+. ML-KEM-512 achieves a full handshake in 36.3 ms, 17x faster than ECDH P-256.
  • Figure 4: ML-KEM handshake time: Cortex-M0+ (this work) vs Cortex-M4 (pqm4). The M0+ incurs a modest 1.8-1.9x slowdown despite lacking UMULL, DSP, and SIMD instructions.
  • Figure 5: Energy per key exchange on RP2040 (estimated from datasheet: 3.3 V, 24 mA). ML-KEM-512 consumes 2.87 mJ, 94% less than ECDH P-256.
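
The energy figure in the Figure 5 caption can be reproduced from the stated datasheet values with E = V x I x t. The sketch below is illustrative only: it assumes a constant 24 mA draw at 3.3 V for the whole 36.3 ms handshake, and the function and constant names are hypothetical, not taken from the paper's benchmark suite.

```python
def energy_mj(voltage_v: float, current_ma: float, duration_ms: float) -> float:
    """Estimate energy in millijoules from E = V * I * t.

    Assumes constant current draw over the whole interval (a datasheet-based
    estimate, not a measured power trace).
    """
    power_mw = voltage_v * current_ma        # volts * milliamps = milliwatts
    energy_uj = power_mw * duration_ms       # milliwatts * milliseconds = microjoules
    return energy_uj / 1000.0                # microjoules -> millijoules


# Values quoted in the abstract and the Figure 5 caption.
RP2040_VOLTAGE_V = 3.3
RP2040_CURRENT_MA = 24.0
MLKEM512_HANDSHAKE_MS = 36.3

mlkem512_mj = energy_mj(RP2040_VOLTAGE_V, RP2040_CURRENT_MA, MLKEM512_HANDSHAKE_MS)
print(f"ML-KEM-512 handshake: {mlkem512_mj:.3f} mJ")
```

Under these assumptions the estimate is about 2.875 mJ, in line with the reported 2.87 mJ. Because the current is taken as constant, the energy ratio between two algorithms on this model equals their latency ratio.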