Matthias J. Kannwischer
Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4
The U.S. National Institute of Standards and Technology (NIST) has designated ARM microcontrollers as an important benchmarking platform for its Post-Quantum Cryptography standardization process (NISTPQC). In view of this, we explore the design space of the NISTPQC finalist Saber on the Cortex-M4 and its close relation, the Cortex-M3. In the process, we investigate various optimization strategies and memory-time tradeoffs for number-theoretic transforms (NTTs). Recent work by Chung et al. has shown that NTT multiplication is superior compared to Toom--Cook multiplication for unprotected Saber implementations on the Cortex-M4 in terms of speed. However, it remains unclear if NTT multiplication can outperform Toom--Cook in masked implementations of Saber. Additionally, it is an open question if Saber with NTTs can outperform Toom--Cook in terms of stack usage. We answer both questions in the affirmative. Additionally, we present a Cortex-M3 implementation of Saber using NTTs outperforming an existing Toom--Cook implementation. Our stack-optimized unprotected M4 implementation uses around the same amount of stack as the most stack-optimized implementation using Toom--Cook while being 33%-41% faster. Our speed-optimized masked M4 implementation is 16% faster than the fastest masked implementation using Toom--Cook. For the Cortex-M3, we outperform existing implementations by 29%-35% in speed. We conclude that for both stack- and speed-optimization purposes, one should base polynomial multiplications in Saber on the NTT rather than Toom--Cook for the Cortex-M4 and Cortex-M3. In particular, in many cases, composite moduli NTTs perform best.
Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1
We present new speed records on the Armv8-A architecture for the lattice-based schemes Dilithium, Kyber, and Saber. The core novelty in this paper is the combination of Montgomery multiplication and Barrett reduction resulting in “Barrett multiplication” which allows particularly efficient modular one-known-factor multiplication using the Armv8-A Neon vector instructions. These novel techniques combined with fast two-unknown-factor Montgomery multiplication, Barrett reduction sequences, and interleaved multi-stage butterflies result in significantly faster code. We also introduce “asymmetric multiplication” which is an improved technique for caching the results of the incomplete NTT, used e.g. for matrix-to-vector polynomial multiplication. Our implementations target the Arm Cortex-A72 CPU, on which our speed is 1.7× that of the state-of-the-art matrix-to-vector polynomial multiplication in kyber768 [Nguyen–Gaj 2021]. For Saber, NTTs are far superior to Toom–Cook multiplication on the Armv8-A architecture, outrunning the matrix-to-vector polynomial multiplication by 2.0×. On the Apple M1, our matrix-vector products run 2.1× and 1.9× faster for Kyber and Saber respectively.
NTT Multiplication for NTT-unfriendly Rings: New Speed Records for Saber and NTRU on Cortex-M4 and AVX2 📺
In this paper, we show how multiplication for polynomial rings used in the NIST PQC finalists Saber and NTRU can be efficiently implemented using the Number-theoretic transform (NTT). We obtain superior performance compared to the previous state of the art implementations using Toom–Cook multiplication on both NIST’s primary software optimization targets AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: On the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT).For Saber, the performance gain is particularly pronounced. On Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom–Cook multiplication resulting in 22% fewer cycles for Saber encapsulation. For NTRU, the speed-up is less impressive, but still NTT-based multiplication performs better than Toom–Cook for all parameter sets on Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook which results in a 6% cost reduction for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets.As a further illustration, we also include code for AVX2 and Cortex-M4 for the Chinese Association for Cryptologic Research competition award winner LAC (also a NIST round 2 candidate) which outperforms existing code.
Rainbow on Cortex-M4 📺
We present the first Cortex-M4 implementation of the NISTPQC signature finalist Rainbow. We target the Giant Gecko EFM32GG11B which comes with 512 kB of RAM which can easily accommodate the keys of RainbowI.We present fast constant-time bitsliced F16 multiplication allowing multiplication of 32 field elements in 32 clock cycles. Additionally, we introduce a new way of computing the public map P in the verification procedure allowing vastly faster signature verification.Both the signing and verification procedures of our implementation are by far the fastest among the NISTPQC signature finalists. Signing of rainbowIclassic requires roughly 957 000 clock cycles which is 4× faster than the state of the art Dilithium2 implementation and 45× faster than Falcon-512. Verification needs about 239 000 cycles which is 5× and 2× faster respectively. The cost of signing can be further decreased by 20% when storing the secret key in a bitsliced representation.
Single-Trace Attacks on Keccak 📺
Since its selection as the winner of the SHA-3 competition, Keccak, with all its variants, has found a large number of applications. It is, for instance, a common building block in schemes submitted to NIST’s post-quantum cryptography project. In many of these applications, Keccak processes ephemeral secrets. In such a setting, side-channel adversaries are limited to a single observation, meaning that differential attacks are inherently prevented. If, however, such a single trace of Keccak can already be sufficient for key recovery has so far been unknown. In this paper, we change the above by presenting the first single-trace attack targeting Keccak. Our method is based on soft-analytical side-channel attacks and, thus, combines template matching with message passing in a graphical model of the attacked algorithm. As a straight-forward model of Keccak does not yield satisfactory results, we describe several optimizations for the modeling and the message-passing algorithm. Their combination allows attaining high attack performance in terms of both success rate as well as computational runtime. We evaluate our attack assuming generic software (microcontroller) targets and thus use simulations in the generic noisy Hamming-weight leakage model. Hence, we assume relatively modest profiling capabilities of the adversary. Nonetheless, the attack can reliably recover secrets in a large number of evaluated scenarios at realistic noise levels. Consequently, we demonstrate the need for countermeasures even in settings where DPA is not a threat.
Compact Dilithium Implementations on Cortex-M3 and Cortex-M4 📺
We present implementations of the lattice-based digital signature scheme Dilithium for ARM Cortex-M3 and ARM Cortex-M4. Dilithium is one of the three signature finalists of the NIST post-quantum cryptography competition. As our Cortex-M4 target, we use the popular STM32F407-DISCOVERY development board. Compared to the previous speed records on the Cortex-M4 by Ravi, Gupta, Chattopadhyay, and Bhasin we speed up the key operations NTT and NTT−1 by 20% which together with other optimizations results in speedups of 7%, 15%, and 9% for Dilithium3 key generation, signing, and verification respectively. We also present the first constant-time Dilithium implementation on the Cortex-M3 and use the Arduino Due for benchmarks. For Dilithium3, we achieve on average 2 562 kilocycles for key generation, 10 667 kilocycles for signing, and 2 321 kilocycles for verification.Additionally, we present stack consumption optimizations applying to both our Cortex- M3 and Cortex-M4 implementation. Due to the iterative nature of the Dilithium signing algorithm, there is no optimal way to achieve the best speed and lowest stack consumption at the same time. We present three different strategies for the signing procedure which allow trading more stack and flash memory for faster speed or viceversa. Our implementation of Dilithium3 with the smallest memory footprint uses less than 12kB. As an additional output of this work, we present the first Cortex-M3 implementations of the key-encapsulation schemes NewHope and Kyber.