Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4
The U.S. National Institute of Standards and Technology (NIST) has designated ARM microcontrollers as an important benchmarking platform for its Post-Quantum Cryptography standardization process (NISTPQC). In view of this, we explore the design space of the NISTPQC finalist Saber on the Cortex-M4 and its close relation, the Cortex-M3. In the process, we investigate various optimization strategies and memory-time tradeoffs for number-theoretic transforms (NTTs).Recent work by [Chung et al., TCHES 2021 (2)] has shown that NTT multiplication is superior compared to Toom–Cook multiplication for unprotected Saber implementations on the Cortex-M4 in terms of speed. However, it remains unclear if NTT multiplication can outperform Toom–Cook in masked implementations of Saber. Additionally, it is an open question if Saber with NTTs can outperform Toom–Cook in terms of stack usage. We answer both questions in the affirmative. Additionally, we present a Cortex-M3 implementation of Saber using NTTs outperforming an existing Toom–Cook implementation. Our stack-optimized unprotected M4 implementation uses around the same amount of stack as the most stack-optimized Toom–Cook implementation while being 33%-41% faster. Our speed-optimized masked M4 implementation is 16% faster than the fastest masked implementation using Toom–Cook. For the Cortex-M3, we outperform existing implementations by 29%-35% in speed. We conclude that for both stack- and speed-optimization purposes, one should base polynomial multiplications in Saber on the NTT rather than Toom–Cook for the Cortex-M4 and Cortex-M3. In particular, in many cases, multi-moduli NTTs perform best.
Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1
We present new speed records on the Armv8-A architecture for the latticebased schemes Dilithium, Kyber, and Saber. The core novelty in this paper is the combination of Montgomery multiplication and Barrett reduction resulting in “Barrett multiplication” which allows particularly efficient modular one-known-factor multiplication using the Armv8-A Neon vector instructions. These novel techniques combined with fast two-unknown-factor Montgomery multiplication, Barrett reduction sequences, and interleaved multi-stage butterflies result in significantly faster code. We also introduce “asymmetric multiplication” which is an improved technique for caching the results of the incomplete NTT, used e.g. for matrix-to-vector polynomial multiplication. Our implementations target the Arm Cortex-A72 CPU, on which our speed is 1.7× that of the state-of-the-art matrix-to-vector polynomial multiplication in kyber768 [Nguyen–Gaj 2021]. For Saber, NTTs are far superior to Toom–Cook multiplication on the Armv8-A architecture, outrunning the matrix-to-vector polynomial multiplication by 2.0×. On the Apple M1, our matrix-vector products run 2.1× and 1.9× faster for Kyber and Saber respectively.
Multi-Parameter Support with NTTs for NTRU and NTRU Prime on Cortex-M4
We propose NTT implementations with each supporting at least one parameter of NTRU and one parameter of NTRU Prime. Our implementations are based on size-1440, size-1536, and size-1728 convolutions without algebraic assumptions on the target polynomial rings. We also propose several improvements for the NTT computation. Firstly, we introduce dedicated radix-(2, 3) butterflies combining Good–Thomas FFT and vector-radix FFT. In general, there are six dedicated radix-(2, 3) butterflies and they together support implicit permutations. Secondly, for odd prime radices, we show that the multiplications for one output can be replaced with additions/subtractions. We demonstrate the idea for radix-3 and show how to extend it to any odd prime. Our improvement also applies to radix-(2, 3) butterflies. Thirdly, we implement an incomplete version of Good–Thomas FFT for addressing potential code size issues. For NTRU, our polynomial multiplications outperform the state-of-the-art by 2.8%−10.3%. For NTRU Prime, our polynomial multiplications are slower than the state-of-the-art. However, the SotA exploits the specific structure of coefficient rings or polynomial moduli, while our NTT-based multiplications exploit neither and apply across different schemes. This reduces the engineering effort, including testing and verification.
Verified NTT Multiplications for NISTPQC KEM Lattice Finalists: Kyber, SABER, and NTRU
Postquantum cryptography requires a different set of arithmetic routines from traditional public-key cryptography such as elliptic curves. In particular, in each of the lattice-based NISTPQC Key Establishment finalists, every state-ofthe-art optimized implementation for lattice-based schemes still in the NISTPQC round 3 currently uses a different complex multiplication based on the Number Theoretic Transform. We verify the NTT-based multiplications used in NTRU, Kyber, and SABER for both the AVX2 implementation for Intel CPUs and for the pqm4 implementation for the ARM Cortex M4 using the tool CryptoLine. e extended CryptoLine and as a result are able to verify that in six instances multiplications are correct including range properties.We demonstrate the feasibility for a programmer to verify his or her high-speed assembly code for PQC, as well as to verify someone else’s high-speed PQC software in assembly code, with some cooperation from the programmer.
NTT Multiplication for NTT-unfriendly Rings: New Speed Records for Saber and NTRU on Cortex-M4 and AVX2 📺
In this paper, we show how multiplication for polynomial rings used in the NIST PQC finalists Saber and NTRU can be efficiently implemented using the Number-theoretic transform (NTT). We obtain superior performance compared to the previous state of the art implementations using Toom–Cook multiplication on both NIST’s primary software optimization targets AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: On the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT).For Saber, the performance gain is particularly pronounced. On Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom–Cook multiplication resulting in 22% fewer cycles for Saber encapsulation. For NTRU, the speed-up is less impressive, but still NTT-based multiplication performs better than Toom–Cook for all parameter sets on Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook which results in a 6% cost reduction for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets.As a further illustration, we also include code for AVX2 and Cortex-M4 for the Chinese Association for Cryptologic Research competition award winner LAC (also a NIST round 2 candidate) which outperforms existing code.
Polynomial Multiplication in NTRU Prime: Comparison of Optimization Strategies on Cortex-M4 📺
This paper proposes two different methods to perform NTT-based polynomial multiplication in polynomial rings that do not naturally support such a multiplication. We demonstrate these methods on the NTRU Prime key-encapsulation mechanism (KEM) proposed by Bernstein, Chuengsatiansup, Lange, and Vredendaal, which uses a polynomial ring that is, by design, not amenable to use with NTT. One of our approaches is using Good’s trick and focuses on speed and supporting more than one parameter set with a single implementation. The other approach is using a mixed radix NTT and focuses on the use of smaller multipliers and less memory. On a ARM Cortex-M4 microcontroller, we show that our three NTT-based implementations, one based on Good’s trick and two mixed radix NTTs, provide between 32% and 17% faster polynomial multiplication. For the parameter-set ntrulpr761, this results in between 16% and 9% faster total operations (sum of key generation, encapsulation, and decapsulation) and requires between 15% and 39% less memory than the current state-of-the-art NTRU Prime implementation on this platform, which is using Toom-Cook-based polynomial multiplication.
- Amin Abdulrahman (1)
- Erdem Alkim (2)
- Hanno Becker (1)
- Jiun-Peng Chen (1)
- Yu-Jia Chen (1)
- Dean Yun-Li Cheng (1)
- Chi-Ming Marvin Chung (2)
- Hülya Evkan (1)
- Leo Wei-Lun Huang (1)
- Matthias J. Kannwischer (3)
- Ching-Lin Trista Li (1)
- Jiaxiang Liu (1)
- Ruben Niederhagen (1)
- Gregor Seiler (2)
- Xiaomu Shi (1)
- Cheng-Jhih Shih (2)
- Ming-Hsien Tsai (1)
- Julian Wälde (1)
- Bow-Yaw Wang (1)
- Shang-Yi Yang (1)
- Bo-Yin Yang (6)