High-Speed Masking for Polynomial Comparison in Lattice-based KEMs 📺
With the NIST post-quantum standardization competition entering the second round, the interest in practical implementation results of the remaining NIST candidates is steadily growing. Especially implementations on embedded devices are often not protected against side-channel attacks, such as differential power analysis. In this regard, the application of countermeasures against side-channel attacks to candidates of the NIST standardization process is still an understudied topic. Our work aims to contribute to the NIST competition by enabling a more realistic judgment of the overhead cost introduced by side-channel countermeasures that are applied to lattice-based KEMs that achieve CCA-security based on the Fujisaki-Okamoto transform. We present a novel higher-order masking scheme that enables an efficient comparison of polynomials as previous techniques based on arithmetic-to-Boolean conversions renders this (generally inexpensive) component extremely expensive in the masked case. Our approach has linear complexity in the number of shares compared to quadratic complexity of previous contributions and it applies to lattice based schemes with prime modulus. It comes with a proof in the probing model and an efficient implementation on an ARM Cortex-M4F microcontroller which was defined as a preferred evaluation platform for embedded implementations by NIST. Our algorithm can be executed in only 1.5-2.2 milliseconds on the target platform (depending on the masking order) and is therefore well suited even for lightweight applications. While in previous work, practical side-channel experiments were conducted using only 5,000 - 100,000 power traces, we confirm the absence of first-order leakage in this work by collecting 1 million power traces and applying the t-test methodology.
Efficiently Masking Binomial Sampling at Arbitrary Orders for Lattice-Based Crypto
With the rising popularity of lattice-based cryptography, the Learning with Errors (LWE) problem has emerged as a fundamental core of numerous encryption and key exchange schemes. Many LWE-based schemes have in common that they require sampling from a discrete Gaussian distribution which comes with a number of challenges for the practical instantiation of those schemes. One of these is the inclusion of countermeasures against a physical side-channel adversary. While several works discuss the protection of samplers against timing leaks, only few publications explore resistance against other side-channels, e.g., power. The most recent example of a protected binomial sampler (as used in key encapsulation mechanisms to sufficiently approximate Gaussian distributions) from CHES 2018 is restricted to a first-order adversary and cannot be easily extended to higher protection orders.In this work, we present the first protected binomial sampler which provides provable security against a side-channel adversary at arbitrary orders. Our construction relies on a new conversion between Boolean and arithmetic (B2A) masking schemes for prime moduli which outperforms previous algorithms significantly for the relevant parameters, and is paired with a new masked bitsliced sampler allowing secure and efficient sampling even at larger protection orders. Since our proposed solution supports arbitrary moduli, it can be utilized in a large variety of lattice-based constructions, like NewHope, LIMA, Saber, Kyber, HILA5, or Ding Key Exchange.
Practical CCA2-Secure and Masked Ring-LWE Implementation 📺
During the last years public-key encryption schemes based on the hardness of ring-LWE have gained significant popularity. For real-world security applications assuming strong adversary models, a number of practical issues still need to be addressed. In this work we thus present an instance of ring-LWE encryption that is protected against active attacks (i.e., adaptive chosen-ciphertext attacks) and equipped with countermeasures against side-channel analysis. Our solution is based on a postquantum variant of the Fujisaki-Okamoto (FO) transform combined with provably secure first-order masking. To protect the key and message during decryption, we developed a masked binomial sampler that secures the re-encryption process required by FO. Our work shows that CCA2-secured RLWE-based encryption can be achieved with reasonable performance on constrained devices but also stresses that the required transformation and handling of decryption errors implies a performance overhead that has been overlooked by the community so far. With parameters providing 233 bits of quantum security, our implementation requires 4,176,684 cycles for encryption and 25,640,380 cycles for decryption with masking and hiding countermeasures on a Cortex-M4F. The first-order security of our masked implementation is also practically verified using the non-specific t-test evaluation methodology.
Standard Lattice-Based Key Encapsulation on Embedded Devices
Lattice-based cryptography is one of the most promising candidates being considered to replace current public-key systems in the era of quantum computing. In 2016, Bos et al. proposed the key exchange scheme FrodoCCS, that is also a submission to the NIST post-quantum standardization process, modified as a key encapsulation mechanism (FrodoKEM). The security of the scheme is based on standard lattices and the learning with errors problem. Due to the large parameters, standard latticebased schemes have long been considered impractical on embedded devices. The FrodoKEM proposal actually comes with parameters that bring standard lattice-based cryptography within reach of being feasible on constrained devices. In this work, we take the final step of efficiently implementing the scheme on a low-cost FPGA and microcontroller devices and thus making conservative post-quantum cryptography practical on small devices. Our FPGA implementation of the decapsulation (the computationally most expensive operation) needs 7,220 look-up tables (LUTs), 3,549 flip-flops (FFs), a single DSP, and only 16 block RAM modules. The maximum clock frequency is 162 MHz and it takes 20.7 ms for the execution of the decapsulation. Our microcontroller implementation has a 66% reduced peak stack usage in comparison to the reference implementation and needs 266 ms for key pair generation, 284 ms for encapsulation, and 286 ms for decapsulation. Our results contribute to the practical evaluation of a post-quantum standardization candidate.