International Association for Cryptologic Research

International Association
for Cryptologic Research

CryptoDB

Papers from Transactions on Cryptographic Hardware and Embedded Systems 2022

Year
Venue
Title
2022
TCHES
A Compact and High-Performance Hardware Architecture for CRYSTALS-Dilithium
The lattice-based CRYSTALS-Dilithium scheme is one of the three thirdround digital signature finalists in the National Institute of Standards and Technology Post-Quantum Cryptography Standardization Process. Due to the complex calculations and highly individualized functions in Dilithium, its hardware implementations face the problems of large area requirements and low efficiency. This paper proposes several optimization methods to achieve a compact and high-performance hardware architecture for round 3 Dilithium. Specifically, a segmented pipelined processing method is proposed to reduce both the storage requirements and the processing time. Moreover, several optimized modules are designed to improve the efficiency of the proposed architecture, including a pipelined number theoretic transform module, a SampleInBall module, a Decompose module, and three modular reduction modules. Compared with state-of-the-art designs for Dilithium on similar platforms, our implementation requires 1.4×/1.4×/3.0×/4.5× fewer LUTs/FFs/BRAMs/DSPs, respectively, and 4.4×/1.7×/1.4× less time for key generation, signature generation, and signature verification, respectively, for NIST security level 5.
2022
TCHES
A Constant-time AVX2 Implementation of a Variant of ROLLO
This paper introduces a key encapsulation mechanism ROLLO+ and presents a constant-time AVX2 implementation of it. ROLLO+ is a variant of ROLLO-I targeting IND-CPA security. The main difference between ROLLO+ and ROLLO-I is that the decoding algorithm of ROLLO+ is adapted from the decoding algorithm of ROLLO-I. Our implementation of ROLLO+-I-128, one of the level-1 parameter sets of ROLLO+, takes 851823 Skylake cycles for key generation, 30361 Skylake cycles for encapsulation, and 673666 Skylake cycles for decapsulation. Compared to the state-of-the-art implementation of ROLLO-I-128 by Aguilar-Melchor et al., which is claimed to be constant-time but actually is not, our implementation achieves a 12.9x speedup for key generation, a 10.6x speedup for encapsulation, and a 14.5x speedup for decapsulation. Compared to the state-of-the-art implementation of the level-1 parameter set of BIKE by Chen, Chou, and Krausz, our key generation time is 1.4x as slow, but our encapsulation time is 3.8x as fast, and our decapsulation time is 2.4x as fast.
2022
TCHES
A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion
The extended GCD (XGCD) calculation, which computes Bézout coefficients ba, bb such that ba ∗ a0 + bb ∗ b0 = GCD(a0, b0), is a critical operation in many cryptographic applications. In particular, large-integer XGCD is computationally dominant for two applications of increasing interest: verifiable delay functions that square binary quadratic forms within a class group and constant-time modular inversion for elliptic curve cryptography. Most prior work has focused on fast software implementations. The few works investigating hardware acceleration build on variants of Euclid’s division-based algorithm, following the approach used in optimized software. We show that adopting variants of Stein’s subtraction-based algorithm instead leads to significantly faster hardware. We quantify this advantage by performing a large-integer XGCD accelerator design space exploration comparing Euclid- and Stein-based algorithms for various application requirements. This exploration leads us to an XGCD hardware accelerator that is flexible and efficient, supports fast average and constant-time evaluation, and is easily extensible for polynomial GCD. Our 16nm ASIC design calculates 1024-bit XGCD in 294ns (8x faster than the state-of-the-art ASIC) and constant-time 255-bit XGCD for inverses in the field of integers modulo the prime 2255−19 in 85ns (31× faster than state-of-the-art software). We believe our design is the first high-performance ASIC for the XGCD computation that is also capable of constant-time evaluation. Our work is publicly available at https://github.com/kavyasreedhar/sreedhar-xgcd-hardware-ches2022.
2022
TCHES
A Faster Third-Order Masking of Lookup Tables
Masking of S-boxes using lookup tables is an effective countermeasure to thwart side-channel attacks on block ciphers implemented in software. At first and second orders, the Table-based Masking (TBM) schemes can be very efficient and even faster than circuit-based masking schemes. Ever since the customised second-order TBM schemes were proposed, the focus has been on designing and optimising Higher-Order Table-based Masking (HO-TBM) schemes that facilitate masking at arbitrary order. One of the reasons for this trend is that at large orders HO-TBM schemes are significantly slower and consume a prohibitive amount of RAM memory compared to circuit-based masking schemes such as bit-sliced masking, and hence efforts were targeted in this direction. However, a recent work due to Valiveti and Vivek (TCHES 2021) has demonstrated that the HO-TBM scheme of Coron et al. (TCHES 2018) is feasible to be implemented on memory-constrained devices with pre-processing capability and a competitive online execution time. Yet, currently, there are no customised designs for third-order TBM that are more efficient than instantiating a HO-TBM scheme at third order.In this work, we propose a third-order TBM scheme for arbitrary S-boxes that is secure in the probing model and under compositions, i.e., 3-SNI secure. It is very efficient in terms of the overall running time, compared to the third-order instantiations of state-of-the-art HO-TBM schemes. It also supports the pre-processing functionality. For example, the overall running time of a single execution of the third-order masked AES-128 on a 32-bit ARM-Cortex M4 micro-controller is reduced by about 80% without any overhead on the online execution time. This implies that the online execution time of the proposed scheme is approximately eight times faster than the bit-sliced masked implementation at third order, and it is comparable to the recent scheme of Wang et al. (TCHES 2022) that makes use of reuse of shares. We also present the implementation results for the third-order masked PRESENT cipher. Our work suggests that there is a significant scope for tuning the performance of HO-TBM schemes at lower orders.
2022
TCHES
A Finer-Grain Analysis of the Leakage (Non) Resilience of OCB
OCB3 is one of the winners of the CAESAR competition and is among the most popular authenticated encryption schemes. In this paper, we put forward a fine-grain study of its security against side-channel attacks. We start from trivial key recoveries in settings where the mode can be attacked with standard Differential Power Analysis (DPA) against some block cipher calls in its execution (namely, initialization, processing of associated data or last incomplete block and decryption). These attacks imply that at least these parts must be strongly protected thanks to countermeasures like masking. We next show that if these block cipher calls of the mode are protected, practical attacks on the remaining block cipher calls remain possible. A first option is to mount a DPA with unknown inputs. A more efficient option is to mount a DPA that exploits horizontal relations between consecutive input whitening values. It allows trading a significantly reduced data complexity for a higher key guessing complexity and turns out to be the best attack vector in practical experiments performed against an implementation of OCB3 in an ARM Cortex-M0. Eventually, we consider an implementation where all the block cipher calls are protected. We first show that exploiting the leakage of the whitening values requires mounting a Simple Power Analysis (SPA) against linear operations. We then show that despite being more challenging than when applied to non-linear operations, such an SPA remains feasible against 8-bit implementations, leaving its generalization to larger implementations as an interesting open problem. We last describe how recovering the whitening values can lead to strong attacks against the confidentiality and integrity of OCB3. Thanks to this comprehensive analysis, we draw concrete requirements for side-channel resistant implementations of OCB3.
2022
TCHES
A Key-Recovery Side-Channel Attack on Classic McEliece Implementations
In this paper, we propose the first key-recovery side-channel attack on Classic McEliece, a KEM finalist in the NIST Post-quantum Cryptography Standardization Project. Our novel idea is to design an attack algorithm where we submit special ciphertexts to the decryption oracle that correspond to cases of single errors. Decoding of such ciphertexts involves only a single entry in a large secret permutation, which is part of the secret key. Through an identified leakage in the additive FFT step used to evaluate the error locator polynomial, a single entry of the secret permutation can be determined. Iterating this for other entries leads to full secret key recovery. The attack is described using power analysis both on the FPGA reference implementation and a software implementation running on an ARM Cortex-M4. We use a machine-learning-based classification algorithm to determine the error locator polynomial from a single trace. The attack is fully implemented and evaluated in the Chipwhisperer framework and is successful in practice. For the smallest parameter set, it is using about 300 traces for partial key recovery and less than 800 traces for full key recovery, in the FPGA case. A similar number of traces are required for a successful attack on the ARM software implementation.
2022
TCHES
A Power to Pulse Width Modulation Sensor for Remote Power Analysis Attacks
Field-programmable gate arrays (FPGAs) deployed on commercial cloud services are increasingly gaining popularity due to the cost and compute benefits offered by them. Recent studies have discovered security threats than can be launched remotely on FPGAs that share the logic fabric between trusted and untrusted parties, posing a danger to designs deployed on cloud FPGAs. With remote power analysis (RPA) attacks, an attacker aims to deduce secret information present on a remote FPGA by deploying an on-chip sensor on the FPGA logic fabric. Information captured with the on-chip sensor is transferred off the chip for analysis and existing on-chip sensors demand a significant amount of bandwidth for this task as a result of their wider output bit width. However, attackers are often left with the only option of using a covert communication channel and the bandwidth of such channels is generally limited. This paper proposes a novel area-efficient on-chip power sensor named PPWM that integrates a logic design outputting a pulse whose width is modulated by the power consumption of the FPGA. This pulse is used to clear a flip-flop selectively and asynchronously, and the single-bit output of the flip-flop is used to perform an RPA attack. This paper demonstrates the possibility of successfully recovering a 128-bit Advanced Encryption Standard (AES) key within 16,000 power traces while consuming just 25% of the bandwidth when compared to the state of the art. Moreover, this paper assesses the threat posed by the proposed PPWM to remote FPGAs including those that are deployed on cloud services.
2022
TCHES
A Security Model for Randomization-based Protected Caches
Cache side-channel attacks allow adversaries to learn sensitive information about co-running processes by using only access latency measures and cache contention.This vulnerability has been shown to lead to several microarchitectural attacks. As a promising solution, recent work proposes Randomization-based Protected Caches (RPCs). RPCs randomize cache addresses, changing keys periodically so as to avoid long-term leakage. Unfortunately, recent attacks have called the security of state-of-the-art RPCs into question.In this work, we tackle the problem of formally defining and analyzing the security properties of RPCs. We first give security definitions against access-based cache sidechannel attacks that capture security against known attacks such as Prime+Probe and Evict+Probe. Then, using these definitions, we obtain results that allow to guarantee security by adequately choosing the rekeying period, the key generation algorithm and the cache randomizer, thus providing security proofs for RPCs under certain assumptions.
2022
TCHES
ABE Squared: Accurately Benchmarking Efficiency of Attribute-Based Encryption
Measuring efficiency is difficult. In the last decades, several works have contributed in the quest to successfully determine and compare the efficiency of pairing-based attribute-based encryption (ABE) schemes. However, many of these works are limited: they use little to no optimizations, or use underlying pairingfriendly elliptic curves that do not provide sufficient security anymore. Hence, using these works to benchmark ABE schemes does not yield accurate results. Furthermore, most ABE design papers focus on the efficiency of one important aspect. For instance, a new scheme may aim to have a fast decryption algorithm. Upon realizing this goal, the designer compares the new scheme with existing ones, demonstrating its dominance in this particular aspect. Although this approach is intuitive and might seem fair, the way in which this comparison is done might be biased. For instance, the schemes that are compared with the new scheme may be optimized with respect to another aspect, and appear in the comparison consequently inferior.In this work, we present a framework for accurately benchmarking efficiency of ABE: ABE Squared. In particular, we focus on uncovering the multiple layers of optimization that are relevant to the implementation of ABE schemes. Moreover, we focus on making any comparison fairer by considering the influence of the potential design goals on any optimizations. On the lowest layer, we consider the available optimized arithmetic provided by state-of-the-art cryptographic libraries. On the higher layers, we consider the choice of elliptic curve, the order of the computations, and importantly, the instantiation of the scheme on the chosen curves. Additionally, we show that especially the higher-level optimizations are dependent on the goal of the designer, e.g. optimization of the decryption algorithm. To compare schemes more transparently, we develop this framework, in which ABE schemes can be justifiably optimized and compared by taking into account the possible goals of a designer. To meet these goals, we also introduce manual, heuristic type-conversion techniques where existing techniques fall short. Finally, to illustrate the effectiveness of ABE Squared, we implement several schemes and provide all relevant benchmarks. These show that the design goal influences the optimization approaches, which in turn influence the overall efficiency of the implementations. Importantly, these demonstrate that the schemes also compare differently than existing works previously suggested.
2022
TCHES
Adapting Belief Propagation to Counter Shuffling of NTTs
The Number Theoretic Transform (NTT) is a major building block in recently introduced lattice based post-quantum (PQ) cryptography. The NTT was target of a number of recently proposed Belief Propagation (BP)-based Side Channel Attacks (SCAs). Ravi et al. have recently proposed a number of countermeasures mitigating these attacks.In 2021, Hamburg et al. presented a chosen-ciphertext enabled SCA improving noise-resistance, which we use as a starting point to state our findings. We introduce a pre-processing step as well as a new factor node which we call shuffle node. Shuffle nodes allow for a modified version of BP when included into a factor graph. The node iteratively learns the shuffling permutation of fine shuffling within a BP run.We further expand our attacker model and describe several matching algorithms to find inter-layer connections based on shuffled measurements. Our matching algorithm allows for either mixing prior distributions according to a doubly stochastic mix matrix or to extract permutations and perform an exact un-matching of layers. We additionally discuss the usage of sub-graph inference to reduce uncertainty and improve un-shuffling of butterflies.Based on our results, we conclude that the proposed countermeasures of Ravi et al. are powerful and counter Hamburg et al., yet could lead to a false security perception – a powerful adversary could still launch successful attacks. We discuss on the capabilities needed to defeat shuffling in the setting of Hamburg et al. using our expanded attacker model.Our methods are not limited to the presented case but provide a toolkit to analyze and evaluate shuffling countermeasures in BP-based attack scenarios.
2022
TCHES
An energy and area efficient, all digital entropy source compatible with modern standards based on jitter pipelining
This paper proposes an energy and area efficient entropy source, suitable for true random number generation, accompanied with a stochastic model in a 28nm CMOS technology. The design uses a jitter pipelining architecture together with an increased timing resolution to achieve a maximal throughput of 298 Mbit/s and a best energy efficiency of 1.46 pJ/bit at a supply of 0.8V. The generated random bits pass the NIST SP 800-90B IID tests with a min entropy rate of 0.933 bit/bit, which is more than required by the AIS-31 standard. The all digital design allows for effortless transfer to other technology nodes, taking advantage of all benefits related to further technology scaling.
2022
TCHES
Attacks Against White-Box ECDSA and Discussion of Countermeasures: A Report on the WhibOx Contest 2021
This paper deals with white-box implementations of the Elliptic Curve Digital Signature Algorithm (ECDSA): First, we consider attack paths to break such implementations. In particular, we provide a systematic overview of various fault attacks, to which ECDSA white-box implementations are especially susceptible. Then, we propose different mathematical countermeasures, mainly based on masking/blinding of sensitive variables, in order to prevent or at least make such attacks more difficult. We also briefly mention some typical implementational countermeasures and their challenges in the ECDSA white-box scenario. Our work has been initiated by the CHES challenge WhibOx Contest 2021, which consisted of designing and breaking white-box ECDSA implementations, so called challenges. We illustrate our results and findings by means of the submitted challenges and provide a comprehensive overview which challenge could be solved in which way. Furthermore, we analyze selected challenges in more details.
2022
TCHES
Automated Generation of Masked Hardware
Masking has been recognized as a sound and secure countermeasure for cryptographic implementations, protecting against physical side-channel attacks. Even though many different masking schemes have been presented over time, design and implementation of protected cryptographic Integrated Circuits (ICs) remains a challenging task. More specifically, correct and efficient implementation usually requires manual interactions accompanied by longstanding experience in hardware design and physical security. To this end, design and implementation of masked hardware often proves to be an error-prone task for engineers and practitioners. As a result, our novel tool for automated generation of masked hardware (AGEMA) allows even inexperienced engineers and hardware designers to create secure and efficient masked cryptograhic circuits originating from an unprotected design. More precisely, exploiting the concepts of Probe-Isolating Non-Interference (PINI) for secure composition of masked circuits, our tool provides various processing techniques to transform an unprotected design into a secure one, eventually accelerating and safeguarding the process of masking cryptographic hardware. Ultimately, we evaluate our tool in several case studies, emphasizing different trade-offs for the transformation techniques with respect to common performance metrics, such as latency, area, andrandomness.
2022
TCHES
BAT: Small and Fast KEM over NTRU Lattices
We present BAT – an IND-CCA secure key encapsulation mechanism (KEM) that is based on NTRU but follows an encryption/decryption paradigm distinct from classical NTRU KEMs. It demonstrates a new approach of decrypting NTRU ciphertext since its introduction 25 years ago. Instead of introducing an artificial masking parameter p to decrypt the ciphertext, we use 2 linear equations in 2 unknowns to recover the message and the error. The encryption process is therefore close to the GGH scheme. However, since the secret key is now a short basis (not a vector), we need to modify the decryption algorithm and we present a new NTRU decoder. Thanks to the improved decoder, our scheme works with a smaller modulus and yields shorter ciphertexts, smaller than RSA-4096 for 128-bit classical security with comparable public-key size and much faster than RSA or even ECC. Meanwhile, the encryption and decryption are still simple and fast in spite of the complicated key generation. Overall, our KEM has more compact parameters than all current lattice-based schemes and a practical efficiency. Moreover, due to the similar key pair structure, BAT can be of special interest in some applications using Falcon signature that is also the most compact signature in the round 3 of the NIST post-quantum cryptography standardization. However, different from Falcon, our KEM does not rely on floating-point arithmetic and can be fully implemented over the integers.
2022
TCHES
Beware of Insufficient Redundancy: An Experimental Evaluation of Code-based FI Countermeasures
Fault injection attacks pose a serious threat to cryptographic implementations. Countermeasures beyond sensors and shields usually deploy some form of redundancy to detect or even correct errors. A few years ago, a novel design methodology called Impeccable Circuits has been introduced on how to correctly integrate Concurrent Error Detection (CED) schemes, based on Error-Detection Codes (EDCs), into cryptographic hardware circuits. The underlying adversary model limits attackers to inject at most t single-bit faults. By additionally considering the propagation of faults in combinational circuits, the countermeasure guarantees detection of any faulty computation caused by up to t single-bit faults.In this work, we present an experimental analysis of the Impeccable Circuits countermeasure and its underlying assumptions in modern semiconductor technology. More precisely, we have taken hardware implementations of the lightweight block cipher SKINNY equipped with various forms of the EDC-based CED schemes and realized them as cryptographic co-processors on a 40nm ASIC to experimentally evaluate their resistance to Laser Fault Injection (LFI) attacks. In short, our results show that it is fairly simple to overcome the protection offered by the integrated countermeasures when the length of the code n is smaller than twice its rank k (i.e., no full redundancy). This is not caused by any flaw in the underlying design methodology or concept, but merely demonstrates how easily the defined adversary model can be overcome. In our case, a standard black-box scan over the target using a common single-shot LFI setup is sufficient to occasionally inject more single-bit faults than those bounded by the underlying adversary model when n < 2k. The probability of such events proved to be large enough to perform successful key-recovery attacks via Differential Fault Analysis (DFA) in a matter of hours. Thus, we caution against limiting the redundancy in code-based FI countermeasures to less than the number of bits per word, especially in nanometer technologies, and point out that less-complex countermeasures like duplication showed a higher level of resistance in our experiments at a lower cost.
2022
TCHES
BipBip: A Low-Latency Tweakable Block Cipher with Small Dimensions
Recently, a memory safety concept called Cryptographic Capability Computing (C3) has been proposed. C3 is the first memory safety mechanism that works without requiring extra storage for metadata and hence, has the potential to significantly enhance the security of modern IT-systems at a rather low cost. To achieve this, C3 heavily relies on ultra-low-latency cryptographic primitives. However, the most crucial primitive required by C3 demands uncommon dimensions. To partially encrypt 64-bit pointers, a 24-bit tweakable block cipher with a 40-bit tweak is needed. The research on low-latency tweakable block ciphers with such small dimensions is not very mature. Therefore, designing such a cipher provides a great research challenge, which we take on with this paper. As a result, we present BipBip, a 24-bit tweakable block cipher with a 40-bit tweak that allows for ASIC implementations with a latency of 3 cycles at a 4.5 GHz clock frequency on a modern 10 nm CMOS technology.
2022
TCHES
Bitslice Masking and Improved Shuffling:: How and When to Mix Them in Software?
We revisit the popular adage that side-channel countermeasures must be combined to be efficient, and study its application to bitslice masking and shuffling. Our main contributions are twofold. First, we improve this combination: by shuffling the shares of a masked implementation rather than its tuples, we can amplify the impact of the shuffling exponentially in the number of shares, while this impact was independent of the masking security order in previous works. Second, we evaluate the masking and shuffling combination’s performance vs. security tradeoff under sufficient noise conditions: we show that the best approach is to mask first (i.e., fill the registers with as many shares as possible) and shuffle the independent operations that remain. We conclude that with moderate but sufficient noise, the “bitslice masking + shuffling” combination of countermeasures is practically relevant, and its interest increases when randomness is expensive and many independent operations are available for shuffling. When these conditions are not met, masking only is the best option. As additional side results, we improve the best known attack against the shuffling countermeasure from ASIACRYPT 2012. We also recall that algorithmic countermeasures like masking and shuffling, and therefore their combination, cannot be implemented securely without a minimum level of physical noise.
2022
TCHES
Bitslicing Arithmetic/Boolean Masking Conversions for Fun and Profit: with Application to Lattice-Based KEMs
The performance of higher-order masked implementations of lattice-based based key encapsulation mechanisms (KEM) is currently limited by the costly conversions between arithmetic and Boolean masking. While bitslicing has been shown to strongly speed up masked implementations of symmetric primitives, its use in arithmetic-to-Boolean and Boolean-to-arithmetic masking conversion gadgets has never been thoroughly investigated. In this paper, we first show that bitslicing can indeed accelerate existing conversion gadgets. We then optimize these gadgets, exploiting the degrees of freedom offered by bitsliced implementations. As a result, we introduce new arbitrary-order Boolean masked addition, arithmetic-to-Boolean and Boolean-to-arithmetic masking conversion gadgets, each in two variants: modulo 2k and modulo p (for any integers k and p). Practically, our new gadgets achieve a speedup of up to 25x over the state of the art. Turning to the KEM application, we develop the first open-source embedded (Cortex-M4) implementations of Kyber768 and Saber masked at arbitrary order. The implementations based on the new bitsliced gadgets achieve a speedup of 1.8x for Kyber and 3x for Saber, compared to the implementation based on state-of-the-art gadgets. The bottleneck of the bitslice implementations is the masked Keccak-f[1600] permutation.
2022
TCHES
Breaking Masked Implementations of the Clyde-Cipher by Means of Side-Channel Analysis: A Report on the CHES Challenge Side-Channel Contest 2020
In this paper we present our solution to the CHES Challenge 2020, the task of which it was to break masked hardware respective software implementations of the lightweight cipher Clyde by means of side-channel analysis. We target the secret cipher state after processing of the first S-box layer. Using the provided trace data we obtain a strongly biased posterior distribution for the secret-shared cipher state at the targeted point; this enables us to see exploitable biases even before the secret sharing based masking. These biases on the unshared state can be evaluated one S-box at a time and combined across traces, which enables us to recover likely key hypotheses S-box by S-box.In order to see the shared cipher state, we employ a deep neural network similar to the one used by Gohr, Jacob and Schindler to solve the CHES 2018 AES challenge. We modify their architecture to predict the exact bit sequence of the secret-shared cipher state. We find that convergence of training on this task is unsatisfying with the standard encoding of the shared cipher state and therefore introduce a different encoding of the prediction target, which we call the scattershot encoding. In order to further investigate how exactly the scattershot encoding helps to solve the task at hand, we construct a simple synthetic task where convergence problems very similar to those we observed in our side-channel task appear with the naive target data encoding but disappear with the scattershot encoding.We complete our analysis by showing results that we obtained with a “classical” method (as opposed to an AI-based method), namely the stochastic approach, thatwe generalize for this purpose first to the setting of shared keys. We show that the neural network draws on a much broader set of features, which may partially explain why the neural-network based approach massively outperforms the stochastic approach. On the other hand, the stochastic approach provides insights into properties of the implementation, in particular the observation that the S-boxes behave very different regarding the easiness respective hardness of their prediction.
2022
TCHES
Bypassing Isolated Execution on RISC-V using Side-Channel-Assisted Fault-Injection and Its Countermeasure
RISC-V is equipped with physical memory protection (PMP) to prevent malicious software from accessing protected memory regions. PMP provides a trusted execution environment (TEE) that isolates secure and insecure applications. In this study, we propose a side-channel-assisted fault-injection attack to bypass isolation based on PMP. The proposed attack scheme involves extracting successful glitch parameters for fault injection from side-channel information under crossdevice conditions. A proof-of-concept TEE compatible with PMP in RISC-V was implemented, and the feasibility and effectiveness of the proposed attack scheme was validated through experiments in TEEs. The results indicate that an attacker can bypass the isolation of the TEE and read data from the protected memory region In addition, we experimentally demonstrate that the proposed attack applies to a real-world TEE, Keystone. Furthermore, we propose a software-based countermeasure that prevents the proposed attack.
2022
TCHES
Can’t Touch This: Inertial HSMs Thwart Advanced Physical Attacks
In this paper, we introduce a novel countermeasure against physical attacks: Inertial Hardware Security Modules (IHSMs). Conventional systems have in common that their security requires the crafting of fine sensor structures that respond to minute manipulations of the monitored security boundary or volume. Our approach is novel in that we reduce the sensitivity requirement of security meshes and other sensors and increase the complexity of any manipulations by rotating the security mesh or sensor at high speed—thereby presenting a moving target to an attacker. Attempts to stop the rotation are easily monitored with commercial MEMS accelerometers and gyroscopes. Our approach leads to an HSM that can easily be built from off-the-shelf parts by any university electronics lab, yet offers a level of security that is comparable to commercial HSMs. We have built a proof-of-concept hardware prototype that demonstrates solutions to the concept’s main engineering challenges. As part of this proof-of-concept, we have found that a system using a coarse security mesh made from commercial printed circuit boards and an automotive high-g-force accelerometer already provides a useful level of security.
2022
TCHES
CFNTT: Scalable Radix-2/4 NTT Multiplication Architecture with an Efficient Conflict-free Memory Mapping Scheme
Number theoretic transform (NTT) is widely utilized to speed up polynomial multiplication, which is the critical computation bottleneck in a lot of cryptographic algorithms like lattice-based post-quantum cryptography (PQC) and homomorphic encryption (HE). One of the tendency for NTT hardware architecture is to support diverse security parameters and meet resource constraints on different computing platforms. Thus flexibility and Area-Time Product (ATP) become two crucial metrics in NTT hardware design. The flexibility of NTT in terms of different vector sizes and moduli can be obtained directly. Whereas the varying strides in memory access of in-place NTT render the design for different radix and number of parallel butterfly units a tough problem. This paper proposes an efficient conflict-free memory mapping scheme that supports the configuration for both multiple parallel butterfly units and arbitrary radix of NTT. Compared to other approaches, this scheme owns broader applicability and facilitates the parallelization of non-radix-2 NTT hardware design. Based on this scheme, we propose a scalable radix-2 and radix-4 NTT multiplication architecture by algorithm-hardware co-design. A dedicated schedule method is leveraged to reduce the number of modular additions/subtractions and modular multiplications in radix-4 butterfly unit by 20% and 33%, respectively. To avoid the bit-reversed cost and save memory footprint in arbitrary radix NTT/INTT, we put forward a general method by rearranging the loop structure and reusing the twiddle factors. The hardware-level optimization is achieved by excavating the symmetric operators in radix-4 butterfly unit, which saves almost 50% hardware resources compared to a straightforward implementation. Through experimental results and theoretical analysis, we point out that the radix-4 NTT with the same number of parallel butterfly units outperforms the radix-2 NTT in terms of area-time performance in the interleaved memory system. This advantage is enlarged when increasing the number of parallel butterfly units. For example, when processing 1024 14-bit points NTT with 8 parallel butterfly units, the ATP of LUT/FF/DSP/BRAM n radix-4 NTT core is approximately 2.2 × /1.2 × /1.1 × /1.9 × less than that of the radix-2 NTT core on a similar FPGA platform.
2022
TCHES
2022
TCHES
Complete and Improved FPGA Implementation of Classic McEliece
We present the first specification-compliant constant-time FPGA implementation of the Classic McEliece cryptosystem from the third-round of NIST’s Post-Quantum Cryptography standardization process. In particular, we present the first complete implementation including encapsulation and decapsulation modules as well as key generation with seed expansion. All the hardware modules are parametrizable, at compile time, with security level and performance parameters. As the most time consuming operation of Classic McEliece is the systemization of the public key matrix during key generation, we present and evaluate three new algorithms that can be used for systemization while complying with the specification: hybrid early-abort systemizer (HEA), single-pass early-abort systemizer (SPEA), and dual-pass earlyabort systemizer (DPEA). All of the designs outperform the prior systemizer designs for Classic McEliece by 2.2x to 2.6x in average runtime and by 1.7x to 2.4x in time-area efficiency. We show that our complete Classic McEliece design for example can perform key generation in 5.2 ms to 20 ms, encapsulation in 0.1 ms to 0.5 ms, and decapsulation in 0.7 ms to 1.5 ms for all security levels on an Xlilinx Artix 7 FPGA. The performance can be increased even further at the cost of resources by increasing the level of parallelization using the performance parameters of our design.
2022
TCHES
Composable Gadgets with Reused Fresh Masks: First-Order Probing-Secure Hardware Circuits with only 6 Fresh Masks
Albeit its many benefits, masking cryptographic hardware designs has proven to be a non-trivial and error-prone task, even for experienced engineers. Masked variants of atomic logic gates, like AND or XOR – commonly referred to as gadgets – aim to facilitate the process of masking large circuits by offering free composition while sustaining the overall design’s security in the d-probing adversary model. A wide variety of research has already been conducted to (i) find formal properties a gadget must fulfill to guarantee composability and (ii) construct gadgets that fulfill these properties, while minimizing overhead requirements. In all existing composition frameworks like NI/SNI/PINI and all corresponding gadget realizations, the security argument relies on the fact that each gadget requires individual fresh randomness. Naturally, this approach leads to very high randomness requirements of the resulting composed circuit. In this work, we present composable gadgets with reused fresh masks (COMAR), allowing the composition of any first-order secure hardware circuit utilizing only 6 fresh masks in total. By construction, our newly presented gadgets render individual fresh randomness unnecessary, while retaining free composition and first-order security in the robust probing model. More precisely, we give an instantiation of gadgets realizing arbitrary XOR and AND gates with an arbitrary number of inputs which can be trivially extended to all basic logic gates. With these, we break the linear dependency between the number of (non-linear) gates in a circuit and the randomness requirements, hence offering the designers the possibility to highly optimize a masked circuit’s randomness requirements while keeping error susceptibility to a minimum.
2022
TCHES
Composite Enclaves: Towards Disaggregated Trusted Execution
The ever-rising computation demand is forcing the move from the CPU to heterogeneous specialized hardware, which is readily available across modern datacenters through disaggregated infrastructure. On the other hand, trusted execution environments (TEEs), one of the most promising recent developments in hardware security, can only protect code confined in the CPU, limiting TEEs’ potential and applicability to a handful of applications. We observe that the TEEs’ hardware trusted computing base (TCB) is fixed at design time, which in practice leads to using untrusted software to employ peripherals in TEEs. Based on this observation, we propose composite enclaves with a configurable hardware and software TCB, allowing enclaves access to multiple computing and IO resources. Finally, we present two case studies of composite enclaves: i) an FPGA platform based on RISC-V Keystone connected to emulated peripherals and sensors, and ii) a large-scale accelerator. These case studies showcase a flexible but small TCB (2.5 KLoC for IO peripherals and drivers), with a low-performance overhead (only around 220 additional cycles for a context switch), thus demonstrating the feasibility of our approach and showing that it can work with a wide range of specialized hardware.
2022
TCHES
Cryptanalysis of Efficient Masked Ciphers: Applications to Low Latency
This work introduces second-order masked implementation of LED, Midori, Skinny, and Prince ciphers which do not require fresh masks to be updated at every clock cycle. The main idea lies on a combination of the constructions given by Shahmirzadi and Moradi at CHES 2021, and the theory presented by Beyne et al. at Asiacrypt 2020. The presented masked designs only use a minimal number of shares, i.e., three to achieve second-order security, and we make use of a trick to pair a couple of S-boxes to reduce their latency. The theoretical security analyses of our constructions are based on the linear-cryptanalytic properties of the underlying masked primitive as well as SILVER, the leakage verification tool presented at Asiacrypt 2020. To improve this cryptanalytic analysis, we use the noisy probing model which allows for the inclusion of noise in the framework of Beyne et al. We further provide FPGA-based experimental security analysis confirming second-order protection of our masked implementations.
2022
TCHES
Curse of Re-encryption: A Generic Power/EM Analysis on Post-Quantum KEMs
This paper presents a side-channel analysis (SCA) on key encapsulation mechanism (KEM) based on the Fujisaki–Okamoto (FO) transformation and its variants. The FO transformation has been widely used in actively securing KEMs from passively secure public key encryption (PKE), as it is employed in most of NIST post-quantum cryptography (PQC) candidates for KEM. The proposed attack exploits side-channel leakage during execution of a pseudorandom function (PRF) or pseudorandom number generator (PRG) in the re-encryption of KEM decapsulation as a plaintext-checking oracle that tells whether the PKE decryption result is equivalent to the reference plaintext. The generality and practicality of the plaintext-checking oracle allow the proposed attack to attain a full-key recovery of various KEMs when an active attack on the underlying PKE is known. This paper demonstrates that the proposed attack can be applied to most NIST PQC third-round KEM candidates, namely, Kyber, Saber, FrodoKEM, NTRU, NTRU Prime, HQC, BIKE, and SIKE (for BIKE, the proposed attack achieves a partial key recovery). The applicability to Classic McEliece is unclear because there is no known active attack on this cryptosystem. This paper also presents a side-channel distinguisher design based on deep learning (DL) for mounting the proposed attack on practical implementation without the use of a profiling device. The feasibility of the proposed attack is evaluated through experimental attacks on various PRF implementations (a SHAKE software, an AES software, an AES hardware, a bit-sliced masked AES software, and a masked AES hardware based on threshold implementation). Although it is difficult to implement the oracle using the leakage from the TI-based masked hardware, the success of the proposed attack against these implementations (even except for the masked hardware), which include masked software, confirms its practicality.
2022
TCHES
Don’t Learn What You Already Know: Scheme-Aware Modeling for Profiling Side-Channel Analysis against Masking
Over the past few years, deep-learning-based attacks have emerged as a de facto standard, thanks to their ability to break implementations of cryptographic primitives without pre-processing, even against widely used counter-measures such as hiding and masking. However, the recent works of Bronchain and Standaert at Tches 2020 questioned the soundness of such tools if used in an uninformed setting to evaluate implementations protected with higher-order masking. On the opposite, worst-case evaluations may be seen as possibly far from what a real-world adversary could do, thereby leading to too conservative security bounds. In this paper, we propose a new threat model that we name scheme-aware benefiting from a trade-off between uninformed and worst-case models. Our scheme-aware model is closer to a real-world adversary, in the sense that it does not need to have access to the random nonces used by masking during the profiling phase like in a worst-case model, while it does not need to learn the masking scheme as implicitly done by an uninformed adversary. We show how to combine the power of deep learning with the prior knowledge of scheme-aware modeling. As a result, we show on simulations and experiments on public datasets how it sometimes allows to reduce by an order of magnitude the profiling complexity, i.e., the number of profiling traces needed to satisfyingly train a model, compared to a fully uninformed adversary.
2022
TCHES
Don’t Reject This: Key-Recovery Timing Attacks Due to Rejection-Sampling in HQC and BIKE
Well before large-scale quantum computers will be available, traditional cryptosystems must be transitioned to post-quantum (PQ) secure schemes. The NIST PQC competition aims to standardize suitable cryptographic schemes. Candidates are evaluated not only on their formal security strengths, but are also judged based on the security with regard to resistance against side-channel attacks. Although round 3 candidates have already been intensively vetted with regard to such attacks, one important attack vector has hitherto been missed: PQ schemes often rely on rejection sampling techniques to obtain pseudorandomness from a specific distribution. In this paper, we reveal that rejection sampling routines that are seeded with secretdependent information and leak timing information result in practical key recovery attacks in the code-based key encapsulation mechanisms HQC and BIKE.Both HQC and BIKE have been selected as alternate candidates in the third round of the NIST competition, which puts them on track for getting standardized separately o the finalists. They have already been specifically hardened with constant-time decoders to avoid side-channel attacks. However, in this paper, we show novel timing vulnerabilities in both schemes: (1) Our secret key recovery attack on HQC requiresonly approx. 866,000 idealized decapsulation timing oracle queries in the 128-bit security setting. It is structurally different from previously identified attacks on the scheme: Previously, exploitable side-channel leakages have been identified in the BCH decoder of a previously submitted HQC version, in the ciphertext check as well as in the pseudorandom function of the Fujisaki-Okamoto transformation. In contrast, our attack uses the fact that the rejection sampling routine invoked during the deterministic re-encryption of the decapsulation leaks secret-dependent timing information, which can be efficiently exploited to recover the secret key when HQC is instantiated with the (now constant-time) BCH decoder, as well as with the RMRS decoder of the current submission. (2) From the timing information of the constant weight word sampler in the BIKE decapsulation, we demonstrate how to distinguish whether the decoding step is successful or not, and how this distinguisher is then used in the framework of the GJS attack to derive the distance spectrum of the secret key, using 5.8 x 107 idealized timing oracle queries. We provide details and analyses of the fully implemented attacks, as well as a discussion on possible countermeasures and their limits.
2022
TCHES
ECDSA White-Box Implementations: Attacks and Designs from CHES 2021 Challenge
Despite the growing demand for software implementations of ECDSA secure against attackers with full control of the execution environment, scientific literature on ECDSA white-box design is scarce. The CHES 2021 WhibOx contest was thus held to assess the state-of-the-art and encourage relevant practical research, inviting developers to submit ECDSA white-box implementations and attackers to break the corresponding submissions.In this work, attackers (team TheRealIdefix) and designers (team zerokey) join to describe several attack techniques and designs used during this contest. We explain the methods used by the team TheRealIdefix, which broke the most challenges, and we show the efficiency of each of these methods against all the submitted implementations. Moreover, we describe the designs of the two winning challenges submitted by the team zerokey; these designs represent the ECDSA signature algorithm by a sequence of systems of low-degree equations, which are obfuscated with affine encodings and extra random variables and equations.The WhibOx contest has shown that securing ECDSA in the white-box model is an open and challenging problem, as no implementation survived more than two days. In this context, our designs provide a starting methodology for further research, and our attacks highlight the weak points future work should address.
2022
TCHES
Efficient Implementations of Rainbow and UOV using AVX2
A signature scheme based on multivariate quadratic equations, Rainbow, was selected as one of digital signature finalists for NIST Post-Quantum Cryptography Standardization Round 3. In this paper, we provide efficient implementations of Rainbow and UOV using the AVX2 instruction set. These efficient implementations include several optimizations for signing to accelerate solving linear systems and the Vinegar value substitution. We propose a new block matrix inversion (BMI) method using the Lower-Diagonal-Upper decomposition of blocks matrices based on the Schur complement that accelerates solving linear systems. Compared to UOV implemented with Gaussian elimination, our implementations with the BMI result in speedups of 12.36%, 24.3%, and 34% for signing at security categories I, III, and V, respectively. Compared to Rainbow implemented with Gaussian elimination, our implementations with the BMI result in speedups of 16.13% and 20.73% at the security categories III and V, respectively. We show that precomputation for the Vinegar value substitution and solving linear systems dramatically improve their signing. UOV with precomputation is 16.9 times, 35.5 times, and 62.8 times faster than UOV without precomputation at the three security categories, respectively. Rainbow with precomputation is 2.1 times, 2.2 times, and 2.8 times faster than Rainbow without precomputation at the three security categories, respectively. We then investigate resilience against leakage or reuse of the precomputed values in UOV and Rainbow to use the precomputation securely: leakage or reuse of the precomputed values leads to their full secret key recoveries in polynomial-time.
2022
TCHES
Exploring Feature Selection Scenarios for Deep Learning-based Side-channel Analysis
One of the main promoted advantages of deep learning in profiling sidechannel analysis is the possibility of skipping the feature engineering process. Despite that, most recent publications consider feature selection as the attacked interval from the side-channel measurements is pre-selected. This is similar to the worst-case security assumptions in security evaluations when the random secret shares (e.g., mask shares) are known during the profiling phase: an evaluator can identify points ofinterest locations and efficiently trim the trace interval. To broadly understand how feature selection impacts the performance of deep learning-based profiling attacks, this paper investigates three different feature selection scenarios that could be realistically used in practical security evaluations. The scenarios range from the minimum possible number of features (worst-case security assumptions) to the whole available traces. Our results emphasize that deep neural networks as profiling models show successful key recovery independently of explored feature selection scenarios against first-order masked software implementations of AES-128. First, we show that feature selection with the worst-case security assumptions results in optimal profiling models that are highly dependent on the number of features and signal-to-noise ratio levels. Second, we demonstrate that attacking raw side-channel measurements with small deep neural networks also provides optimal models, that shortens the gap between worst-case security evaluations and online (realistic) profiling attacks. In all explored feature selection scenarios, the hyperparameter search always indicates a successful model with up to eight hidden layers for MLPs and CNNs, suggesting that complex models are not required for the considered datasets. Our results demonstrate the key recovery with less than ten attack traces for all datasets for at least one of the feature selection scenarios. Additionally, in several cases, we can recover the target key with a single attack trace.
2022
TCHES
Faster Constant-Time Decoder for MDPC Codes and Applications to BIKE KEM
BIKE is a code-based key encapsulation mechanism (KEM) that was recently selected as an alternate candidate by the NIST’s standardization process on post-quantum cryptography. This KEM is based on the Niederreiter scheme instantiated with QC-MDPC codes, and it uses the BGF decoder for key decapsulation. We discovered important limitations of BGF that we describe in detail, and then we propose a new decoding algorithm for QC-MDPC codes called PickyFix. Our decoder uses two auxiliary iterations that are significantly different from previous approaches and we show how they can be implemented efficiently. We analyze our decoder with respect to both its error correction capacity and its performance in practice. When compared to BGF, our constant-time implementation of PickyFix achieves speedups of 1.18, 1.29, and 1.47 for the security levels 128, 192 and 256, respectively.
2022
TCHES
FDFB: Full Domain Functional Bootstrapping Towards Practical Fully Homomorphic Encryption
Computation on ciphertexts of all known fully homomorphic encryption (FHE) schemes induces some noise, which, if too large, will destroy the plaintext. Therefore, the bootstrapping technique that re-encrypts a ciphertext and reduces the noise level remains the only known way of building FHE schemes for arbitrary unbounded computations. The bootstrapping step is also the major efficiency bottleneck in current FHE schemes. A promising direction towards improving concrete efficiency is to exploit the bootstrapping process to perform useful computation while reducing the noise at the same time. We show a bootstrapping algorithm, which embeds a lookup table and evaluates arbitrary functions of the plaintext while reducing the noise. Depending on the choice of parameters, the resulting homomorphic encryption scheme may be either an exact FHE or homomorphic encryption for approximate arithmetic. Since we can evaluate arbitrary functions over the plaintext space, we can use the natural homomorphism of Regev encryption to compute affine functions without bootstrapping almost for free. Consequently, our algorithms are particularly suitable for arithmetic circuits over a finite field with many additions and scalar multiplication gates. We achieve significant speedups when compared to binary circuit-based FHE. For example, we achieve 280-1200x speedups when computing an affine function of size 784 followed by any univariate function when compared to FHE schemes that compute binary circuits. With our bootstrapping algorithm, we can efficiently convert between arithmetic and boolean plaintexts and extend the plaintext space using the Chinese remainder theorem. Furthermore, we can run the computation in an exact and approximate mode where we trade-off the size of the plaintext space with approximation error. We provide a tight error analysis and show several parameter sets for our bootstrapping. Finally, we implement our algorithm and provide extensive tests. We demonstrate our algorithms by evaluating different neural networks in several parameter and accuracy settings.
2022
TCHES
Find the Bad Apples: An efficient method for perfect key recovery under imperfect SCA oracles – A case study of Kyber
Side-channel resilience is a crucial feature when assessing whether a postquantum cryptographic proposal is sufficiently mature to be deployed. In this paper, we propose a generic and efficient adaptive approach to improve the sample complexity (i.e., the required number of traces) of plaintext-checking (PC) oracle-based sidechannel attacks (SCAs), a major class of key recovery chosen-ciphertext SCAs on lattice-based key encapsulation mechanisms (KEMs). This new approach is preferable when the constructed PC oracle is imperfect, which is common in practice, and its basic idea is to design new detection codes that can determine erroneous positions in the initially recovered secret key. These secret entries are further corrected with a small number of additional traces. This work benefits from the generality of PC oracle and thus is applicable to various schemes and implementations.Our main target is Kyber since it has been selected by NIST as the KEM algorithm for standardization. We instantiated the proposed generic attack on Kyber512 and then conducted extensive computer simulations against Kyber512 and FireSaber. We further mounted an electromagnetic (EM) attack against an optimized implementation of Kyber512 in the pqm4 library running on an STM32F407G board with an ARM Cortex-M4 microcontroller. These simulations and real-world experiments demonstrate that the newly proposed attack could greatly improve the state-of-the-art in terms of the required number of traces. For instance, the new attack requires only 41% of the EM traces needed in a majority-voting attack in our experiments, where the raw oracle accuracy is fixed.
2022
TCHES
Free Fault Leakages for Deep Exploitation: Algebraic Persistent Fault Analysis on Lightweight Block Ciphers
Persistent Fault Analysis (PFA) is a new fault analysis method for block ciphers proposed in CHES 2018, which utilizes those faults that persist in encryptions. However, one fact that has not been raised enough attention is that: while the fault itself does persist in the entire encryption, the corresponding statistical analysis merely leverages fault leakages in the last one or two rounds, which ignores the valuable leakages in deeper rounds. In this paper, we propose Algebraic Persistent Fault Analysis (APFA), which introduces algebraic analysis to facilitate PFA. APFA tries to make full usage of the free fault leakages in the deeper rounds without incurring additional fault injections. The core idea of APFA is to build similar algebraic constraints for the output of substitution layers and apply the constraints to as many rounds as possible. APFA has many advantages compared to PFA. First, APFA can bypass the manual deductions of round key dependencies along the fault propagation path and transfer the burdens to the computing power of machine solvers such as Crypto-MiniSAT. Second, thanks to the free leakages in the deeper round, APFA requires a much less number of ciphertexts than previous PFA methods, especially for those lightweight block ciphers such as PRESENT, LED, SKINNY, etc. Only 10 faulty ciphertexts are required to recover the master key of SKINNY-64-64, which is about 155 times of reduction as compared to the state-of-the-art result. Third, APFA can be applied to the block ciphers that cannot be analyzed by PFA due to the key size, such as PRESENT-128. Most importantly, APFA replaces statistical analysis with algebraic analysis, which opens a new direction for persistent-fault related researches.
2022
TCHES
GE vs GM: Efficient side-channel security evaluations on full cryptographic keys
Security evaluations for full cryptographic keys is a very important research topic since the past decade. An efficient rank estimation algorithm was proposed at FSE 2015 to approximate the empirical guessing entropy remaining after a side-channel attack on a full AES key, by combining information from attacks on each byte of he key independently. However, these could not easily scale to very large keys over 1024 bits. Hence, at CHES 2017, it was proposed a new approach for scalable security evaluations based on Massey’s guessing entropy, which was shown tight and scalable to very large keys, even beyond 8192 bits. Then, at CHES 2020, it was proposed a new method for estimating the empirical guessing entropy for the case of full-key evaluations, showing also important divergences between the empirical guessing entropy and Massey’s guessing entropy. However, there has been some confusion in recent publications of side-channel evaluation methods relying on these two variants of the guessing entropy. Furthermore, it remained an open problem to decide which of these methods should be used and in which context, particularly given the wide acceptance of the empirical guessing entropy in the side-channel community and the relatively little use of the other.In this paper, we tackle this open problem through several contributions. First of all, we provide an unitary presentation of both versions of the guessing entropy, allowing an easy comparison of the two metrics. Secondly, we compare the two metrics using a set of common and relevant indicators, as well as three different datasets for side-channel evaluations (simulated, AVR XMEGA 8-bit microcontroller and a 32-bit device). We used these indicators and datasets also to compare the three full-key evaluation methods from FSE 2015, CHES 2017 and CHES 2020, allowing us to provide a clear overview of the usefulness and limitations of each method. Furthermore, our analysis has enabled us to find a new method for verifying the soundness of a leakage model, by comparing both versions of the guessing entropy. This method can be easily extended to full-key evaluations, hence leading to a new useful method for side-channel evaluations.
2022
TCHES
Generic Hardware Private Circuits: Towards Automated Generation of Composable Secure Gadgets
With an increasing number of mobile devices and their high accessibility, protecting the implementation of cryptographic functions in the presence of physical adversaries has become more relevant than ever. Over the last decade, a lion’s share of research in this area has been dedicated to developing countermeasures at an algorithmic level. Here, masking has proven to be a promising approach due to the possibility of formally proving the implementation’s security solely based on its algorithmic description by elegantly modeling the circuit behavior. Theoretically verifying the security of masked circuits becomes more and more challenging with increasing circuit complexity. This motivated the introduction of security notions that enable masking of single gates while still guaranteeing the security when the masked gates are composed. Systematic approaches to generate these masked gates – commonly referred to as gadgets – were restricted to very simple gates like 2-input AND gates. Simply substituting such small gates by a secure gadget usually leads to a large overhead in terms of fresh randomness and additional latency (register stages) being introduced to the design.In this work, we address these problems by presenting a generic framework to construct trivially composable and secure hardware gadgets for arbitrary vectorial Boolean functions, enabling the transformation of much larger sub-circuits into gadgets. In particular, we present a design methodology to generate first-order secure masked gadgets which is well-suited for integration into existing Electronic Design Automation (EDA) tools for automated hardware masking as only the Boolean function expression is required. Furthermore, we practically verify our findings by conducting several case studies and show that our methodology outperforms various other masking schemes in terms of introduced latency or fresh randomness – especially for large circuits.
2022
TCHES
Guessing Bits: Improved Lattice Attacks on (EC)DSA with Nonce Leakage
The lattice reduction attack on (EC)DSA (and other Schnorr-like signature schemes) with partially known nonces, originally due to Howgrave-Graham and Smart, has been at the core of many concrete cryptanalytic works, side-channel based or otherwise, in the past 20 years. The attack itself has seen limited development, however: improved analyses have been carried out, and the use of stronger lattice reduction algorithms has pushed the range of practically vulnerable parameters further, but the lattice construction based on the signatures and known nonce bits remain the same.In this paper, we propose a new idea to improve the attack based on the same data in exchange for additional computation: carry out an exhaustive search on some bits of the secret key. This turns the problem from a single bounded distance decoding (BDD) instance in a certain lattice to multiple BDD instances in a fixed lattice of larger volume but with the same bound (making the BDD problem substantially easier). Furthermore, the fact that the lattice is fixed lets us use batch/preprocessing variants of BDD solvers that are far more efficient than repeated lattice reductions on non-preprocessed lattices of the same size. As a result, our analysis suggests that our technique is competitive or outperforms the state of the art for parameter ranges corresponding to the limit of what is achievable using lattice attacks so far (around 2-bit leakage on 160-bit groups, or 3-bit leakage on 256-bit groups).We also show that variants of this idea can also be applied to bits of the nonces (leading to a similar improvement) or to filtering signature data (leading to a data-time trade-off for the lattice attack). Finally, we use our technique to obtain an improved exploitation of the TPM–FAIL dataset similar to what was achieved in the Minerva attack.
2022
TCHES
High Order Side-Channel Security for Elliptic-Curve Implementations
Elliptic-curve implementations protected with state-of-the-art countermeasures against side-channel attacks might still be vulnerable to advanced attacks that recover secret information from a single leakage trace. The effectiveness of these attacks is boosted by the emergence of deep learning techniques for side-channel analysis which relax the control or knowledge an adversary must have on the target implementation. In this paper, we provide generic countermeasures to withstand these attacks for a wide range of regular elliptic-curve implementations. We first introduce a framework to formally model a regular algebraic program which consists of a sequence of algebraic operations indexed by key-dependent values. We then introduce a generic countermeasure to protect these types of programs against advanced single-trace side-channel attacks. Our scheme achieves provable security in the noisy leakage model under a formal assumption on the leakage of randomized variables. To demonstrate the applicability of our solution, we provide concrete examples on several widely deployed scalar multiplication algorithms and report some benchmarks for a protected implementation on a smart card.
2022
TCHES
High-order Polynomial Comparison and Masking Lattice-based Encryption
The main protection against side-channel attacks consists in computing every function with multiple shares via the masking countermeasure. For IND-CCA secure lattice-based encryption schemes, the masking of the decryption algorithm requires the high-order computation of a polynomial comparison. In this paper, we describe and evaluate a number of different techniques for such high-order comparison, always with a security proof in the ISW probing model. As an application, we describe the full high-order masking of the NIST standard Kyber, with a concrete implementation on ARM Cortex M architecture, and a t-test evaluation.
2022
TCHES
High-order Table-based Conversion Algorithms and Masking Lattice-based Encryption
Masking is the main countermeasure against side-channel attacks on embedded devices. For cryptographic algorithms that combine Boolean and arithmetic masking, one must therefore convert between the two types of masking, without leaking additional information to the attacker. In this paper we describe a new high-order conversion algorithm between Boolean and arithmetic masking, based on table recomputation, and provably secure in the ISW probing model. We show that our technique is particularly efficient for masking structured LWE encryption schemes such as Kyber and Saber. In particular, for Kyber IND-CPA decryption, we obtain an order of magnitude improvement compared to existing techniques.
2022
TCHES
Higher-Order DCA Attacks on White-Box Implementations with Masking and Shuffling Countermeasures
On white-box implementations, it has been proven that differential computation analysis (DCA) can recover secret keys without time-costly reverse engineering. At CHES 2021, Seker et al. combined linear and non-linear masking protections (SEL masking) to prevent sensitive variables from being predicted by DCA. At Eurocrypt 2021, Biryukov and Udovenko introduced a public dummy shuffling construction (BU shuffling) to protect sensitive functions. In this paper, we extend higher-order DCA (HO-DCA) to higher-degree context for exploiting the vulnerabilities against the state-of-the-art countermeasures. The data-dependency HO-DCA (DDHO-DCA), which is proposed at CHES 2020, is improved to successfully recover the correct key of SEL masking. In specific, our improved DDHO-DCA can also enhance the attack result of #100 which is the third winning challenge in WhibOx 2019. Since the XOR phase plays the same role as linear masking, we prove that a specific BU shuffling is vulnerable to HO-DCA attacks. Furthermore, we demonstrate that the combination of SEL masking and the specific BU shuffling still cannot defeat our higher-degree HO-DCA and improved DDHO-DCA attacks.
2022
TCHES
Higher-Order Masked Ciphertext Comparison for Lattice-Based Cryptography
Checking the equality of two arrays is a crucial building block of the Fujisaki-Okamoto transformation, and as such it is used in several post-quantum key encapsulation mechanisms including Kyber and Saber. While this comparison operation is easy to perform in a black box setting, it is hard to efficiently protect against side-channel attacks. For instance, the hash-based method by Oder et al. is limited to first-order masking, a higher-order method by Bache et al. was shown to be flawed, and a very recent higher-order technique by Bos et al. suffers in runtime. In this paper, we first demonstrate that the hash-based approach, and likely many similar first-order techniques, succumb to a relatively simple side-channel collision attack. We can successfully recover a Kyber512 key using just 6000 traces. While this does not break the security claims, it does show the need for efficient higher-order methods. We then present a new higher-order masked comparison algorithm based on the (insecure) higher-order method of Bache et al. Our new method is 4.2x, resp. 7.5x, faster than the method of Bos et al. for a 2nd, resp. 3rd, -order masking on the ARM Cortex-M4, and unlike the method of Bache et al., the new technique takes ciphertext compression into account. We prove correctness, security, and masking security in detail and provide performance numbers for 2nd and 3rd-order implementations. Finally, we verify our the side-channel security of our implementation using the test vector leakage assessment (TVLA) methodology.
2022
TCHES
Highly Vectorized SIKE for AVX-512
It is generally accepted that a large-scale quantum computer would be capable to break any public-key cryptosystem used today, thereby posing a serious threat to the security of the Internet’s public-key infrastructure. The US National Institute of Standards and Technology (NIST) addresses this threat with an open process for the standardization of quantum-safe key establishment and signature schemes, which is now in the final phase of the evaluation of candidates. SIKE (an abbreviation of Supersingular Isogeny Key Encapsulation) is one of the alternate candidates under evaluation and distinguishes itself from other candidates due to relatively short key lengths and relatively high computing costs. In this paper, we analyze how the latest generation of Intel’s Advanced Vector Extensions (AVX), in particular AVX-512IFMA, can be used to minimize the latency (resp. maximize the hroughput) of the SIKE key encapsulation mechanism when executed on Ice Lake CPUs based on the Sunny Cove microarchitecture. We present various techniques to parallelize and speed up the base/extension field arithmetic, point arithmetic, and isogeny computations performed by SIKE. All these parallel processing techniques are combined in AvxSike, a highly optimized implementation of SIKE using Intel AVX-512IFMA instructions. Our experiments indicate that AvxSike instantiated with the SIKEp503 parameter set is approximately 1.5 times faster than the to-date best AVX-512IFMA-based SIKE software from the literature. When executed on an Intel Core i3-1005G1 CPU, AvxSike outperforms the x64 assembly implementation of SIKE contained in Microsoft’s SIDHv3.4 library by a factor of about 2.5 for key generation and decapsulation, while the encapsulation is even 3.2 times faster.
2022
TCHES
ImpedanceVerif: On-Chip Impedance Sensing for System-Level Tampering Detection
Physical attacks can compromise the security of cryptographic devices. Depending on the attack’s requirements, adversaries might need to (i) place probes in the proximity of the integrated circuits (ICs) package, (ii) create physical connections between their probes/wires and the system’s PCB, or (iii) physically tamper with the PCB’s components, chip’s package, or substitute the entire PCB to prepare the device for the attack. While tamper-proof enclosures prevent and detect physical access to the system, their high manufacturing cost and incompatibility with legacy systems make them unattractive for many low-cost scenarios. In this paper, inspired by methods known from the field of power integrity analysis, we demonstrate how the impedance characterization of the system’s power distribution network (PDN) using on-chip circuit-based network analyzers can detect various classes of tamper events. We explain how these embedded network analyzers, without any modifications to the system, can be deployed on FPGAs to extract the frequency response of the PDN. The analysis of these frequency responses reveals different classes of tamper events from board to chip level. To validate our claims, we run an embedded network analyzer on FPGAs of a family of commercial development kits and perform extensive measurements for various classes of PCB and IC package tampering required for conducting different side-channel or fault attacks. Using the Wasserstein Distance as a statistical metric, we further show that we can confidently detect tamper events. Our results, interestingly, show that even environment-level tampering activities, such as the proximity of contactless EM probes to the IC package or slightly polished IC package, can be detected using on-chip impedance sensing.
2022
TCHES
Improved Plantard Arithmetic for Lattice-based Cryptography
This paper presents an improved Plantard’s modular arithmetic (Plantard arithmetic) tailored for Lattice-Based Cryptography (LBC). Based on the improved Plantard arithmetic, we present faster implementations of two LBC schemes, Kyber and NTTRU, running on Cortex-M4. The intrinsic advantage of Plantard arithmetic is that one multiplication can be saved from the modular multiplication of a constant. However, the original Plantard arithmetic is not very practical in LBC schemes because of the limitation on the unsigned input range. In this paper, we improve the Plantard arithmetic and customize it for the existing LBC schemes with theoretical proof. The improved Plantard arithmetic not only inherits its aforementioned advantage but also accepts signed inputs, produces signed output, and enlarges its input range compared with the original design. Moreover, compared with the state-of-the-art Montgomery arithmetic, the improved Plantard arithmetic has a larger input range and smaller output range, which allows better lazy reduction strategies during the NTT/INTT implementation in current LBC schemes. All these merits make it possible to replace the Montgomery arithmetic with the improved Plantard arithmetic in LBC schemes on some platforms. After applying this novel method to Kyber and NTTRU schemes using 16-bit NTT on Cortex-M4 devices, we show that the proposed design outperforms the known fastest implementation that uses Montgomery and Barrett arithmetic. Specifically, compared with the state-of-the-art Kyber implementation, applying the improved Plantard arithmetic in Kyber results in a speedup of 25.02% and 18.56% for NTT and INTT, respectively. Compared with the reference implementation of NTTRU, our NTT and INTT achieve speedup by 83.21% and 78.64%, respectively. As for the LBC KEM schemes, we set new speed records for Kyber and NTTRU running on Cortex-M4.
2022
TCHES
Information Theory-based Evolution of Neural Networks for Side-channel Analysis
Profiled side-channel analysis (SCA) leverages leakage from cryptographic implementations to extract the secret key. When combined with advanced methods in neural networks (NNs), profiled SCA can successfully attack even those cryptocores assumed to be protected against SCA. Despite the rise in the number of studies devoted to NN-based SCA, a range of questions has remained unanswered, namely: how to choose an NN with an adequate configuration, how to tune the NN’s hyperparameters, when to stop the training, etc. Our proposed approach, “InfoNEAT,” tackles these issues in a natural way. InfoNEAT relies on the concept of neural structure search, enhanced by information-theoretic metrics to guide the evolution, halt it with novel stopping criteria, and improve time-complexity and memory footprint. The performance of InfoNEAT is evaluated by applying it to publicly available datasets composed of real side-channel measurements. In addition to the considerable advantages regarding the automated configuration of NNs, InfoNEAT demonstrates significant improvements over other approaches for effective key recovery in terms of the number of epochs (e.g.,x6 faster) and the number of attack traces compared to both MLPs and CNNs (e.g., up to 1000s fewer traces to break a device) as well as a reduction in the number of trainable parameters compared to MLPs (e.g., by the factor of up to 32). Furthermore, through experiments, it is demonstrated that InfoNEAT’s models are robust against noise and desynchronization in traces.
2022
TCHES
Know Time to Die – Integrity Checking for Zero Trust Chiplet-based Systems Using Between-Die Delay PUFs
Industry trends are moving toward increasing use of chiplets as a replacement for monolithic fabrication in many modern chips. Each chiplet is a separately-produced silicon die, and a system-on-chip (SoC) is created by packaging the chiplets together on a silicon interposer or bridge. Chiplets enable IP reuse, heterogeneousintegration, and better ability to leverage cost-appropriate process nodes. Yet, creating systems from separately produced components also brings security risks to consider, such as the possibility of die swapping, or susceptibility to interposer probing or tampering. In a zero-trust security posture, a chiplet should not blindly assume it is operating in a friendly environment.In this paper we propose a delay-based PUF for chiplets to verify system integrity. Our technique allows a single chiplet to initiate a protocol with its neighbors to measure unique variations in the propagation delays of incoming signals as part of an integrity check. We prototype our design on Xilinx Ultrascale+ FPGAs, which are constructed as multi-die systems on a silicon interposer, and which also emulate the general features of other industrial chiplet interfaces. We perform experiments on, and compare data from, dozens of Ultrascale+ FPGAs by making use of Amazon’s Elastic Compute Cloud (EC2) F1 instances as a testing platform. The PUF cells are shown to reject clock and temperature variation as common mode, and each cell produces approximately 5 ps of unique delay variation. For a design with 144 PUF cells, we measure the mean within-class and between-class distances to be 68.3 ps and 847.7 ps, respectively. The smallest between-class distance of 686.0 ps exceeds the largest within-class distance of 124.0 ps by more than 5x under nominal conditions, and the PUF is shown to be resilient to environmental changes. Our findings indicate the PUF can be used for authentication, and is potentially sensitive enough to detect picosecond-scale timing changes due to tampering.
2022
TCHES
Low-Latency and Low-Randomness Second-Order Masked Cubic Functions
Masking schemes are the most popular countermeasure to mitigate Side-Channel Analysis (SCA) attacks. Compared to software, their hardware implementations require certain considerations with respect to physical defaults, such as glitches. To counter this extended leakage effect, the technique known as Threshold Implementation (TI) has proven to be a reliable solution. However, its efficiency, namely the number of shares, is tied to the algebraic degree of the target function. As a result, the application of TI may lead to unaffordable implementation costs. This dependency is relaxed by the successor schemes where the minimum number of d + 1 shares suffice for dth-order protection independent of the function’s algebraic degree. By this, although the number of input shares is reduced, the implementation costs are not necessarily low due to their high demand for fresh randomness. It becomes even more challenging when a joint low-latency and low-randomness cost is desired. In this work, we provide a methodology to realize the second-order glitch-extended probing-secure implementation of cubic functions with three shares while allowing to reuse fresh randomness. This enables us to construct low-latency second-order secure implementations of several popular lightweight block ciphers, including Skinny, Midori, and Prince, with a very limited number of fresh masks. Notably, compared to state-of-the-art equivalent implementations, our designs lower the latency in terms of the number of clock cycles while keeping randomness costs low.
2022
TCHES
Low-Latency Design and Implementation of the Squaring in Class Groups for Verifiable Delay Function Using Redundant Representation
A verifiable delay function (VDF) is a function whose evaluation requires running a prescribed number of sequential steps over a group while the result can be efficiently verified. As a kind of cryptographic primitives, VDFs have been adopted in rapidly growing applications for decentralized systems. For the security of VDFs in practical applications, it is widely agreed that the fastest implementation for the VDF evaluation, sequential squarings in a group of unknown order, should be publicly provided. To this end, we propose a possible minimum latency hardware implementation for the squaring in class groups by algorithmic and architectural level co-optimization. Firstly, low-latency architectures for large-number division, multiplication, and addition are devised using redundant representation, respectively. Secondly, we present two hardware-friendly algorithms which avoid time-consuming divisions involved in calculations related to the extended greatest common divisor (XGCD) and design the corresponding low-latency architectures. Besides, we schedule and reuse these computation modules to achieve good resource utilization by using compact instruction control. Finally, we code and synthesize the proposed design under the TSMC 28nm CMOS technology. The experimental results show that our design can achieve a speedup of 3.6x compared to the state-of-the-art implementation of the squaring in the class group. Moreover, compared to the optimal C++ implementation over an advanced CPU, our implementation is 9.1x faster.
2022
TCHES
Masked Accelerators and Instruction Set Extensions for Post-Quantum Cryptography
Side-channel attacks can break mathematically secure cryptographic systems leading to a major concern in applied cryptography. While the cryptanalysis and security evaluation of Post-Quantum Cryptography (PQC) have already received an increasing research effort, a cost analysis of efficient side-channel countermeasures is still lacking. In this work, we propose a masked HW/SW codesign of the NIST PQC finalists Kyber and Saber, suitable for their different characteristics. Among others, we present a novel masked ciphertext compression algorithm for non-power-of-two moduli. To accelerate linear performance bottlenecks, we developed a generic Number Theoretic Transform (NTT) multiplier, which, in contrast to previously published accelerators, is also efficient and suitable for schemes not based on NTT. For the critical non-linear operations, masked HW accelerators were developed, allowing a secure execution using RISC-V instruction set extensions. With the proposed design, we achieved a cycle count of K:214k/E:298k/D:313k for Kyber and K:233k/E:312k/D:351k for Saber with NIST Level III parameter sets. For the same parameter sets, the masking overhead for the first-order secure decapsulation operation including randomness generation is a factor of 4.48 for Kyber (D:1403k)and 2.60 for Saber (D:915k).
2022
TCHES
MCRank: Monte Carlo Key Rank Estimation for Side-Channel Security Evaluations
Key rank estimation provides a measure of the effort that the attacker has to spend bruteforcing the key of a cryptographic algorithm, after having gained some information from a side channel attack. We present MCRank, a novel method for key rank estimation based on Monte Carlo sampling. MCRank provides an unbiased estimate of the rank and a confidence interval. Its bounds rapidly become tight for increasing sample size, with a corresponding linear increase of the execution time. When applied to evaluate an AES-128 implementation, MCRank can be orders of magnitude faster than the state-of-the-art histogram-based enumeration method for comparable bound tightness. It also scales better than previous work for large keys, up to 2048 bytes. Besides its conceptual simplicity and efficiency, MCRank can assess for the first time the security of large keys even if the probability distributions given the side channel leakage are not independent between subkeys, which occurs, for example, when evaluating the leakage security of an AES-256 implementation.
2022
TCHES
Medha: Microcoded Hardware Accelerator for computing on Encrypted Data
Homomorphic encryption enables computation on encrypted data, and hence it has a great potential in privacy-preserving outsourcing of computations to the cloud. Hardware acceleration of homomorphic encryption is crucial as software implementations are very slow. In this paper, we present design methodologies for building a programmable hardware accelerator for speeding up the cloud-side homomorphic evaluations on encrypted data.First, we propose a divide-and-conquer technique that enables homomorphic evaluations in the polynomial ring RQ,2N = ZQ[x]/(x2N + 1) to use a hardware accelerator that has been built for the smaller ring RQ,N = ZQ[x]/(xN + 1). The technique makes it possible to use a single hardware accelerator flexibly for supporting several homomorphic encryption parameter sets.Next, we present several architectural design methods that we use to realize the flexible and instruction-set accelerator architecture, which we call ‘Medha’. At every level of the implementation hierarchy, we explore possibilities for parallel processing. Starting from hardware-friendly parallel algorithms for the basic building blocks, we gradually build heavily parallel RNS polynomial arithmetic units. Next, many of these parallel units are interconnected elegantly so that their interconnections require the minimum number of nets, therefore making the overall architecture placement-friendly on the platform. As homomorphic encryption is computation- as well as data-centric, the speed of homomorphic evaluations depends greatly on the way the data variables are handled. For Medha, we take a memory-conservative design approach and get rid of any off-chip memory access during homomorphic evaluations.Finally, we implement Medha in a Xilinx Alveo U250 FPGA and measure timing performances of the microcoded homomorphic addition, multiplication, key-switching, and rescaling routines for the leveled fully homomorphic encryption scheme RNSHEAAN at 200 MHz clock frequency. For the large parameter sets (log Q,N) = (438, 214) and (546, 215), Medha achieves accelerations by up to 68× and 78× times respectively compared to a highly optimized software implementation Microsoft SEAL running at 2.3 GHz.
2022
TCHES
MIRACLE: MIcRo-ArChitectural Leakage Evaluation: A study of micro-architectural power leakage across many devices
In this paper, we describe an extensible experimental infrastructure for evaluating the micro-architectural leakage, based on power consumption, that stems from a physical device. Building on existing literature, we use it to systematically study 14 different devices, which span 4 different instruction set architectures and 4 different vendors. The study allows a characterisation of each device with respect to any leakage effects stemming from sources within the micro-architectural implementation. We use it, for example, to identify and document several novel leakage effects (e.g., due to speculative instruction execution), and scenarios where an assumption about leakage is non-portable between different yet compatible devices.Ours is the widest study of its kind we are aware of, and highlights a range of challenges with respect to 1) the design, implementation, and evaluation of, e.g., masking schemes, 2) construction of accurate leakage models, and 3) selection of suitable devices for experimental research. For example, in relation to 1), we cast further doubt on whether a given device upholds the assumptions required by a given masking scheme; in relation to 2), we conclude that (statistical or formal) device leakage models must include information about the micro-architecture being modelled; in relation to 3), we claim the near mono-culture of devices that dominates existing literature is insufficient to support general claims regarding leakage. This is particularly important in the context of the FIPS 140-3 standard for non-invasive side-channel evaluation.
2022
TCHES
ModuloNET: Neural Networks Meet Modular Arithmetic for Efficient Hardware Masking
Intellectual Property (IP) thefts of trained machine learning (ML) models through side-channel attacks on inference engines are becoming a major threat. Indeed, several recent works have shown reverse engineering of the model internals using such attacks, but the research on building defenses is largely unexplored. There is a critical need to efficiently and securely transform those defenses from cryptography such as masking to ML frameworks. Existing works, however, revealed that a straightforward adaptation of such defenses either provides partial security or leads to high area overheads. To address those limitations, this work proposes a fundamentally new direction to construct neural networks that are inherently more compatible with masking. The key idea is to use modular arithmetic in neural networks and then efficiently realize masking, in either Boolean or arithmetic fashion, depending on the type of neural network layers. We demonstrate our approach on the edge-computing friendly binarized neural networks (BNN) and show how to modify the training and inference of such a network to work with modular arithmetic without sacrificing accuracy. We then design novel masking gadgets using Domain-Oriented Masking (DOM) to efficiently mask the unique operations of ML such as the activation function and the output layer classification, and we prove their security in the glitch-extended probing model. Finally, we implement fully masked neural networks on an FPGA, quantify that they can achieve a similar latency while reducing the FF and LUT costs over the state-of-the-art protected implementations by 34.2% and 42.6%, respectively, and demonstrate their first-order side-channel security with up to 1M traces.
2022
TCHES
Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4
The U.S. National Institute of Standards and Technology (NIST) has designated ARM microcontrollers as an important benchmarking platform for its Post-Quantum Cryptography standardization process (NISTPQC). In view of this, we explore the design space of the NISTPQC finalist Saber on the Cortex-M4 and its close relation, the Cortex-M3. In the process, we investigate various optimization strategies and memory-time tradeoffs for number-theoretic transforms (NTTs).Recent work by [Chung et al., TCHES 2021 (2)] has shown that NTT multiplication is superior compared to Toom–Cook multiplication for unprotected Saber implementations on the Cortex-M4 in terms of speed. However, it remains unclear if NTT multiplication can outperform Toom–Cook in masked implementations of Saber. Additionally, it is an open question if Saber with NTTs can outperform Toom–Cook in terms of stack usage. We answer both questions in the affirmative. Additionally, we present a Cortex-M3 implementation of Saber using NTTs outperforming an existing Toom–Cook implementation. Our stack-optimized unprotected M4 implementation uses around the same amount of stack as the most stack-optimized Toom–Cook implementation while being 33%-41% faster. Our speed-optimized masked M4 implementation is 16% faster than the fastest masked implementation using Toom–Cook. For the Cortex-M3, we outperform existing implementations by 29%-35% in speed. We conclude that for both stack- and speed-optimization purposes, one should base polynomial multiplications in Saber on the NTT rather than Toom–Cook for the Cortex-M4 and Cortex-M3. In particular, in many cases, multi-moduli NTTs perform best.
2022
TCHES
Multi-Parameter Support with NTTs for NTRU and NTRU Prime on Cortex-M4
We propose NTT implementations with each supporting at least one parameter of NTRU and one parameter of NTRU Prime. Our implementations are based on size-1440, size-1536, and size-1728 convolutions without algebraic assumptions on the target polynomial rings. We also propose several improvements for the NTT computation. Firstly, we introduce dedicated radix-(2, 3) butterflies combining Good–Thomas FFT and vector-radix FFT. In general, there are six dedicated radix-(2, 3) butterflies and they together support implicit permutations. Secondly, for odd prime radices, we show that the multiplications for one output can be replaced with additions/subtractions. We demonstrate the idea for radix-3 and show how to extend it to any odd prime. Our improvement also applies to radix-(2, 3) butterflies. Thirdly, we implement an incomplete version of Good–Thomas FFT for addressing potential code size issues. For NTRU, our polynomial multiplications outperform the state-of-the-art by 2.8%−10.3%. For NTRU Prime, our polynomial multiplications are slower than the state-of-the-art. However, the SotA exploits the specific structure of coefficient rings or polynomial moduli, while our NTT-based multiplications exploit neither and apply across different schemes. This reduces the engineering effort, including testing and verification.
2022
TCHES
Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1
We present new speed records on the Armv8-A architecture for the latticebased schemes Dilithium, Kyber, and Saber. The core novelty in this paper is the combination of Montgomery multiplication and Barrett reduction resulting in “Barrett multiplication” which allows particularly efficient modular one-known-factor multiplication using the Armv8-A Neon vector instructions. These novel techniques combined with fast two-unknown-factor Montgomery multiplication, Barrett reduction sequences, and interleaved multi-stage butterflies result in significantly faster code. We also introduce “asymmetric multiplication” which is an improved technique for caching the results of the incomplete NTT, used e.g. for matrix-to-vector polynomial multiplication. Our implementations target the Arm Cortex-A72 CPU, on which our speed is 1.7× that of the state-of-the-art matrix-to-vector polynomial multiplication in kyber768 [Nguyen–Gaj 2021]. For Saber, NTTs are far superior to Toom–Cook multiplication on the Armv8-A architecture, outrunning the matrix-to-vector polynomial multiplication by 2.0×. On the Apple M1, our matrix-vector products run 2.1× and 1.9× faster for Kyber and Saber respectively.
2022
TCHES
On Efficient and Secure Code-based Masking: A Pragmatic Evaluation
Code-based masking is a highly generalized type of masking schemes, which can be instantiated into specific cases by assigning different encoders. It captivates by its side-channel resistance against higher-order attacks and the potential to withstand fault injection attacks. However, similar to other algebraically-involved masking schemes, code-based masking is also burdened with expensive computational overhead. To mitigate such cost and make it efficient, we contribute to several improvements to the original scheme proposed by Wang et al. in TCHES 2020. Specifically, we devise a computationally friendly encoder and accordingly accelerate masked gadgets to leverage efficient implementations. In addition, we highlight that the amortization technique introduced by Wang et al. does not always lead to efficient implementations as expected, but actually decreases the efficiency in some cases.From the perspective of practical security, we carry out an extensive evaluation of the concrete security of code-based masking in the real world. On one hand, we select three representative variations of code-based masking as targets for an extensive evaluation. On the other hand, we aim at security assessment of both encoding and computations to investigate whether the state-of-the-art computational framework for code-based masking reaches the security of the corresponding encoding. By leveraging both leakage assessment tool and side-channel attacks, we verify the existence of “security order amplification” in practice and validate the reliability of the leakage quantification method proposed by Cheng et al. in TCHES 2021. In addition, we also study the security decrease caused by the “cost amortization” technique and redundancy of code-based masking. We identify a security bottleneck in the gadgets computations which limits the whole masked implementation. To the best of our knowledge, this is the first time that allows us to narrow down the gap between the theoretical security order under the probing model (sometimes with simulation experiments) and the concrete side-channel security level of protected implementations by code-based masking in practice.
2022
TCHES
On the application of Two-Photon Absorption for Laser Fault Injection attacks: Pushing the physical boundaries for Laser-based Fault Injection
Laser Fault Injection (LFI) is considered to be the most powerful semiinvasive fault injection method for implementation attacks on security devices. In this work we discuss for the first time the application of the nonlinear Two-Photon Absorption (TPA) effect for the purpose of LFI. Though TPA is an established technique in other areas, e.g. fluorescence microscopy, so far it did not receive any attention in the field of physical attack methods on integrated circuits. We show that TPA has several superior properties over the regular linear LFI method. The TPA effect allows to work on non-thinned devices without increasing the induced energy and hence the stress on the device. In contrast to regular LFI, the nonlinearity of the TPA effect leads to increased precision due to the steeper descent in intensity and also a vertically restricted photoelectric effect. By practical experiments, we demonstrate the general applicability of the method for a specific device and that unlike a regular LFI setup, TPA-LFI is capable to inject faults without triggering a latch-up effect. In addition we discuss the possible implications of TPA-LFI on various sensor-based countermeasures.
2022
TCHES
One Truth Prevails: A Deep-learning Based Single-Trace Power Analysis on RSA–CRT with Windowed Exponentiation
In this paper, a deep-learning based power/EM analysis attack on the state-of-the-art RSA–CRT software implementation is proposed. Our method is applied to a side-channel-aware implementation with the Gnu Multi-Precision (MP) Library, which is a typical open-source software library. Gnu MP employs a fixed-window exponentiation, which is the fastest in a constant time, and loads the entire precomputation table once to avoid side-channel leaks from multiplicands. To conduct an accurate estimation of secret exponents, our method focuses on the process of loading the entire precomputation table, which we call a dummy load scheme. It is particularly noteworthy that the dummy load scheme is implemented as a countermeasure against a simple power/EM analysis (SPA/SEMA). This type of vulnerability from a dummy load scheme also exists in other cryptographic libraries. We also propose a partial key exposure attack suitable for the distribution of errors inthe secret exponents recovered from the windowed exponentiation. We experimentally show that the proposed method consisting of the above power/EM analysis attack, as well as a partial key exposure attack, can be used to fully recover the secret key of the RSA–CRT from the side-channel information of a single decryption or a signature process.
2022
TCHES
Perceived Information Revisited: New Metrics to Evaluate Success Rate of Side-Channel Attacks
In this study, we present new analytical metrics for evaluating the performance of side-channel attacks (SCAs) by revisiting the perceived information (PI), which is defined using cross-entropy (CE). PI represents the amount of information utilized by a probability distribution that determines a distinguishing rule in SCA. Our analysis partially solves an important open problem in the performance evaluation of deep-learning based SCAs (DL-SCAs) that the relationship between neural network (NN) model evaluation metrics (such as accuracy, loss, and recall) and guessing entropy (GE)/success rate (SR) is unclear. We first theoretically show that the conventional CE/PI is non-calibrated and insufficient for evaluating the SCA performance, as it contains uncertainty in terms of SR. More precisely, we show that an infinite number of probability distributions with different CE/PI can achieve an identical SR. With the above analysis result, we present a modification of CE/PI, named effective CE/PI (ECE/EPI), to eliminate the above uncertainty. The ECE/EPI can be easily calculated for a given probability distribution and dataset, which would be suitable for DL-SCA. Using the ECE/EPI, we can accurately evaluate the SR hrough the validation loss in the training phase, and can measure the generalization of the NN model in terms of SR in the attack phase. We then analyze and discuss the proposed metrics regarding their relationship to SR, conditions of successful attacks for a distinguishing rule with a probability distribution, a statistic/asymptotic aspect, and the order of key ranks in SCA. Finally, we validate the proposed metrics through experimental attacks on masked AES implementations using DL-SCA.
2022
TCHES
Polynomial multiplication on embedded vector architectures
High-degree, low-precision polynomial arithmetic is a fundamental computational primitive underlying structured lattice based cryptography. Its algorithmic properties and suitability for implementation on different compute platforms is an active area of research, and this article contributes to this line of work: Firstly, we present memory-efficiency and performance improvements for the Toom-Cook/Karatsuba polynomial multiplication strategy. Secondly, we provide implementations of those improvements on Arm® Cortex®-M4 CPU, as well as the newer Cortex-M55 processor, the first M-profile core implementing the M-profile Vector Extension (MVE), also known as Arm® Helium™ technology. We also implement the Number Theoretic Transform (NTT) on the Cortex-M55 processor. We show that despite being singleissue, in-order and offering only 8 vector registers compared to 32 on A-profile SIMD architectures like Arm® Neon™ technology and the Scalable Vector Extension (SVE), by careful register management and instruction scheduling, we can obtain a 3× to 5× performance improvement over already highly optimized implementations on Cortex-M4, while maintaining a low area and energy profile necessary for use in embedded market. Finally, as a real-world application we integrate our multiplication techniques to post-quantum key-encapsulation mechanism Saber
2022
TCHES
Post-Quantum Authenticated Encryption against Chosen-Ciphertext Side-Channel Attacks
Over the last years, the side-channel analysis of Post-Quantum Cryptography (PQC) candidates in the NIST standardization initiative has received increased attention. In particular, it has been shown that some post-quantum Key Encapsulation Mechanisms (KEMs) are vulnerable to Chosen-Ciphertext Side-Channel Attacks (CC-SCA). These powerful attacks target the re-encryption step in the Fujisaki-Okamoto (FO) transform, which is commonly used to achieve CCA security in such schemes. To sufficiently protect PQC KEMs on embedded devices against such a powerful CC-SCA, masking at increasingly higher order is required, which induces a considerable overhead. In this work, we propose to use a conceptually simple construction, the ΕtS KEM, that alleviates the impact of CC-SCA. It uses the Encrypt-then-Sign (EtS) paradigm introduced by Zheng at ISW ’97 and further analyzed by An, Dodis and Rabin at EUROCRYPT ’02, and instantiates a postquantum authenticated KEM in the outsider-security model. While the construction is generic, we apply it to the CRYSTALS-Kyber KEM, relying on the CRYSTALSDilithium and Falcon signature schemes. We show that a CC-SCA-protected EtS KEM version of CRYSTALS-Kyber requires less than 10% of the cycles required for the CC-SCA-protected FO-based KEM, at the cost of additional data/communication overhead. We additionally show that the cost of protecting the EtS KEM against fault injection attacks, necessarily due to the added signature verification, remains negligible compared to the large cost of masking the FO transform at higher orders. Lastly, we discuss relevant embedded use cases for our EtS KEM construction.
2022
TCHES
Practical Multiple Persistent Faults Analysis
We focus on the multiple persistent faults analysis in this paper to fill existing gaps in its application in a variety of scenarios. Our major contributions are twofold. First, we propose a novel technique to apply persistent fault apply in the multiple persistent faults setting that decreases the number of survived keys and the required data. We demonstrate that by utilizing 1509 and 1448 ciphertexts, the number of survived keys after performing persistent fault analysis on AES in the presence of eight and sixteen faults can be reduced to only 29 candidates, whereas the best known attacks need 2008 and 1643 ciphertexts, respectively, with a time complexity of 250. Second, we develop generalized frameworks for retrieving the key in the ciphertext-only model. Our methods for both performing persistent fault attacks and key-recovery processes are highly flexible and provide a general trade-off between the number of required ciphertexts and the time complexity. To break AES with 16 persistent faults in the Sbox, our experiments show that the number of required ciphertexts can be decreased to 477 while the attack is still practical with respect to the time complexity. To confirm the accuracy of our methods, we performed several simulations as well as experimental validations on the ARM Cortex-M4 microcontroller with electromagnetic fault injection on AES and LED, which are two well-known block ciphers to validate the types of faults and the distribution of the number of faults in practice.
2022
TCHES
2022
TCHES
PreMSat: Preventing Magnetic Saturation Attack on Hall Sensors
Spoofing a passive Hall sensor with fake magnetic fields can inject false data into the downstream of connected systems. Several works have tried to provide a defense against the intentional spoofing to different sensors over the last six years. However, they either only work on active sensors or against externally injected unwanted weak signals (e.g., EMIs, acoustics, ultrasound, etc.), which can only spoof sensor output in its linear region. However, they do not work against a strong magnetic spoofing attack that can drive the passive Hall sensor output in its saturation region. We name this as the saturation attack. In the saturation region, the output gets flattened, and no information can be retrieved, resulting in a denial-of-service attack on the sensor.Our work begins to fill this gap by providing a defense named PreMSat against the saturation attack on passive Hall sensors. The core idea behind PreMSat is that it cangenerate an internal magnetic field having the same strength but in opposite polarity to external magnetic fields injected by an attacker. Therefore, the generated internal magnetic field by PreMSat can nullify the injected external field while preventing: (i) intentional spoofing in the sensor’s linear region, and (ii) saturation attack in the saturation region. PreMSat integrates a low-resistance magnetic path to collect the injected external magnetic fields and utilizes a finely tuned PID controller to nullify the external fields in real-time. PreMSat can prevent the magnetic saturation attack having a strength up to ∼4200 A-t within a frequency range of 0 Hz–30 kHz with low cost (∼$14), whereas the existing works cannot prevent saturation attacks with any strength. Moreover, it works against saturation attacks originating from any type, such as constant, sinusoidal, and pulsating magnetic fields. We did over 300 experiments on ten different industry-used Hall sensors from four different manufacturers to prove the efficacy of PreMSat and found that the correlation coefficient between the signals before the attack and after the attack is greater than 0.94 in every test case. Moreover, we create a prototype of PreMSat and evaluate its performance in a practical system — a grid-tied solar inverter. We find that PreMSat can satisfactorily prevent the saturation attack on passive Hall sensors in real-time.
2022
TCHES
PROLEAD: A Probing-Based Hardware Leakage Detection Tool
Even today, Side-Channel Analysis attacks pose a serious threat to the security of cryptographic implementations fabricated with low-power and nanoscale feature technologies. Fortunately, the masking countermeasures offer reliable protection against such attacks based on simple security assumptions. However, the practical application of masking to a cryptographic algorithm is not trivial, and the designer may overlook possible security flaws, especially when masking a complex circuit. Moreover, abstract models like probing security allow formal verification tools to evaluate masked implementations. However, this is computationally too expensive when dealing with circuits that are not based on composable gadgets. Unfortunately, using composable gadgets comes at some area overhead. As a result, such tools can only evaluate subcircuits, not their compositions, which can become the Achilles’ heel of such masked implementations.In this work, we apply logic simulations to evaluate the security of masked implementations which are not necessarily based on composable gadgets. We developed PROLEAD, an automated tool analyzing the statistical independence of simulated intermediates probed by a robust probing adversary. Compared to the state of the art, our approach (1) does not require any power model as only the state of a gate-level netlist is simulated, (2) can handle masked full cipher implementations, and (3) can detect flaws related to the combined occurrence of glitches and transitions as well as higher-order multivariate leakages. With PROLEAD, we can evaluate masked mplementations that are too complex for existing formal verification tools while being in line with the robust probing model. Through PROLEAD, we have detected security flaws in several publicly-available masked implementations, which have been claimed to be robust probing secure.
2022
TCHES
Quantum Period Finding against Symmetric Primitives in Practice
We present the first complete descriptions of quantum circuits for the offline Simon’s algorithm, and estimate their cost to attack the MAC Chaskey, the block cipher PRINCE and the NIST lightweight finalist AEAD scheme Elephant. These attacks require a reasonable amount of qubits, comparable to the number of qubits required to break RSA-2048. They are faster than other collision algorithms, and the attacks against PRINCE and Chaskey are the most efficient known to date. As Elephant has a key smaller than its state size, the algorithm is less efficient and its cost ends up very close to or above the cost of exhaustive search.We also propose an optimized quantum circuit for boolean linear algebra as well as complete reversible implementations of PRINCE, Chaskey, spongent and Keccak which are of independent interest for quantum cryptanalysis. We stress that our attacks could be applied in the future against today’s communications, and recommend caution when choosing symmetric constructions for cases where long-term security is expected.
2022
TCHES
Racing BIKE: Improved Polynomial Multiplication and Inversion in Hardware
BIKE is a Key Encapsulation Mechanism selected as an alternate candidate in NIST’s PQC standardization process, in which performance plays a significant role in the third round. This paper presents FPGA implementations of BIKE with the best area-time performance reported in literature. We optimize two key arithmetic operations, which are the sparse polynomial multiplication and the polynomial inversion. Our sparse multiplier achieves time-constancy for sparse polynomials of indefinite Hamming weight used in BIKE’s encapsulation. The polynomial inversion is based on the extended Euclidean algorithm, which is unprecedented in current BIKE implementations. Our optimized design results in a 5.5 times faster key generation compared to previous implementations based on Fermat’s little theorem.Besides the arithmetic optimizations, we present a united hardware design of BIKE with shared resources and shared sub-modules among KEM functionalities. On Xilinx Artix-7 FPGAs, our light-weight implementation consumes only 3 777 slices and performs a key generation, encapsulation, and decapsulation in 3 797 μs, 443 μs, and 6 896 μs, respectively. Our high-speed design requires 7 332 slices and performs the three KEM operations in 1 672 μs, 132 μs, and 1 892 μs, respectively.
2022
TCHES
Randomness Optimization for Gadget Compositions in Higher-Order Masking
Physical characteristics of electronic devices, leaking secret and sensitive information to an adversary with physical access, pose a long-known threat to cryptographic hardware implementations. Among a variety of proposed countermeasures against such Side-Channel Analysis attacks, masking has emerged as a promising, but often costly, candidate. Furthermore, the manual realization of masked implementations has proven error-prone and often introduces flaws, possibly resulting in insecure circuits. In the context of automatic masking, a new line of research emerged, aiming to replace each physical gate with a secure gadget that fulfills well-defined properties, guaranteeing security when interconnected to a large circuit. Unfortunately, those gadgets introduce a significant amount of additional overhead into the design, in terms of area, latency, and randomness requirements.In this work, we present a novel approach to reduce the demands for randomness in such gadget-composed circuits by reusing randomness across gadgets while maintaining security in the probing adversary model. To this end, we embedded the corresponding optimization passes into an Electronic Design Automation toolchain, able to construct, optimize, and implement masked circuits, starting from an unprotected design. As such, our security-aware optimization offers an additional building block for existing or new Electronic Design Automation frameworks, where security is considered a first-class design constraint.
2022
TCHES
Redshift: Manipulating Signal Propagation Delay via Continuous-Wave Lasers
We propose a new laser injection attack Redshift that manipulates signal propagation delay, allowing for precise control of oscillator frequencies and other behaviors in delay-sensitive circuits. The target circuits have a significant sensitivity to light, and a low-power continuous-wave laser, similar to a laser pointer, is sufficient for the attack. This is in contrast to previous fault injection attacks that use highpowered laser pulses to flip digital bits. This significantly reduces the cost of the attack and extends the range of possible attackers. Moreover, the attack potentially evades sensor-based countermeasures configured for conventional pulse lasers. To demonstrate Redshift, we target ring-oscillator and arbiter PUFs that are used in cryptographic applications. By precisely controlling signal propagation delays within these circuits, an attacker can control the output of a PUF to perform a state-recovery attack and reveal a secret key. We finally discuss the physical causality of the attack and potential countermeasures.
2022
TCHES
Redundancy AES Masking Basis for Attack Mitigation (RAMBAM)
In this work, we present RAMBAM, a novel concept of designing countermeasures against side-channel attacks and the Statistical Ineffective Fault Attack (specifically SIFA-1) on AES that employs redundant representations of finite field elements. From this concept, we derive a family of protected hardware implementations of AES. A fundamental property of RAMBAM is a security parameter d that along with other attributes of the scheme allows for making trade-offs between gate count, maximal frequency, performance, level of robustness to the first and higher-order side-channel attacks, and protection against SIFA-1. We present an analytical model that explains how the scheme reduces the leakage and how the design choices affect it. Furthermore, we demonstrate experimentally how different design choices achieve the required goals. In particular, the compact version exhibits a gate count as low as 12.075 kGE, while maintaining adequate protection. The performance-oriented version provides latency as low as one round per cycle, thus combining protection against SCA and SIFA-1 with high performance which is one of the original design goals of AES. Finally, we assess the leakage of the scheme for the first and the second (bivariate) orders using TVLA methodology on an FPGA implementation and observe resilience to at least 348M traces with 16 Sboxes.
2022
TCHES
Riding the Waves Towards Generic Single-Cycle Masking in Hardware
Research on the design of masked cryptographic hardware circuits in the past has mostly focused on reducing area and randomness requirements. However, many embedded devices like smart cards and IoT nodes also need to meet certain performance criteria, which is why the latency of masked hardware circuits also represents an important metric for many practical applications.The root cause of latency in masked hardware circuits is the need for additional register stages that synchronize the propagation of shares. Otherwise, glitches would violate the basic assumptions of the used masking scheme. This issue can be addressed to some extent, e.g., by using lightweight cryptographic algorithms with low-degree Sboxes, however, many applications still require the usage of schemes with higher-degree S-boxes like AES. Several recent works have already proposed solutions that help reduce this latency yet they either come with noticeably increased area/randomness requirements, limitations on masking orders, or specific assumptions on the general architecture of the crypto core.In this work, we introduce a generic and efficient method for designing single-cycle glitch-resistant (higher-order) masked hardware of cryptographic S-boxes. We refer to this technique as (generic) Self-Synchronized Masking (“SESYM”). The main idea of our approach is to replace register stages with a partial dual-rail encoding of masked signals that ensures synchronization within the circuit. More concretely, we show that WDDL gates and Muller C-elements can be used in combination with standard masking schemes to design single-cycle S-box circuits that, especially in case of higher-degree S-boxes, have noticeably lower requirements in terms of area and online randomness. We apply our method to DOM-based S-boxes of Ascon and AES and compare the resulting circuits to existing latency optimized circuits based on TI, GLM, and LMDPL. The latency of all three designs is reduced to single-cycle operation and are dth-order secure. Compared to GLM-masked Ascon, our approach comes with a 6.4 times reduction in online randomness for all protection orders. Compared to 1st-order LMDPL-masked AES, our approach achieves comparable results, while it is more generic, amongst others, by also supporting higher-order designs. We also underline the practical protection of our constructions against power analysis attacks via empirical and formal verification approaches.
2022
TCHES
RISC-V Instruction Set Extensions for Lightweight Symmetric Cryptography
The NIST LightWeight Cryptography (LWC) selection process aims to standardise cryptographic functionality which is suitable for resource-constrained devices. Since the outcome is likely to have significant, long-lived impact, careful evaluation of each submission with respect to metrics explicitly outlined in the call is imperative. Beyond the robustness of submissions against cryptanalytic attack, metrics related to their implementation (e.g., execution latency and memory footprint) form an important example. Aiming to provide evidence allowing richer evaluation with respect to such metrics, this paper presents the design, implementation, and evaluation of one separate Instruction Set Extension (ISE) for each of the 10 LWC final round submissions, namely Ascon, Elephant, GIFT-COFB, Grain-128AEADv2, ISAP, PHOTON-Beetle, Romulus, Sparkle, TinyJAMBU, and Xoodyak; although we base the work on use of RISC-V, we argue that it provides more general insight.
2022
TCHES
Risky Translations: Securing TLBs against Timing Side Channels
Microarchitectural side-channel vulnerabilities in modern processors are known to be a powerful attack vector that can be utilized to bypass common security boundaries like memory isolation. As shown by recent variants of transient execution attacks related to Spectre and Meltdown, those side channels allow to leak data from the microarchitecture to the observable architectural state. The vast majority of attacks currently build on the cache-timing side channel, since it is easy to exploit and provides a reliable, fine-grained communication channel. Therefore, many proposals for side-channel secure cache architectures have been made. However, caches are not the only source of side-channel leakage in modern processors and mitigating the cache side channel will inevitably lead to attacks exploiting other side channels. In this work, we focus on defeating side-channel attacks based on page translations.It has been shown that the Translation Lookaside Buffer (TLB) can be exploited in a very similar fashion to caches. Since the main caches and the TLB share many features in their architectural design, the question arises whether existing countermeasures against cache-timing attacks can be used to secure the TLB. We analyze state-ofthe-art proposals for side-channel secure cache architectures and investigate their applicability to TLB side channels. We find that those cache countermeasures are notdirectly applicable to TLBs, and propose TLBcoat, a new side-channel secure TLB architecture. We provide evidence of TLB side-channel leakage on RISC-V-based Linux systems, and demonstrate that TLBcoat prevents this leakage. We implement TLBcoat using the gem5 simulator and evaluate its performance using the PARSEC benchmark suite.
2022
TCHES
Roulette: A Diverse Family of Feasible Fault Attacks on Masked Kyber
At Indocrypt 2021, Hermelink, Pessl, and Pöppelmann presented a fault attack against Kyber in which a system of linear inequalities over the private key is generated and solved. The attack requires a laser and is, understandably, demonstrated with simulations—not actual equipment. We facilitate and diversify the attack in four ways, thereby admitting cheaper and more forgiving fault-injection setups. Firstly, the attack surface is enlarged: originally, the two input operands of the ciphertext comparison are covered, and we additionally cover re-encryption modules such as binomial sampling and butterflies in the last layer of the inverse numbertheoretic transform (INTT). This extra surface also allows an attacker to bypass the custom countermeasure that was proposed in the Indocrypt paper. Secondly, the fault model is relaxed: originally, precise bit flips are required, and we additionally support set-to-0 faults, random faults, arbitrary bit flips, and instruction skips. Thirdly, masking and blinding methods that randomize intermediate variables kindly help our attack, whereas the IndoCrypt attack is like most other fault attacks either hindered or unaltered by countermeasures against passive side-channel analysis (SCA). Randomization helps because we randomly fault intermediate prime-field elements until a desired set of values is hit. If these prime-field elements are represented on a circle, which is a common visualization, our attack is analogous to spinning a roulette wheel until the ball lands in a desired set of pockets. Hence, the nickname. Fourthly, we accelerate and improve the error tolerance of solving the system of linear inequalities: run times of roughly 100 minutes are reduced to roughly one minute, and inequality error rates of roughly 1% are relaxed to roughly 25%. Benefiting from the four advances above, we use a reasonably priced ChipWhisperer® board to break a masked implementation of Kyber running on an ARM Cortex-M4 through clock glitching.
2022
TCHES
Semi-Automatic Locating of Cryptographic Operations in Side-Channel Traces
Locating a cryptographic operation in a side-channel trace, i.e. finding out where it is in the time domain, without having a template, can be a tedious task even for unprotected implementations. The sheer amount of data can be overwhelming. In a simple call to OpenSSL for AES-128 ECB encryption of a single data block, only 0.00028% of the trace relate to the actual AES-128 encryption. The rest is overhead. We introduce the (to our best knowledge) first method to locate a cryptographic operation in a side-channel trace in a largely automated fashion. The method exploits meta information about the cryptographic operation and requires an estimate of its implementation’s execution time.The method lends itself to parallelization and our implementation in a tool greatly benefits from GPU acceleration. The tool can be used offline for trace segmentation and for generating a template which can then be used online in real-time waveformmatching based triggering systems for trace acquisition or fault injection. We evaluate it in six scenarios involving hardware and software implementations of different cryptographic operations executed on diverse platforms. Two of these scenarios cover realistic protocol level use-cases and demonstrate the real-world applicability of our tool in scenarios where classical leakage-detection techniques would not work. The results highlight the usefulness of the tool because it reliably and efficiently automates the task and therefore frees up time of the analyst.The method does not work on traces of implementations protected by effective time randomization countermeasures, e.g. random delays and unstable clock frequency, but is not affected by masking, shuffling and similar countermeasures.
2022
TCHES
Side Channel Attack On Stream Ciphers: A Three-Step Approach To State/Key Recovery
Side Channel Attack (SCA) exploits the physical information leakage (such as electromagnetic emanation) from a device that performs some cryptographic operation and poses a serious threat in the present IoT era. In the last couple of decades, there have been a large body of research works dedicated to streamlining/improving the attacks or suggesting novel countermeasures to thwart those attacks. However, a closer inspection reveals that a vast majority of published works in the context of symmetric key cryptography is dedicated to block ciphers (or similar designs). This leaves the problem for the stream ciphers wide open. There are few works here and there, but a generic and systematic framework appears to be missing from the literature. Motivating by this observation, we explore the problem of SCA on stream ciphers with extensive details. Loosely speaking, our work picks up from the recent TCHES’21 paper by Sim, Bhasin and Jap. We present a framework by extending the efficiency of their analysis, bringing it into more practical terms.In a nutshell, we develop an automated framework that works as a generic tool to perform SCA on any stream cipher or a similar structure. It combines multiple automated tools (such as, machine learning, mixed integer linear programming, satisfiability modulo theory) under one umbrella, and acts as an end-to-end solution (taking side channel traces and returning the secret key). Our framework efficiently handles noisy data and works even after the cipher reaches its pseudo-random state. We demonstrate its efficacy by taking electromagnetic traces from a 32-bit software platform and performing SCA on a high-profile stream cipher, TRIVIUM, which is also an ISO standard. We show pragmatic key recovery on TRIVIUM during its initialization and also after the cipher reaches its pseudo-random state (i.e., producing key-stream).
2022
TCHES
Side-Channel Expectation-Maximization Attacks
Block ciphers are protected against side-channel attacks by masking. On one hand, when the leakage model is unknown, second-order correlation attacks are typically used. On the other hand, when the leakage model can be profiled, template attacks are prescribed. But what if the profiled model does not exactly match that of the attacked device?One solution consists in regressing on-the-fly the scaling parameters from the model. In this paper, we leverage an Expectation-Maximization (EM) algorithm to implement such an attack. The resulting unprofiled EM attack, termed U-EM, is shown to be both efficient (in terms of number of traces) and effective (computationally speaking). Based on synthetic and real traces, we introduce variants of our U-EM attack to optimize its performance, depending on trade-offs between model complexity and epistemic noise. We show that the approach is flexible, in that it can easily be adapted to refinements such as different points of interest and number of parameters in the leakage model.
2022
TCHES
Side-Channel Masking with Common Shares
To counter side-channel attacks, a masking scheme randomly encodes keydependent variables into several shares, and transforms operations into the masked correspondence (called gadget) operating on shares. This provably achieves the de facto standard notion of probing security.We continue the long line of works seeking to reduce the overhead of masking. Our main contribution is a new masking scheme over finite fields in which shares of different variables have a part in common. This enables the reuse of randomness / variables across different gadgets, and reduces the total cost of masked implementation. For security order d and circuit size l, the randomness requirement and computational complexity of our scheme are Õ(d2) and Õ(ld2) respectively, strictly improving upon the state-of-the-art Õ(d2) and Õ(ld3) of Coron et al. at Eurocrypt 2020.A notable feature of our scheme is that it enables a new paradigm in which many intermediates can be precomputed before executing the masked function. The precomputation consumes Õ(ld2) and produces Õ(ld) variables to be stored in RAM. The cost of subsequent (online) computation is reduced to Õ(ld), effectively speeding up e.g., challenge-response authentication protocols. We showcase our method on the AES on ARM Cortex M architecture and perform a T-test evaluation. Our results show a speed-up during the online phase compared with state-of-the-art implementations, at the cost of acceptable RAM consumption and precomputation time.To prove security for our scheme, we propose a new security notion intrinsically supporting randomness / variables reusing across gadgets, and bridging the security of parallel compositions of gadgets to general compositions, which may be of independent interest.
2022
TCHES
SIKE Channels: Zero-Value Side-Channel Attacks on SIKE
We present new side-channel attacks on SIKE, the isogeny-based candidate in the NIST PQC competition. Previous works had shown that SIKE is vulnerable to differential power analysis, and pointed to coordinate randomization as an effective countermeasure. We show that coordinate randomization alone is not sufficient, because SIKE is vulnerable to a class of attacks similar to refined power analysis in elliptic curve cryptography, named zero-value attacks. We describe and confirm in the lab two such attacks leading to full key recovery, and analyze their countermeasures.
2022
TCHES
Single-Trace Side-Channel Attacks on the Toom-Cook: The Case Study of Saber
The Toom-Cook method is a well-known strategy for building algorithms to multiply polynomials efficiently. Along with NTT-based polynomial multiplication, Toom-Cook-based or Karatsuba-based polynomial multiplication algorithms still have regained attention since the start of the NIST’s post-quantum standardization procedure. Compared to the comprehensive analysis done for NTT, the leakage characteristics of Toom-Cook have not been discussed. We analyze the vulnerabilities of Toom-Cook in the reference implementation of Saber, a third round finalist of NIST’s post-quantum standardization process. In this work, we present the first single-trace attack based on the soft-analytical side-channel attack (SASCA) targeting the Toom-Cook. The deep learning-based power analysis is combined with SASCA to decrease the number of templates since there are a large number of similar operations in the Toom-Cook. Moreover, we describe the optimized factor graph and improved belief propagation to make the attack more practical. The feasibility of the attack is verified by evaluation experiments. We also discuss the possible countermeasures to prevent the attack.
2022
TCHES
SIPFA: Statistical Ineffective Persistent Faults Analysis on Feistel Ciphers
Persistent Fault Analysis (PFA) is an innovative and powerful analysis technique in which fault persists throughout the execution. The prior prominent results on PFA were on SPN block ciphers, and the security of Feistel ciphers against this attack has received less attention. In this paper, we introduce a framework to utilize Statistical Ineffective Fault Analysis (SIFA) in the persistent fault setting by proposing Statistical Ineffective Persistent Faults Analysis (SIPFA) that can be efficiently applied to Feistel ciphers in a variety of scenarios. To demonstrate the effectiveness of our technique, we apply SIFPA on three widely used Feistel schemes, DES, 3DES, and Camellia. Our analysis reveals that the secret key of these block ciphers can be extracted with a complexity of at most 250 utilizing a single unknown fault. Furthermore, we demonstrate that the secret can be recovered in a fraction of a second by increasing the adversary’s control over the injected faults. To evaluate SIPFA in a variety of scenarios, we conducted both simulations and real experiments utilizing electromagnetic fault injection on DES and 3DES.
2022
TCHES
SoC Root Canal! Root Cause Analysis of Power Side-Channel Leakage in System-on-Chip Designs
Finding the root cause of power-based side-channel leakage becomes harder when multiple layers of design abstraction are involved. While side-channel leakage originates in processor hardware, the dangerous consequences may only become apparent in the cryptographic software that runs on the processor. This contribution presents RootCanal, a methodology to explain the origin of side-channel leakage in a software program in terms of the underlying micro-architecture and system architecture. We simulate the hardware power consumption at the gate level and perform a non-specific test to identify the logic gates that contribute most sidechannel leakage. Then, we back-annotate those findings to the related activities in the software. The resulting analysis can automatically point out non-trivial causes of side-channel leakages. To illustrate RootCanal’s capabilities, we discuss a collection of case studies.
2022
TCHES
SoK: Fully Homomorphic Encryption over the [Discretized] Torus
First posed as a challenge in 1978 by Rivest et al., fully homomorphic encryption—the ability to evaluate any function over encrypted data—was only solved in 2009 in a breakthrough result by Gentry (Commun. ACM, 2010). After a decade of intense research, practical solutions have emerged and are being pushed for standardization.This paper explains the inner-workings of TFHE, a torus-based fully homomorphic encryption scheme. More exactly, it describes its implementation on a discretized version of the torus. It also explains in detail the technique of the programmable bootstrapping. Numerous examples are provided to illustrate the various concepts and definitions.
2022
TCHES
SoK: SCA-secure ECC in software – mission impossible?
This paper describes an ECC implementation computing the X25519 keyexchange protocol on the Arm Cortex-M4 microcontroller. For providing protections against various side-channel and fault attacks we first review known attacks and countermeasures, then we provide software implementations that come with extensive mitigations, and finally we present a preliminary side-channel evaluation. To our best knowledge, this is the first public software claiming affordable protection against multiple classes of attacks that are motivated by distinct real-world application scenarios. We distinguish between X25519 with ephemeral keys and X25519 with static keys and show that the overhead to our baseline unprotected implementation is about 37% and 243%, respectively. While this might seem to be a high price to pay for security, we also show that even our (most protected) static implementation is at least as efficient as widely-deployed ECC cryptographic libraries, which offer much less protection.
2022
TCHES
SYNFI: Pre-Silicon Fault Analysis of an Open-Source Secure Element
Fault attacks are active, physical attacks that an adversary can leverage to alter the control-flow of embedded devices to gain access to sensitive information or bypass protection mechanisms. Due to the severity of these attacks, manufacturers deploy hardware-based fault defenses into security-critical systems, such as secure elements. The development of these countermeasures is a challenging task due to the complex interplay of circuit components and because contemporary design automation tools tend to optimize inserted structures away, thereby defeating their purpose. Hence, it is critical that such countermeasures are rigorously verified post-synthesis. Since classical functional verification techniques fall short of assessing the effectiveness of countermeasures (due to the circuit being analyzed when no faults are present), developers have to resort to methods capable of injecting faults in a simulation testbench or into a physical chip sample. However, developing test sequences to inject faults in simulation is an error-prone task and performing fault attacks on a chip requires specialized equipment and is incredibly time-consuming. Moreover, identifying the fault-vulnerable circuit is hard in both approaches, and fixing potential design flaws post-silicon is usually infeasible since that would require another tape-out. To that end, this paper introduces SYNFI, a formal pre-silicon fault verification framework that operates on synthesized netlists. SYNFI can be used to analyze the general effect of faults on the input-output relationship in a circuit and its fault countermeasures, and thus enables hardware designers to assess and verify the effectiveness of embedded countermeasures in a systematic and semi-automatic way. The framework automatically extracts sensitive parts of the circuit, induces faults into the extracted subcircuit, and analyzes the faults’ effects using formal methods. To demonstrate that SYNFI is capable of handling unmodified, industry-grade netlists synthesized with commercial and open tools, we analyze OpenTitan, the first opensource secure element. In our analysis, we identified critical security weaknesses in the unprotected AES block, developed targeted countermeasures, reassessed their security, and contributed these countermeasures back to the OpenTitan project. For other fault-hardened IP, such as the life cycle controller, we used SYNFI to confirm that existing countermeasures provide adequate protection.
2022
TCHES
The Best of Two Worlds: Deep Learning-assisted Template Attack
In the last decade, machine learning-based side-channel attacks have become a standard option when investigating profiling side-channel attacks. At the same time, the previous state-of-the-art technique, template attack, started losing its importance and was more considered a baseline to compare against. As such, most of the results reported that machine learning (and especially deep learning) could significantly outperform the template attack. Nevertheless, the template attack still has certain advantages even compared to deep learning. The most significant one is that it has only a few hyperparameters to tune, making it easier to use.We take another look at the template attack, and we devise a feature engineering phase allowing the template attack to compete or even outperform state-of-the-art deep learning-based side-channel attacks. More precisely, with a novel distance metric customized for side-channel analysis, we show how a deep learning technique called similarity learning can be used to find highly efficient embeddings of input data with one-epoch training, which can then be fed into the template attack resulting in powerful attacks
2022
TCHES
The Hidden Parallelepiped Is Back Again: Power Analysis Attacks on Falcon
FALCON is a very efficient and compact lattice-based signature finalist of the NIST’s Post-Quantum standardization campaign. This work assesses Falcon’s sidechannel resistance by analyzing two vulnerabilities, namely the pre-image computation and the trapdoor sampling. The first attack is an improvement of Karabulut and Aysu (DAC 2021). It overcomes several difficulties inherent to the structure of the stored key like the Fourier representation and directly recovers the key with a limited number of traces and a reduced complexity. The main part of this paper is dedicated to our second attack: we show that a simple power analysis during the signature execution could provide the exact value of the output of a subroutine called the base sampler. This intermediate value does not directly lead to the secret and we had toadapt the so-called hidden parallelepiped attack initially introduced by Nguyen and Regev in Eurocrypt 2006 and reused by Ducas and Nguyen in Asiacrypt 2012. We extensively quantify the resources for our attacks and experimentally demonstrate them with FALCON’s reference implementation on the ELMO simulator (McCann, Oswald and Whitnall USENIX 2017) and on a ChipWhisperer Lite with STM32F3 target (ARM Cortex M4).These new attacks highlight the need for side-channel protection for one of the three finalists of NIST’s standardization campaign by pointing out the vulnerable parts and quantifying the resources of the attacks.
2022
TCHES
The Wiretap Channel for Capacitive PUF-Based Security Enclosures
In order to protect devices from physical manipulations, protective security enclosures were developed. However, these battery-backed solutions come with a reduced lifetime, and have to be actively and continuously monitored.In order to overcome these drawbacks, batteryless capacitive enclosures based on Physical Unclonable Functions (PUFs) have been developed that generate a keyencryption-key (KEK) for decryption of the key chain. In order to reproduce the PUF-key reliably and to compensate the effect of noise and environmental influences, the key generation includes error correction codes. However, drilling attacks that aim at partially destroying the enclosure also alter the PUF-response and are subjected to the same error correction procedures. Correcting attack effects, however, is highly undesirable as it would destroy the security concept of the enclosure. In general, designing error correction codes such that they provide tamper-sensitivity to attacks, while still correcting noise and environmental effects is a challenging task.We tackle this problem by first analyzing the behavior of the PUF-response under external influences and different post-processing parameters. From this, we derive a system model of the PUF-based enclosure, and construct a wiretap channel implementation from q-ary polar codes. We verify the obtained error correction scheme in a Monte Carlo simulation and demonstrate that our wiretap channel implementation achieves a physical layer security of 100 bits for 306 bits of entropy for the PUF-secret. Through this, we further develop capacitive PUF-based security enclosures and bring them one step closer to their commercial deployment.
2022
TCHES
Towards a Formal Treatment of Logic Locking
Logic locking aims to protect the intellectual property of a circuit from a fabricator by modifying the original logic of the circuit into a new “locked” circuit such that an entity without the key should not be able to learn anything about the original circuit. While logic locking provides a promising solution to outsourcing the fabrication of chips, unfortunately, several of the proposed logic locking systems have been broken. The lack of established secure techniques stems in part from the absence of a rigorous treatment toward a notion of security for logic locking, and the disconnection between practice and formalisms. We seek to address this gap by introducing formal definitions to capture the desired security of logic locking schemes. In doing so, we investigate prior definitional efforts in this space, and show that these notions either incorrectly model the desired security goals or fail to capture a natural “compositional” property that would be desirable in a logic locking system. Finally we move to constructions. First, we show that universal circuits satisfy our security notions. Second, we show that, in order to do better than universal circuits, cryptographic assumptions are necessary.
2022
TCHES
Transitional Leakage in Theory and Practice: Unveiling Security Flaws in Masked Circuits
Accelerated by the increased interconnection of highly accessible devices, the demand for effective and efficient protection of hardware designs against Side-Channel Analysis (SCA) is ever rising, causing its topical relevance to remain immense in both, academia and industry. Among a wide range of proposed countermeasures against SCA, masking is a highly promising candidate due to its sound foundations and well-understood security requirements. In addition, formal adversary models have been introduced, aiming to accurately capture real-world attack scenarios while remaining sufficiently simple to efficiently reason about the SCA resilience of designs. Here, the d-probing model is the most prominent and well-studied adversary model. Its extension, introduced as the robust d-probing model, covers physical defaults occurring in hardware implementations, particularly focusing on combinational recombinations (glitches), memory recombinations (transitions), and routing recombinations (coupling).With increasing complexity of modern cryptographic designs and logic circuits, formal security verification becomes ever more cumbersome. This started to spark innovative research on automated verification frameworks. Unfortunately, these verification frameworks mostly focus on security verification of hardware circuits in the presence of glitches, but remain limited in identification and verification of transitional leakage. To this end, we extend SILVER, a recently proposed tool for formal security verification of masked logic circuits, to also detect and verify information leakage resulting from combinations of glitches and transitions. Based on extensive case studies, we further confirm the accuracy and practical relevance of our methodology when assessing and verifying information leakage in hardware implementations.
2022
TCHES
Triplex: an Efficient and One-Pass Leakage-Resistant Mode of Operation
This paper introduces and analyzes Triplex, a leakage-resistant mode of operation based on Tweakable Block Ciphers (TBCs) with 2n-bit tweaks. Triplex enjoys beyond-birthday ciphertext integrity in the presence of encryption and decryption leakage in a liberal model where all intermediate computations are leaked in full and only two TBC calls operating a long-term secret are protected with implementationlevel countermeasures. It provides beyond-birthday confidentiality guarantees without leakage, and standard confidentiality guarantees with leakage for a single-pass mode embedding a re-keying process for the bulk of its computations (i.e., birthday confidentiality with encryption leakage under a bounded leakage assumption). Triplex improves leakage-resistant modes of operation relying on TBCs with n-bit tweaks when instantiated with large-tweak TBCs like Deoxys-TBC (a CAESAR competition laureate) or Skinny (used by the Romulus finalist of the NIST lightweight crypto competition). Its security guarantees are maintained in the multi-user setting.
2022
TCHES
VERICA - Verification of Combined Attacks: Automated formal verification of security against simultaneous information leakage and tampering
Physical attacks, including passive Side-Channel Analysis and active Fault Injection Analysis, are considered among the most powerful threats against physical cryptographic implementations. These attacks are well known and research provides many specialized countermeasures to protect cryptographic implementations against them. Still, only a limited number of combined countermeasures, i.e., countermeasures that protect implementations against multiple attacks simultaneously, were proposed in the past. Due to increasing complexity and reciprocal effects, design of efficient and reliable combined countermeasures requires longstanding expertise in hardware design and security. With the help of formal security specifications and adversary models, automated verification can streamline development cycles, increase quality, and facilitate development of robust cryptographic implementations.In this work, we revise and refine formal security notions for combined protection mechanisms and specifically embed them in the context of hardware implementations. Based on this, we present the first automated verification framework that can verify physical security properties of hardware circuits with respect to combined physical attacks. To this end, we conduct several case studies to demonstrate the capabilities and advantages of our framework, analyzing secure building blocks (gadgets), S-boxes build from Toffoli gates, and the ParTI scheme. For the first time, we reveal security flaws in analyzed structures due to reciprocal effects, highlighting the importance of continuously integrating security verification into modern design and development cycles.
2022
TCHES
Verified NTT Multiplications for NISTPQC KEM Lattice Finalists: Kyber, SABER, and NTRU
Postquantum cryptography requires a different set of arithmetic routines from traditional public-key cryptography such as elliptic curves. In particular, in each of the lattice-based NISTPQC Key Establishment finalists, every state-ofthe-art optimized implementation for lattice-based schemes still in the NISTPQC round 3 currently uses a different complex multiplication based on the Number Theoretic Transform. We verify the NTT-based multiplications used in NTRU, Kyber, and SABER for both the AVX2 implementation for Intel CPUs and for the pqm4 implementation for the ARM Cortex M4 using the tool CryptoLine. e extended CryptoLine and as a result are able to verify that in six instances multiplications are correct including range properties.We demonstrate the feasibility for a programmer to verify his or her high-speed assembly code for PQC, as well as to verify someone else’s high-speed PQC software in assembly code, with some cooperation from the programmer.
2022
TCHES
VITI: A Tiny Self-Calibrating Sensor for Power-Variation Measurement in FPGAs
On-chip sensors, built using reconfigurable logic resources in field programmable gate arrays (FPGAs), have been shown to sense variations in signalpropagation delay, supply voltage and power consumption. These sensors have been successfully used to deploy security attacks called Remote Power Analysis (RPA) Attacks on FPGAs. The sensors proposed thus far consume significant logic resources and some of them could be used to deploy power viruses. In this paper, a sensor (named VITI) occupying a far smaller footprint than existing sensors is presented. VITI is a self-calibrating on-chip sensor design, constructed using adjustable delay elements, flip-flops and LUT elements instead of combinational loops, bulky carry chains or latches. Self-calibration enables VITI the autonomous adaptation to differing situations (such as increased power consumption, temperature changes or placement of the sensor in faraway locations from the circuit under attack). The efficacy of VITI for power consumption measurement was evaluated using Remote Power Analysis (RPA) attacks and results demonstrate recovery of a full 128-bit Advanced Encryption Standard (AES) key with only 20,000 power traces. Experiments demonstrate that VITI consumes 1/4th and 1/16th of the area compared to state-of-the-art sensors such as time to digital converters and ring oscillators for similar effectiveness.
2022
TCHES
When Bad News Become Good News: Towards Usable Instances of Learning with Physical Errors
Hard physical learning problems have been introduced as an alternative option to implement cryptosystems based on hard learning problems. Their high-level idea is to use inexact computing to generate erroneous computations directly, rather than to first compute correctly and add errors afterwards. Previous works focused on the applicability of this idea to the Learning Parity with Noise (LPN) problem as a first step, and formalized it as Learning Parity with Physical Noise (LPPN). In this work, we generalize it to the Learning With Errors (LWE) problem, formalized as Learning With Physical Errors (LWPE). We first show that the direct application of the design ideas used for LPPN prototypes leads to a new source of (mathematical) data dependencies in the error distributions that can reduce the security of the underlying problem. We then show that design tweaks can be used to avoid this issue, making LWPE samples natively robust against such data dependencies. We additionally put forward that these ideas open a quite wide design space that could make hard physical learning problems relevant in various applications. And we conclude by presenting a first prototype FPGA design confirming our claims.
2022
TCHES
When the Decoder Has to Look Twice: Glitching a PUF Error Correction
Physical Unclonable Functions (PUFs) have been increasingly used as an alternative to non-volatile memory for the storage of cryptographic secrets. Research on side channel and fault attacks with the goal of extracting these secrets has begun to gain interest but no fault injection attack targeting the necessary error correction within a PUF device has been shown so far. This work demonstrates one such attack on a hardware fuzzy commitment scheme implementation and thus shows a new potential attack threat existing in current PUF key storage systems. After presenting evidence for the overall viability of the profiled attack by performing it on an FPGA implementation, countermeasures are analysed: we discuss the efficacy of hashing helper data with the PUF-derived key to prevent the attack as well as codeword masking, a countermeasure effective against a side channel attack. The analysis shows the limits of these approaches. First, we demonstrate the criticality of timing in codeword masking by confirming the attack’s effectiveness on ostensibly protected hardware. Second, our work shows a successful attack without helper data manipulation and thus the potential for sidestepping helper data hashing countermeasures.
2022
TCHES
Will You Cross the Threshold for Me? Generic Side-Channel Assisted Chosen-Ciphertext Attacks on NTRU-based KEMs
In this work, we propose generic and novel side-channel assisted chosenciphertext attacks on NTRU-based key encapsulation mechanisms (KEMs). These KEMs are IND-CCA secure, that is, they are secure in the chosen-ciphertext model. Our attacks involve the construction of malformed ciphertexts. When decapsulated by the target device, these ciphertexts ensure that a targeted intermediate variable becomes very closely related to the secret key. An attacker, who can obtain information about the secret-dependent variable through side-channels, can subsequently recover the full secret key. We propose several novel CCAs which can be carried through by using side-channel leakage from the decapsulation procedure. The attacks instantiate three different types of oracles, namely a plaintext-checking oracle, a decryptionfailure oracle, and a full-decryption oracle, and are applicable to two NTRU-based schemes, which are NTRU and NTRU Prime. The two schemes are candidates in the ongoing NIST standardization process for post-quantum cryptography. We perform experimental validation of the attacks on optimized and unprotected implementations of NTRU-based schemes, taken from the open-source pqm4 library, using the EM-based side-channel on the 32-bit ARM Cortex-M4 microcontroller. All of our proposed attacks are capable of recovering the full secret key in only a few thousand chosen ciphertext queries on all parameter sets of NTRU and NTRU Prime. Our attacks, therefore, stress on the need for concrete side-channel protection strategies for NTRUbased KEMs.