International Association for Cryptologic Research

International Association
for Cryptologic Research

CryptoDB

Papers from Transactions on Cryptographic Hardware and Embedded Systems 2020

Year
Venue
Title
2020
TCHES
A Compact and Scalable Hardware/Software Co-design of SIKE 📺
We present efficient and compact hardware/software co-design implementations of the Supersingular Isogeny Key Encapsulation (SIKE) protocol on field-programmable gate arrays (FPGAs). In order to be better equipped for different post-quantum scenarios, our architectures were designed to feature high-flexibility by covering all the currently available parameter sets and with support for primes up to 1016 bits. In particular, any of the current SIKE parameters equivalent to the post-quantum security of AES-128/192/256 and SHA3-256 can be selected and run on-the-fly. This security scalability property, together with the small footprint and efficiency of our architectures, makes them ideal for embedded applications in a post-quantum world. In addition, the proposed implementations exhibit regular, constant-time execution, which provides protection against timing and simple sidechannel attacks. Our results demonstrate that supersingular isogeny-based primitives such as SIDH and SIKE can indeed be deployed for embedded applications featuring competitive performance. For example, our smallest architecture based on a 128-bit MAC unit takes only 3415 slices, 21 BRAMs and 57 DSPs on a Virtex 7 690T and can perform key generation, encapsulation and decapsulation in 14.4, 24.4 and 26.0 milliseconds for SIKEp434 and in 52.3, 86.4 and 93.2 milliseconds for SIKEp751, respectively.
2020
TCHES
A Fast and Accurate Guessing Entropy Estimation Algorithm for Full-key Recovery 📺
Guessing entropy (GE) is a widely adopted metric that measures the average computational cost needed for a successful side-channel analysis (SCA). However, with current estimation methods where the evaluator has to average the correct key rank over many independent side-channel leakage measurement sets, full-key GE estimation is impractical due to its prohibitive computing requirement. A recent estimation method based on posterior probabilities, although scalable, is not accurate.We propose a new guessing entropy estimation algorithm (GEEA) based on theoretical distributions of the ranking score vectors. By discovering the relationship of GE with pairwise success rates and utilizing it, GEEA uses a sum of many univariate Gaussian probabilities instead of multi-variate Gaussian probabilities, significantly improving the computation efficiency.We show that GEEA is more accurate and efficient than all current GE estimations. To the best of our knowledge, it is the only practical full-key GE evaluation on given experimental data sets which the evaluator has access to. Moreover, it can accurately predict the GE for larger sizes than the experimental data sets, providing comprehensive security evaluation.
2020
TCHES
A Hybrid-CPU-FPGA-based Solution to the Recovery of Sha256crypt-hashed Passwords 📺
This paper presents an accelerator design for the password recovery of sha256crypt based on hybrid CPU-FPGA devices. By applying the brute-force attack computation model proposed in this paper, we decompose the sha256crypt function into two types of operations, namely the data dispatching and the block transforming. The data dispatching operation generates message blocks and the block transforming operation transforms message blocks into digests. These two operations are efficiently accelerated by the customized data dispatch unit and the pipelined block transform unit, respectively. Difficulties of adopting the pipeline technique are addressed also with the following techniques. The group scheduling is used to solve the data dependency that stalls the pipeline. The look-ahead execution eliminates the uncertainty of the execution path. The data path pruning and spatial-temporal multiplexing reduce the resource overhead of non-computing units.The proposed accelerator design is implemented and evaluated on the Xilinx Zynq-7000 XC7Z030-3 SoC. Our experimental results show that the proposed accelerator can improve energy efficiency by 2.54x over the state-of-the-art password recovery tool Hashcat running on an NVIDIA GTX1080Ti GPU. Compared with the pure FPGA-based implementation in John-the-Ripper, the proposed accelerator improves energy efficiency by 1.64x and improves resource efficiency by 1.69x.
2020
TCHES
A Novel Evaluation Metric for Deep Learning-Based Side Channel Analysis and Its Extended Application to Imbalanced Data 📺
Since Kocher (CRYPTO’96) proposed timing attack, side channel analysis (SCA) has shown great potential to break cryptosystems via physical leakage. Recently, deep learning techniques are widely used in SCA and show equivalent and even better performance compared to traditional methods. However, it remains unknown why and when deep learning techniques are effective and efficient for SCA. Masure et al. (IACR TCHES 2020(1):348–375) illustrated that deep learning paradigm is suitable for evaluating implementations against SCA from a worst-case scenario point of view, yet their work is limited to balanced data and a specific loss function. Besides, deep learning metrics are not consistent with side channel metrics. In most cases, they are deceptive in foreseeing the feasibility and complexity of mounting a successful attack, especially for imbalanced data. To mitigate the gap between deep learning metrics and side channel metrics, we propose a novel Cross Entropy Ratio (CER) metric to evaluate the performance of deep learning models for SCA. CER is closely related to traditional side channel metrics Guessing Entropy (GE) and Success Rate (SR) and fits to deep learning scenario. Besides, we show that it works stably while deep learning metrics such as accuracy becomes rather unreliable when the training data tends to be imbalanced. However, estimating CER can be done as easy as natural metrics in deep learning algorithms with low computational complexity. Furthermore, we adapt CER metric to a new kind of loss function, namely CER loss function, designed specifically for deep learning in side channel scenario. In this way, we link directly the SCA objective to deep learning optimization. Our experiments on several datasets show that, for SCA with imbalanced data, CER loss function outperforms Cross Entropy loss function in various conditions.
2020
TCHES
Compact Dilithium Implementations on Cortex-M3 and Cortex-M4 📺
We present implementations of the lattice-based digital signature scheme Dilithium for ARM Cortex-M3 and ARM Cortex-M4. Dilithium is one of the three signature finalists of the NIST post-quantum cryptography competition. As our Cortex-M4 target, we use the popular STM32F407-DISCOVERY development board. Compared to the previous speed records on the Cortex-M4 by Ravi, Gupta, Chattopadhyay, and Bhasin we speed up the key operations NTT and NTT−1 by 20% which together with other optimizations results in speedups of 7%, 15%, and 9% for Dilithium3 key generation, signing, and verification respectively. We also present the first constant-time Dilithium implementation on the Cortex-M3 and use the Arduino Due for benchmarks. For Dilithium3, we achieve on average 2 562 kilocycles for key generation, 10 667 kilocycles for signing, and 2 321 kilocycles for verification.Additionally, we present stack consumption optimizations applying to both our Cortex- M3 and Cortex-M4 implementation. Due to the iterative nature of the Dilithium signing algorithm, there is no optimal way to achieve the best speed and lowest stack consumption at the same time. We present three different strategies for the signing procedure which allow trading more stack and flash memory for faster speed or viceversa. Our implementation of Dilithium3 with the smallest memory footprint uses less than 12kB. As an additional output of this work, we present the first Cortex-M3 implementations of the key-encapsulation schemes NewHope and Kyber.
2020
TCHES
Concrete quantum cryptanalysis of binary elliptic curves 📺
This paper analyzes and optimizes quantum circuits for computing discrete logarithms on binary elliptic curves, including reversible circuits for fixed-base-point scalar multiplication and the full stack of relevant subroutines. The main optimization target is the size of the quantum computer, i.e., the number of logical qubits required, as this appears to be the main obstacle to implementing Shor’s polynomial-time discrete-logarithm algorithm. The secondary optimization target is the number of logical Toffoli gates. For an elliptic curve over a field of 2n elements, this paper reduces the number of qubits to 7n + ⌊log2(n)⌋ + 9. At the same time this paper reduces the number of Toffoli gates to 48n3 + 8nlog2(3)+1 + 352n2 log2(n) + 512n2 + O(nlog2(3)) with double-and-add scalar multiplication, and a logarithmic factor smaller with fixed-window scalar multiplication. The number of CNOT gates is also O(n3). Exact gate counts are given for various sizes of elliptic curves currently used for cryptography.
2020
TCHES
Cortex-M4 optimizations for {R,M} LWE schemes 📺
This paper proposes various optimizations for lattice-based key encapsulation mechanisms (KEM) using the Number Theoretic Transform (NTT) on the popular ARM Cortex-M4 microcontroller. Improvements come in the form of a faster code using more efficient modular reductions, optimized small-degree polynomial multiplications, and more aggressive layer merging in the NTT, but also in the form of reduced stack usage. We test our optimizations in software implementations of Kyber and NewHope, both round 2 candidates in the NIST post-quantum project, and also NewHope-Compact, a recently proposed variant of NewHope with smaller parameters. Our software is the first implementation of NewHope-Compact on the Cortex-M4 and shows speed improvements over previous high-speed implementations of Kyber and NewHope. Moreover, it gives a common framework to compare those schemes with the same level of optimization. Our results show that NewHope- Compact is the fastest scheme, followed by Kyber, and finally NewHope, which seems to suffer from its large modulus and error distribution for small dimensions.
2020
TCHES
CPAmap: On the Complexity of Secure FPGA Virtualization, Multi-Tenancy, and Physical Design 📺
With virtualized Field Programmable Gate Arrays (FPGAs) on the verge of being deployed to the cloud computing domain, there is a rising interest in resolving recently identified security issues. Those issues result from different trusted and untrusted entities sharing the FPGA fabric and the Power Distribution Network. Researchers were able to perform both side-channel and fault attacks between logically isolated designs on the same FPGA fabric, compromising security of cryptographic modules and other critical implementations. Side-channel attacks specifically are enabled by the vast degree of freedom given to developers when making use of the basic FPGA resources. Both ring oscillators as well as long delay lines, implemented using low-level FPGA primitives, have been shown to provide sufficient data for simple or correlation-based power analysis attacks. In order to develop new or apply known countermeasures onto designs and implementations in a virtualized multi-tenant FPGA, we seek to fully understand the underlying mechanisms and dependencies of chip-internal side-channel attacks. Although the impact of process variation and other physical design parameters on side-channel vulnerability has been investigated in previous works, remote attacks between logically isolated partitions in multi-tenant FPGAs introduce new and unique challenges. Thus, we systematically analyze the impact of physical mapping of both attacker and victim design on the success of correlation power analysis attacks on the Advanced Encryption Standard (AES). We report our findings on a Xilinx Zynq 7000-based platform, which show that the effect of global and local placement as well as routing and process variation on the success of side-channel attacks almost exceeds the impact of hiding countermeasures. This result reveals fundamental challenges in secure virtualization of FPGAs, which have been mostly ignored so far. Eventually, our results may also help vendors and hypervisors in developing zero overhead side-channel countermeasures based on adequate global and local placement of isolated designs on a multi-tenant FPGA.
2020
TCHES
DANA Universal Dataflow Analysis for Gate-Level Netlist Reverse Engineering 📺
Reverse engineering of integrated circuits, i.e., understanding the internals of Integrated Circuits (ICs), is required for many benign and malicious applications. Examples of the former are detection of patent infringements, hardware Trojans or Intellectual Property (IP)-theft, as well as interface recovery and defect analysis, while malicious applications include IP-theft and finding insertion points for hardware Trojans. However, regardless of the application, the reverse engineer initially starts with a large unstructured netlist, forming an incomprehensible sea of gates.This work presents DANA, a generic, technology-agnostic, and fully automated dataflow analysis methodology for flattened gate-level netlists. By analyzing the flow of data between individual Flip Flops (FFs), DANA recovers high-level registers. The key idea behind DANA is to combine independent metrics based on structural and control information with a powerful automated architecture. Notably, DANA works without any thresholds, scenario-dependent parameters, or other “magic” values that the user must choose. We evaluate DANA on nine modern hardware designs, ranging from cryptographic co-processors, over CPUs, to the OpenTitan, a stateof- the-art System-on-Chip (SoC), which is maintained by the lowRISC initiative with supporting industry partners like Google and Western Digital. Our results demonstrate almost perfect recovery of registers for all case studies, regardless whether they were synthesized as FPGA or ASIC netlists. Furthermore, we explore two applications for dataflow analysis: we show that the raw output of DANA often already allows to identify crucial components and high-level architecture features and also demonstrate its applicability for detecting simple hardware Trojans.Hence, DANA can be applied universally as the first step when investigating unknown netlists and provides major guidance for human analysts by structuring and condensing the otherwise incomprehensible sea of gates. Our implementation of DANA and all synthesized netlists are available as open source on GitHub.
2020
TCHES
DAPA: Differential Analysis aided Power Attack on (Non-) Linear Feedback Shift Registers 📺
Differential power analysis (DPA) is a form of side-channel analysis (SCA) that performs statistical analysis on the power traces of cryptographic computations. DPA is applicable to many cryptographic primitives, including block ciphers, stream ciphers and even hash-based message authentication code (HMAC). At COSADE 2017, Dobraunig et al. presented a DPA on the fresh re-keying scheme Keymill to extract the bit relations of neighbouring bits in its shift registers, reducing the internal state guessing space from 128 to 4 bits. In this work, we generalise their methodology and combine with differential analysis, we called it differential analysis aided power attack (DAPA), to uncover more bit relations and take into account the linear or non-linear functions that feedback to the shift registers (i.e. LFSRs or NLFSRs). Next, we apply our DAPA on LR-Keymill, the improved version of Keymill designed to resist the aforementioned DPA, and breaks its 67.9-bit security claim with a 4-bit internal state guessing. We experimentally verified our analysis. In addition, we improve the previous DPA on Keymill by halving the amount of data resources needed for the attack. We also applied our DAPA to Trivium, a hardware-oriented stream cipher from the eSTREAM portfolio and reduces the key guessing space from 80 to 14 bits.
2020
TCHES
Defeating State-of-the-Art White-Box Countermeasures with Advanced Gray-Box Attacks 📺
The goal of white-box cryptography is to protect secret keys embedded in a cryptographic software deployed in an untrusted environment. In this article, we revisit state-of-the-art countermeasures employed in white-box cryptography, and we discuss possible ways to combine them. Then we analyze the different gray-box attack paths and study their performances in terms of required traces and computation time. Afterward, we propose a new paradigm for the gray-box attack against white-box cryptography, which exploits the data-dependency of the target implementation. We demonstrate that our approach provides substantial complexity improvements over the existing attacks. Finally, we showcase this new technique by breaking the three winning AES-128 white-box implementations from WhibOx 2019 white-box cryptography competition.
2020
TCHES
Dismantling DST80-based Immobiliser Systems 📺
Car manufacturers deploy vehicle immobiliser systems in order to prevent car theft. However, in many cases the underlying cryptographic primitives used to authenticate a transponder are proprietary in nature and thus not open to public scrutiny. In this paper we publish the proprietary Texas Instruments DST80 cipher used in immobilisers of several manufacturers. Additionally, we expose serious flaws in immobiliser systems of major car manufacturers such as Toyota, Kia, Hyundai and Tesla. Specifically, by voltage glitching the firmware protection mechanisms of the microcontroller, we extracted the firmware from several immobiliser ECUs and reverse engineered the key diversification schemes employed within. We discovered that Kia and Hyundai immobiliser keys have only three bytes of entropy and that Toyota only relies on publicly readable information such as the transponder serial number and three constants to generate cryptographic keys. Furthermore, we present several practical attacks which can lead to recovering the full 80-bit cryptographic key in a matter of seconds or permanently disabling the transponder. Finally, even without key management or configuration issues, we demonstrate how an attacker can recover the cryptographic key using a profiled side-channel attack. We target the key loading procedure and investigate the practical applicability in the context of portability. Our work once again highlights the issues automotive vendors face in implementing cryptography securely.
2020
TCHES
Doppelganger Obfuscation — Exploring theDefensive and Offensive Aspects of Hardware Camouflaging 📺
Hardware obfuscation is widely used in practice to counteract reverse engineering. In recent years, low-level obfuscation via camouflaged gates has been increasingly discussed in the scientific community and industry. In contrast to classical high-level obfuscation, such gates result in recovery of an erroneous netlist. This technology has so far been regarded as a purely defensive tool. We show that low-level obfuscation is in fact a double-edged sword that can also enable stealthy malicious functionalities.In this work, we present Doppelganger, the first generic design-level obfuscation technique that is based on low-level camouflaging. Doppelganger obstructs central control modules of digital designs, e.g., Finite State Machines (FSMs) or bus controllers, resulting in two different design functionalities: an apparent one that is recovered during reverse engineering and the actual one that is executed during operation. Notably, both functionalities are under the designer’s control.In two case studies, we apply Doppelganger to a universal cryptographic coprocessor. First, we show the defensive capabilities by presenting the reverse engineer with a different mode of operation than the one that is actually executed. Then, for the first time, we demonstrate the considerable threat potential of low-level obfuscation. We show how an invisible, remotely exploitable key-leakage Trojan can be injected into the same cryptographic coprocessor just through obfuscation. In both applications of Doppelganger, the resulting design size is indistinguishable from that of an unobfuscated design, depending on the choice of encodings.
2020
TCHES
Efficient and Private Computations with Code-Based Masking 📺
Code-based masking is a very general type of masking scheme that covers Boolean masking, inner product masking, direct sum masking, and so on. The merits of the generalization are twofold. Firstly, the higher algebraic complexity of the sharing function decreases the information leakage in “low noise conditions” and may increase the “statistical security order” of an implementation (with linear leakages). Secondly, the underlying error-correction codes can offer improved fault resistance for the encoded variables. Nevertheless, this higher algebraic complexity also implies additional challenges. On the one hand, a generic multiplication algorithm applicable to any linear code is still unknown. On the other hand, masking schemes with higher algebraic complexity usually come with implementation overheads, as for example witnessed by inner-product masking. In this paper, we contribute to these challenges in two directions. Firstly, we propose a generic algorithm that allows us (to the best of our knowledge for the first time) to compute on data shared with linear codes. Secondly, we introduce a new amortization technique that can significantly mitigate the implementation overheads of code-based masking, and illustrate this claim with a case study. Precisely, we show that, although performing every single code-based masked operation is relatively complex, processing multiple secrets in parallel leads to much better performances. This property enables code-based masked implementations of the AES to compete with the state-of-the-art in randomness complexity. Since our masked operations can be instantiated with various linear codes, we hope that these investigations open new avenues for the study of code-based masking schemes, by specializing the codes for improved performances, better side-channel security or improved fault tolerance.
2020
TCHES
Exploring Crypto-Physical Dark Matter and Learning with Physical Rounding: Towards Secure and Efficient Fresh Re-Keying 📺
State-of-the-art re-keying schemes can be viewed as a tradeoff between efficient but heuristic solutions based on binary field multiplications, that are only secure if implemented with a sufficient amount of noise, and formal but more expensive solutions based on weak pseudorandom functions, that remain secure if the adversary accesses their output in full. Recent results on “crypto dark matter” (TCC 2018) suggest that low-complexity pseudorandom functions can be obtained by mixing linear functions over different small moduli. In this paper, we conjecture that by mixing some matrix multiplications in a prime field with a physical mapping similar to the leakage functions exploited in side-channel analysis, we can build efficient re-keying schemes based on “crypto-physical dark matter”, that remain secure against an adversary who can access noise-free measurements. We provide first analyzes of the security and implementation properties that such schemes provide. Precisely, we first show that they are more secure than the initial (heuristic) proposal by Medwed et al. (AFRICACRYPT 2010). For example, they can resist attacks put forward by Belaid et al. (ASIACRYPT 2014), satisfy some relevant cryptographic properties and can be connected to a “Learning with Physical Rounding” problem that shares some similarities with standard learning problems. We next show that they are significantly more efficient than the weak pseudorandom function proposed by Dziembowski et al. (CRYPTO 2016), by exhibiting hardware implementation results.
2020
TCHES
Faster Montgomery and double-add ladders for short Weierstrass curves 📺
The Montgomery ladder and Joye ladder are well-known algorithms for elliptic curve scalar multiplication with a regular structure. The Montgomery ladder is best known for its implementation on Montgomery curves, which requires 5M+4S+1m+8A per scalar bit, and 6 field registers. Here (M, S,m,A) represent respectively field Multiplications, Squarings, multiplications by a curve constant, and Additions or subtractions. This ladder is also complete, meaning that it works on all input points and all scalars. Many protocols do not use Montgomery curves, but instead use prime-order curves in short Weierstrass form. These have historically been much slower, with ladders costing at least 14 multiplications or squarings per bit: 8M + 6S + 27A for the Montgomery ladder and 8M+ 6S + 30A for the Joye ladder. In 2017, Kim et al. improved the Montgomery ladder to 8M+ 4S + 12A + 1H per bit using 9 registers, where the H represents a halving. Hamburg simplified Kim et al.’s formulas to 8M+ 4S + 8A + 1H per bit using 6 registers. Here we present improved formulas which compute the Montgomery ladder on short Weierstrass curves using 8M+ 3S + 7A per bit, and requiring 6 registers. We also give formulas for the Joye ladder that use 9M+3S+7A per bit, requiring 5 registers. One of our new formulas supports very efficient 4-way vectorization. We also discuss curve invariants, exceptional points, side-channel protection and how to set up and finish these ladder operations. Finally, we show a novel technique to make these ladders complete when the curve order is not divisible by 2 or 3, at a modest increase in cost. A sample implementation of these techniques is given in the supplementary material, also posted at https://github.com/bitwiseshiftleft/ladder_formulas
2020
TCHES
Fault Injection as an Oscilloscope: Fault Correlation Analysis 📺
Fault Injection (FI) attacks have become a practical threat to modern cryptographic implementations. Such attacks have recently focused more on exploitation of implementation-centric and device-specific properties of the faults. In this paper, we consider the parallel between SCA attacks and FI attacks; specifically, that many FI attacks rely on the data-dependency of activation and propagation of a fault, and SCA attacks similarly rely on data-dependent power usage. In fact, these are so closely related that we show that existing SCA attacks can be directly applied in a purely FI setting, by translating power FI results to generate FI ‘probability traces’ as an analogue of power traces. We impose only the requirements of the equivalent SCA attack (e.g., knowledge of the input plaintext for CPA on the first round), along with a way to observe the status of the target (whether or not it has failed and been “muted” after a fault). We also analyse existing attacks such as Fault Template Analysis in the light of this parallel, and discuss the limitations of our methodology. To demonstrate that our attacks are practical, we first show that SPA can be used to recover RSA private exponents using FI attacks. Subsequently, we show the generic nature of our attacks by performing DPA on AES after applying FI attacks to several different targets (with AVR, 32-bit ARM and RISC-V CPUs), using different software on each target, and do so with a low-cost (i.e., less than $50) power fault injection setup. We call this technique Fault Correlation Analysis (FCA), since we perform CPA on fault probability traces. To show that this technique is not limited to software, we also present FCA results against the hardware AES engine supported by one of our targets. Our results show that even without access to the ciphertext (e.g., where an FI redundancy countermeasure is in place, or where ciphertext is simply not exposed to an attacker in any circumstance) and in the presence of light jitter, FCA attacks can successfully recover keys on each of these targets.
2020
TCHES
FEDS: Comprehensive Fault Attack Exploitability Detection for Software Implementations of Block Ciphers 📺
Fault injection attacks are one of the most powerful forms of cryptanalytic attacks on ciphers. A single, precisely injected fault during the execution of a cipher like the AES, can completely reveal the key within a few milliseconds. Software implementations of ciphers, therefore, need to be thoroughly evaluated for such attacks. In recent years, automated tools have been developed to perform these evaluations. These tools either work on the cipher algorithm or on their implementations. Tools that work at the algorithm level can provide a comprehensive assessment of fault attack vulnerability for different fault attacks and with different fault models. Their application is, however, restricted because every realization of the cipher has unique vulnerabilities. On the other hand, tools that work on cipher implementations have a much wider application but are often restricted by the range of fault attacks and the number of fault models they can evaluate.In this paper, we propose a framework, called FEDS, that uses a combination of compiler techniques and model checking to merge the advantages of both, algorithmic level tools as well as implementation level tools. Like the algorithmic level tools, FEDS can provide a comprehensive assessment of fault attack exploitability considering a wide range of fault attacks and fault models. Like implementation level tools, FEDS works with implementations, therefore has wide application. We demonstrate the versatility of FEDS by evaluating seven different implementations of AES (including bitsliced implementation) and implementations of CLEFIA and CAMELLIA for Differential Fault Attacks. The framework automatically identifies exploitable instructions in all implementations. Further, we present an application of FEDS in a Fault Attack Aware Compiler, that can automatically identify and protect exploitable regions of the code. We demonstrate that the compiler can generate significantly more efficient code than a naïvely protected equivalent, while maintaining the same level of protection.
2020
TCHES
FENL: an ISE to mitigate analogue micro-architectural leakage 📺
Ge et al. [GYH18] propose the augmented ISA (or aISA), a central tenet of which is the selective exposure of micro-architectural resources via a less opaque abstraction than normal. The aISA proposal is motivated by the need for control over such resources, for example to implement robust countermeasures against microarchitectural attacks. In this paper, we apply an aISA-style approach to challenges stemming from analogue micro-architectural leakage; examples include power-based Hamming weight and distance leakage from relatively fine-grained resources (e.g., pipeline registers), which are not exposed in, and so cannot be reliably controlled via, a normal ISA. Specifically, we design, implement, and evaluate an ISE named FENL: the ISE acts as a fence for leakage, preventing interaction between, and hence leakage from, instructions before and after it in program order. We demonstrate that the implementation and use of FENL has relatively low overhead, and represents an effective tool for systematically localising and reducing leakage.
2020
TCHES
Fill your Boots: Enhanced Embedded Bootloader Exploits via Fault Injection and Binary Analysis 📺
The bootloader of an embedded microcontroller is responsible for guarding the device’s internal (flash) memory, enforcing read/write protection mechanisms. Fault injection techniques such as voltage or clock glitching have been proven successful in bypassing such protection for specific microcontrollers, but this often requires expensive equipment and/or exhaustive search of the fault parameters. When multiple glitches are required (e.g., when countermeasures are in place) this search becomes of exponential complexity and thus infeasible. Another challenge which makes embedded bootloaders notoriously hard to analyse is their lack of debugging capabilities.This paper proposes a grey-box approach that leverages binary analysis and advanced software exploitation techniques combined with voltage glitching to develop a powerful attack methodology against embedded bootloaders. We showcase our techniques with three real-world microcontrollers as case studies: 1) we combine static and on-chip dynamic analysis to enable a Return-Oriented Programming exploit on the bootloader of the NXP LPC microcontrollers; 2) we leverage on-chip dynamic analysis on the bootloader of the popular STM8 microcontrollers to constrain the glitch parameter search, achieving the first fully-documented multi-glitch attack on a real-world target; 3) we apply symbolic execution to precisely aim voltage glitches at target instructions based on the execution path in the bootloader of the Renesas 78K0 automotive microcontroller. For each case study, we show that using inexpensive, open-design equipment, we are able to efficiently breach the security of these microcontrollers and get full control of the protected memory, even when multiple glitches are required. Finally, we identify and elaborate on several vulnerable design patterns that should be avoided when implementing embedded bootloaders.
2020
TCHES
Fixslicing AES-like Ciphers: New bitsliced AES speed records on ARM-Cortex M and RISC-V 📺
The fixslicing implementation strategy was originally introduced as a new representation for the hardware-oriented GIFT block cipher to achieve very efficient software constant-time implementations. In this article, we show that the fundamental idea underlying the fixslicing technique is not of interest only for GIFT, but can be applied to other ciphers as well. Especially, we study the benefits of fixslicing in the case of AES and show that it allows to reduce by 52% the amount of operations required by the linear layer when compared to the current fastest bitsliced implementation on 32-bit platforms. Overall, we report that fixsliced AES-128 allows to reach 80 and 91 cycles per byte on ARM Cortex-M and E31 RISC-V processors respectively (assuming pre-computed round keys), improving the previous records on those platforms by 21% and 26%. In order to highlight that our work also directly improves masked implementations that rely on bitslicing, we report implementation results when integrating first-order masking that outperform by 12% the fastest results reported in the literature on ARM Cortex-M4. Finally, we demonstrate the genericity of the fixslicing technique for AES-like designs by applying it to the Skinny-128 tweakable block ciphers.
2020
TCHES
Fixslicing: A New GIFT Representation: Fast Constant-Time Implementations of GIFT and GIFT-COFB on ARM Cortex-M 📺
The GIFT family of lightweight block ciphers, published at CHES 2017, offers excellent hardware performance figures and has been used, in full or in part, in several candidates of the ongoing NIST lightweight cryptography competition. However, implementation of GIFT in software seems complex and not efficient due to the bit permutation composing its linear layer (a feature shared with PRESENT cipher). In this article, we exhibit a new non-trivial representation of the GIFT family of block ciphers over several rounds. This new representation, that we call fixslicing, allows extremely efficient software bitsliced implementations of GIFT, using only a few rotations, surprisingly placing GIFT as a very efficient candidate on micro-controllers. Our constant time implementations show that, on ARM Cortex-M3, 128-bit data can be ciphered with only about 800 cycles for GIFT-64 and about 1300 cycles for GIFT-128 (assuming pre-computed round keys). In particular, this is much faster than the impressive PRESENT implementation published at CHES 2017 that requires 2116 cycles in the same setting, or the current best AES constant time implementation reported that requires 1617 cycles. This work impacts GIFT, but also improves software implementations of all other cryptographic primitives directly based on it or strongly related to it.
2020
TCHES
From A to Z: Projective coordinates leakage in the wild 📺
At EUROCRYPT 2004, Naccache et al. showed that the projective coordinates representation of the resulting point of an elliptic curve scalar multiplication potentially allows to recover some bits of the scalar. However, this attack has received little attention by the scientific community, and the status of deployed mitigations to prevent it in widely adopted cryptography libraries is unknown. In this paper, we aim to fill this gap, by analyzing several cryptography libraries in this context. To demonstrate the applicability of the attack, we use a side-channel attack to exploit this vulnerability within libgcrypt in the context of ECDSA. To the best of our knowledge, this is the first practical attack instance. It targets the insecure binary extended Euclidean algorithm implementation using a microarchitectural side-channel attack that allows recovering the projective representation of the output point of scalar multiplication during ECDSA signature generation. We captured 100k traces to estimate the number of traces an attacker would need to compromise the libgcrypt ECDSA implementation, resulting in less than 2k for commonly used elliptic curve secp256r1, demonstrating the attack feasibility. During exploitation, we found two additional vulnerabilities. However, we remark the purpose of this paper is not merely exploiting a library but about providing an analysis on the projective coordinates vulnerability status in widely deployed open-source libraries, filling a gap between its original description in the academic literature and the adoption of countermeasures to thwart it in real-world applications.
2020
TCHES
Generic Side-channel attacks on CCA-secure lattice-based PKE and KEMs 📺
In this work, we demonstrate generic and practical EM side-channel assisted chosen ciphertext attacks over multiple LWE/LWR-based Public Key Encryption (PKE) and Key Encapsulation Mechanisms (KEM) secure in the chosen ciphertext model (IND-CCA security). We show that the EM side-channel information can be efficiently utilized to instantiate a plaintext checking oracle, which provides binary information about the output of decryption, typically concealed within IND-CCA secure PKE/KEMs, thereby enabling our attacks. Firstly, we identified EM-based side-channel vulnerabilities in the error correcting codes (ECC) enabling us to distinguish based on the value/validity of decrypted codewords. We also identified similar vulnerabilities in the Fujisaki-Okamoto transform which leaks information about decrypted messages applicable to schemes that do not use ECC. We subsequently exploit these vulnerabilities to demonstrate practical attacks applicable to six CCA-secure lattice-based PKE/KEMs competing in the second round of the NIST standardization process. We perform experimental validation of our attacks on implementations taken from the open-source pqm4 library, running on the ARM Cortex-M4 microcontroller. Our attacks lead to complete key-recovery in a matter of minutes on all the targeted schemes, thus showing the effectiveness of our attack.
2020
TCHES
High-speed Instruction-set Coprocessor for Lattice-based Key Encapsulation Mechanism: Saber in Hardware 📺
In this paper, we present an instruction set coprocessor architecture for lattice-based cryptography and implement the module lattice-based post-quantum key encapsulation mechanism (KEM) Saber as a case study. To achieve fast computation time, the architecture is fully implemented in hardware, including CCA transformations. Since polynomial multiplication plays a performance-critical role in the module and ideal lattice-based public-key cryptography, a parallel polynomial multiplier architecture is proposed that overcomes memory access bottlenecks and results in a highly parallel yet simple and easy-to-scale design. Such multipliers can compute a full multiplication in 256 cycles, but are designed to target any area/performance trade-offs. Besides optimizing polynomial multiplication, we make important design decisions and perform architectural optimizations to reduce the overall cycle counts as well as improve resource utilization. For the module dimension 3 (security comparable to AES-192), the coprocessor computes CCA key generation, encapsulation, and decapsulation in only 5,453, 6,618 and 8,034 cycles respectively, making it the fastest hardware implementation of Saber to our knowledge. On a Xilinx UltraScale+ XCZU9EG-2FFVB1156 FPGA, the entire instruction set coprocessor architecture runs at 250 MHz clock frequency and consumes 23,686 LUTs, 9,805 FFs, and 2 BRAM tiles (including 5,113 LUTs and 3,068 FFs for the Keccak core).
2020
TCHES
High-Speed Masking for Polynomial Comparison in Lattice-based KEMs 📺
With the NIST post-quantum standardization competition entering the second round, the interest in practical implementation results of the remaining NIST candidates is steadily growing. Especially implementations on embedded devices are often not protected against side-channel attacks, such as differential power analysis. In this regard, the application of countermeasures against side-channel attacks to candidates of the NIST standardization process is still an understudied topic. Our work aims to contribute to the NIST competition by enabling a more realistic judgment of the overhead cost introduced by side-channel countermeasures that are applied to lattice-based KEMs that achieve CCA-security based on the Fujisaki-Okamoto transform. We present a novel higher-order masking scheme that enables an efficient comparison of polynomials as previous techniques based on arithmetic-to-Boolean conversions renders this (generally inexpensive) component extremely expensive in the masked case. Our approach has linear complexity in the number of shares compared to quadratic complexity of previous contributions and it applies to lattice based schemes with prime modulus. It comes with a proof in the probing model and an efficient implementation on an ARM Cortex-M4F microcontroller which was defined as a preferred evaluation platform for embedded implementations by NIST. Our algorithm can be executed in only 1.5-2.2 milliseconds on the target platform (depending on the masking order) and is therefore well suited even for lightweight applications. While in previous work, practical side-channel experiments were conducted using only 5,000 - 100,000 power traces, we confirm the absence of first-order leakage in this work by collecting 1 million power traces and applying the t-test methodology.
2020
TCHES
Highly Efficient Architecture of NewHope-NIST on FPGA using Low-Complexity NTT/INTT 📺
NewHope-NIST is a promising ring learning with errors (RLWE)-based postquantum cryptography (PQC) for key encapsulation mechanisms. The performance on the field-programmable gate array (FPGA) affects the applicability of NewHope-NIST. In RLWE-based PQC algorithms, the number theoretic transform (NTT) is one of the most time-consuming operations. In this paper, low-complexity NTT and inverse NTT (INTT) are used to implement highly efficient NewHope-NIST on FPGA. First, both the pre-processing of NTT and the post-processing of INTT are merged into the fast Fourier transform (FFT) algorithm, which reduces N and 2N modular multiplications for N-point NTT and INTT, respectively. Second, a compact butterfly unit and an efficient modular reduction on the modulus 12289 are proposed for the low-complexity NTT/INTT architecture, which achieves an improvement of approximately 3× in the area time product (ATP) compared with the results of the state-of-the-art designs. Finally, a highly efficient architecture with doubled bandwidth and timing hiding for NewHope-NIST is presented. The implementation results on an FPGA show that our design is at least 2.5× faster and has 4.9× smaller ATP compared with the results of the state-of-the-art designs of NewHope-NIST on similar platforms.
2020
TCHES
Improving the Performance of the Picnic Signature Scheme 📺
Picnic is a digital signature algorithm designed to provide security against attacks by quantum computers. The design uses only symmetric-key primitives, and is an efficient instantiation of the MPC-in-the-head paradigm. In this work, we explore the Picnic design in great detail. We investigate and benchmark different parameter choices and show that there exist better parameter choices than those in the current specification. We also present improvements to the MPC protocol that shorten signatures and reduce signing time. The proposed MPC changes tailor the protocol to the circuit of interest in Picnic, but may also be of independent interest. Taken together, these changes give a new instantiation of Picnic that signs messages 7.9 to 13.9 times faster, and verifies signatures 4.5 to 5.5 times faster than the existing “Picnic2” design, while having nearly the same signature sizes.
2020
TCHES
Investigating Profiled Side-Channel Attacks Against the DES Key Schedule 📺
Recent publications describe profiled single trace side-channel attacks (SCAs) against the DES key-schedule of a “commercially available security controller”. They report a significant reduction of the average remaining entropy of cryptographic keys after the attack, with surprisingly large, key-dependent variations of attack results, and individual cases with remaining key entropies as low as a few bits. Unfortunately, they leave important questions unanswered: Are the reported wide distributions of results plausible - can this be explained? Are the results device-specific or more generally applicable to other devices? What is the actual impact on the security of 3-key triple DES? We systematically answer those and several other questions by analyzing two commercial security controllers and a general purpose microcontroller. We observe a significant overall reduction and, importantly, also observe a large key-dependent variation in single DES key security levels, i.e. 49.4 bit mean and 0.9 % of keys < 40 bit (first investigated security controller; other results similar). We also observe a small fraction of keys with exceptionally low security levels that can be called weak keys. It is unclear, whether a device’s side-channel security should be assessed based on such rare weak key outliers. We generalize results to other leakage models by attacking the hardware DES accelerator of a general purpose microcontroller exhibiting a different leakage model. A highly simplified leakage simulation also confirms the wide distribution and shows that security levels are predictable to some extent. Through extensive investigations we find that the actual weakness of keys mainly stems from the specific switching noise they cause. Based on our investigations we expect that widely distributed results and weak outliers should be expected for all profiled attacks against (insufficiently protected) key-schedules, regardless of the algorithm and specific implementation. Finally, we describe a sound approach to estimate actual 3-key triple-DES security levels from empirical single DES results and find that the impact on the security of 3-key triple-DES is limited, i.e. 96.1 bit mean and 0.24 % of key-triples < 80 bit for the same security controller.
2020
TCHES
ISA Extensions for Finite Field Arithmetic: Accelerating Kyber and NewHope on RISC-V 📺
We present and evaluate a custom extension to the RISC-V instruction set for finite field arithmetic. The result serves as a very compact approach to software-hardware co-design of PQC implementations in the context of small embedded processors such as smartcards. The extension provides instructions that implement finite field operations with subsequent reduction of the result. As small finite fields are used in various PQC schemes, such instructions can provide a considerable speedup for an otherwise software-based implementation. Furthermore, we create a prototype implementation of the presented instructions for the extendable VexRiscv core, integrate the result into a chip design, and evaluate the design on two different FPGA platforms. The effectiveness of the extension is evaluated by using the instructions to optimize the Kyber and NewHope key-encapsulation schemes. To that end, we also present an optimized software implementation for the standard RISC-V instruction set for the polynomial arithmetic underlying those schemes, which serves as basis for comparison. Both variants are tuned on an assembler level to optimally use the processor pipelines of contemporary RISC-V CPUs. The result shows a speedup for the polynomial arithmetic of up to 85% over the basic software implementation. Using the custom instructions drastically reduces the code and data size of the implementation without introducing runtime-performance penalties at a small cost in circuit size. When used in the selected schemes, the custom instructions can be used to replace a full general purpose multiplier to achieve very compact implementations.
2020
TCHES
JackHammer: Efficient Rowhammer on Heterogeneous FPGA-CPU Platforms 📺
After years of development, FPGAs are finally making an appearance on multi-tenant cloud servers. Heterogeneous FPGA-CPU microarchitectures require reassessment of common assumptions about isolation and security boundaries, as they introduce new attack vectors and vulnerabilities. In this work, we analyze the memory and cache subsystem and study Rowhammer and cache attacks enabled by two proposed heterogeneous FPGA-CPU platforms from Intel: the Arria 10 GX with an integrated FPGA-CPU platform, and the Arria 10 GX PAC expansion card which connects the FPGA to the CPU via the PCIe interface. We demonstrate JackHammer, a novel, efficient, and stealthy Rowhammer from the FPGA to the host’s main memory. Our results indicate that a malicious FPGA can perform twice as fast as a typical Rowhammer from the CPU on the same system and causes around four times as many bit flips as the CPU attack. We demonstrate the efficacy of JackHammer from the FPGA through a realistic fault attack on the WolfSSL RSA signing implementation that reliably causes a fault after an average of fifty-eight RSA signatures, 25% faster than a CPU Rowhammer. In some scenarios our JackHammer attack produces faulty signatures more than three times more often and almost three times faster than a conventional CPU Rowhammer. Finally, we systematically analyze new cache attacks in these environments following demonstration of a cache covert channel across FPGA and CPU.
2020
TCHES
Keep it Unsupervised: Horizontal Attacks Meet Deep Learning 📺
To mitigate side-channel attacks, real-world implementations of public-key cryptosystems adopt state-of-the-art countermeasures based on randomization of the private or ephemeral keys. Usually, for each private key operation, a “scalar blinding” is performed using 32 or 64 randomly generated bits. Nevertheless, horizontal attacks based on a single trace still pose serious threats to protected ECC or RSA implementations. If the secrets learned through a single-trace attack contain too many wrong (or noisy) bits, the cryptanalysis methods for recovering remaining bits become impractical due to time and computational constraints. This paper proposes a deep learning-based framework to iteratively correct partially correct private keys resulting from a clustering-based horizontal attack. By testing the trained network on scalar multiplication (or exponentiation) traces, we demonstrate that a deep neural network can significantly reduce the number of wrong bits from randomized scalars (or exponents).When a simple horizontal attack can recover around 52% of attacked multiple private key bits, the proposed iterative framework improves the private key accuracy to above 90% on average and to 100% for at least one of the attacked keys. Our attack model remains fully unsupervised and excludes the need to know where the error or noisy bits are located in each separate randomized private key.
2020
TCHES
Low-Latency Hardware Masking with Application to AES 📺
During the past two decades there has been a great deal of research published on masked hardware implementations of AES and other cryptographic primitives. Unfortunately, many hardware masking techniques can lead to increased latency compared to unprotected circuits for algorithms such as AES, due to the high-degree of nonlinear functions in their designs. In this paper, we present a hardware masking technique which does not increase the latency for such algorithms. It is based on the LUT-based Masked Dual-Rail with Pre-charge Logic (LMDPL) technique presented at CHES 2014. First, we show 1-glitch extended strong noninterference of a nonlinear LMDPL gadget under the 1-glitch extended probing model. We then use this knowledge to design an AES implementation which computes a full AES-128 operation in 10 cycles and a full AES-256 operation in 14 cycles. We perform practical side-channel analysis of our implementation using the Test Vector Leakage Assessment (TVLA) methodology and analyze univariate as well as bivariate t-statistics to demonstrate its DPA resistance level.
2020
TCHES
Minerva: The curse of ECDSA nonces Systematic analysis of lattice attacks on noisy leakage of bit-length of ECDSA nonces 📺
Best Paper CHES 2020
We present our discovery of a group of side-channel vulnerabilities in implementations of the ECDSA signature algorithm in a widely used Atmel AT90SC FIPS 140-2 certified smartcard chip and five cryptographic libraries (libgcrypt, wolfSSL, MatrixSSL, SunEC/OpenJDK/Oracle JDK, Crypto++). Vulnerable implementations leak the bit-length of the scalar used in scalar multiplication via timing. Using leaked bit-length, we mount a lattice attack on a 256-bit curve, after observing enough signing operations. We propose two new methods to recover the full private key requiring just 500 signatures for simulated leakage data, 1200 for real cryptographic library data, and 2100 for smartcard data. The number of signatures needed for a successful attack depends on the chosen method and its parameters as well as on the noise profile, influenced by the type of leakage and used computation platform. We use the set of vulnerabilities reported in this paper, together with the recently published TPM-FAIL vulnerability [MSE+20] as a basis for real-world benchmark datasets to systematically compare our newly proposed methods and all previously published applicable lattice-based key recovery methods. The resulting exhaustive comparison highlights the methods’ sensitivity to its proper parametrization and demonstrates that our methods are more efficient in most cases. For the TPM-FAIL dataset, we decreased the number of required signatures from approximately 40 000 to mere 900.
2020
TCHES
Modeling Soft Analytical Side-Channel Attacks from a Coding Theory Viewpoint 📺
One important open question in side-channel analysis is to find out whether all the leakage samples in an implementation can be exploited by an adversary, as suggested by masking security proofs. For attacks exploiting a divide-and-conquer strategy, the answer is negative: only the leakages corresponding to the first/last rounds of a block cipher can be exploited. Soft Analytical Side-Channel Attacks (SASCA) have been introduced as a powerful solution to mitigate this limitation. They represent the target implementation and its leakages as a code (similar to a Low Density Parity Check code) that is decoded thanks to belief propagation. Previous works have shown the low data complexities that SASCA can reach in practice. In this paper, we revisit these attacks by modeling them with a variation of the Random Probing Model used in masking security proofs, that we denote as the Local Random Probing Model (LRPM). Our study establishes interesting connections between this model and the erasure channel used in coding theory, leading to the following benefits. First, the LRPM allows bounding the security of concrete implementations against SASCA in a fast and intuitive manner. We use it in order to confirm that the leakage of any operation in a block cipher can be exploited, although the leakages of external operations dominate in known-plaintext/ciphertext attack scenarios. Second, we show that the LRPM is a tool of choice for the (nearly worst-case) analysis of masked implementations in the noisy leakage model, taking advantage of all the operations performed, and leading to new tradeoffs between their amount of randomness and physical noise level. Third, we show that it can considerably speed up the evaluation of other countermeasures such as shuffling.
2020
TCHES
On the Security Goals of White-Box Cryptography 📺
We discuss existing and new security notions for white-box cryptography and comment on their suitability for Digital Rights Management and Mobile Payment Applications, the two prevalent use-cases of white-box cryptography. In particular, we put forward indistinguishability for white-box cryptography with hardware-binding (IND-WHW) as a new security notion that we deem central. We also discuss the security property of application-binding and explain the issues faced when defining it as a formal security notion. Based on our proposed notion for hardware-binding, we describe a possible white-box competition setup which assesses white-box implementations w.r.t. hardware-binding. Our proposed competition setup allows us to capture hardware-binding in a practically meaningful way.While some symmetric encryption schemes have been proven to admit plain white-box implementations, we show that not all secure symmetric encryption schemes are white-boxeable in the plain white-box attack scenario, i.e., without hardware-binding. Thus, even strong assumptions such as indistinguishability obfuscation cannot be used to provide secure white-box implementations for arbitrary ciphers. Perhaps surprisingly, our impossibility result does not carry over to the hardware-bound scenario. In particular, Alpirez Bock, Brzuska, Fischlin, Janson and Michiels (ePrint 2019/1014) proved a rather general feasibility result in the hardware-bound model. Equally important, the apparent theoretical distinction between the plain white-box model and the hardware-bound white-box model also translates into practically reduced attack capabilities as we explain in this paper.
2020
TCHES
On the spectral features of robust probing security 📺
In this work we provide a spectral formalization of non-interference in the presence of glitches. Our goal is to present new theoretical and practical tools to reason about robust-d-probing security. We show that the current understanding of extended probes lends itself to probes that participate, during gadget composition, to the creation of additional extended probes. In turn, this enables a natural extension of non-interference definitions into robust ones to build a new reasoning framework that can formally explain some semi-formal results already appeared in the past and be used to synthesize new robust-d-SNI gadgets.
2020
TCHES
Parameterized Hardware Accelerators for Lattice-Based Cryptography and Their Application to the HW/SW Co-Design of qTESLA 📺
This paper presents a set of efficient and parameterized hardware accelerators that target post-quantum lattice-based cryptographic schemes, including a versatile cSHAKE core, a binary-search CDT-based Gaussian sampler, and a pipelined NTT-based polynomial multiplier, among others. Unlike much of prior work, the accelerators are fully open-sourced, are designed to be constant-time, and can be parameterized at compile-time to support different parameters without the need for re-writing the hardware implementation. These flexible, publicly-available accelerators are leveraged to demonstrate the first hardware-software co-design using RISC-V of the post-quantum lattice-based signature scheme qTESLA with provably secure parameters. In particular, this work demonstrates that the NIST’s Round 2 level 1 and level 3 qTESLA variants achieve over a 40-100x speedup for key generation, about a 10x speedup for signing, and about a 16x speedup for verification, compared to the baseline RISC-V software-only implementation. For instance, this corresponds to execution in 7.7, 34.4, and 7.8 milliseconds for key generation, signing, and verification, respectively, for qTESLA’s level 1 parameter set on an Artix-7 FPGA, demonstrating the feasibility of the scheme for embedded applications.
2020
TCHES
Persistent Fault Attack in Practice 📺
Persistence fault analysis (PFA) is a novel fault analysis technique proposed in CHES 2018 and demonstrated with rowhammer-based fault injections. However, whether such analysis can be applied to traditional fault attack scenario, together with its difficulty in practice, has not been carefully investigated. For the first time, a persistent fault attack is conducted on an unprotected AES implemented on ATmega163L microcontroller in this paper. Several critical challenges are solved with our new improvements, including (1) how to decide whether the fault is injected in SBox; (2) how to use the maximum likelihood estimation to pursue the minimum number of ciphertexts; (3) how to utilize the unknown fault in SBox to extract the key. Our experiments show that: to break AES with physical laser injections despite all these challenges, the minimum and average number of required ciphertexts are 926 and 1641, respectively. It is about 38% and 28% reductions of the ciphertexts required in comparison to 1493 and 2273 in previous work where both fault value and location have to be known. Furthermore, our analysis is extended to the PRESENT cipher. By applying the persistent fault analysis to the penultimate round, the full PRESENT key of 80 bits can be recovered. Eventually, an experimental validation is performed to confirm the accuracy of our attack with more insights. This paper solves the challenges in most aspects of practice and also demonstrates the feasibility and universality of PFA on SPN block ciphers.
2020
TCHES
Plaintext: A Missing Feature for Enhancing the Power of Deep Learning in Side-Channel Analysis? Breaking multiple layers of side-channel countermeasures 📺
Deep learning (DL) has proven to be very effective for image recognition tasks, with a large body of research on various model architectures for object classification. Straight-forward application of DL to side-channel analysis (SCA) has already shown promising success, with experimentation on open-source variable key datasets showing that secret keys can be revealed with 100s traces even in the presence of countermeasures. This paper aims to further improve the application of DL for SCA, by enhancing the power of DL when targeting the secret key of cryptographic algorithms when protected with SCA countermeasures. We propose a new model, CNN-based model with Plaintext feature extension (CNNP) together with multiple convolutional filter kernel sizes and structures with deeper and narrower neural networks, which has empirically proven its effectiveness by outperforming reference profiling attack methods such as template attacks (TAs), convolutional neural networks (CNNs) and multilayer perceptron (MLP) models. Our model generates state-of-the art results when attacking the ASCAD variable-key database, which has a restricted number of training traces per key, recovering the key within 40 attack traces in comparison with order of 100s traces required by straightforward machine learning (ML) application. During the profiling stage an attacker needs no additional knowledge on the implementation, such as the masking scheme or random mask values, only the ability to record the power consumption or electromagnetic field traces, plaintext/ciphertext and the key. Additionally, no heuristic pre-processing is required in order to break the high-order masking countermeasures of the target implementation.
2020
TCHES
Polynomial Multiplication in NTRU Prime: Comparison of Optimization Strategies on Cortex-M4 📺
This paper proposes two different methods to perform NTT-based polynomial multiplication in polynomial rings that do not naturally support such a multiplication. We demonstrate these methods on the NTRU Prime key-encapsulation mechanism (KEM) proposed by Bernstein, Chuengsatiansup, Lange, and Vredendaal, which uses a polynomial ring that is, by design, not amenable to use with NTT. One of our approaches is using Good’s trick and focuses on speed and supporting more than one parameter set with a single implementation. The other approach is using a mixed radix NTT and focuses on the use of smaller multipliers and less memory. On a ARM Cortex-M4 microcontroller, we show that our three NTT-based implementations, one based on Good’s trick and two mixed radix NTTs, provide between 32% and 17% faster polynomial multiplication. For the parameter-set ntrulpr761, this results in between 16% and 9% faster total operations (sum of key generation, encapsulation, and decapsulation) and requires between 15% and 39% less memory than the current state-of-the-art NTRU Prime implementation on this platform, which is using Toom-Cook-based polynomial multiplication.
2020
TCHES
Protecting against Statistical Ineffective Fault Attacks 📺
Statistical Ineffective Fault Attacks (SIFA) pose a threat for many practical implementations of symmetric primitives. Countermeasures against both power analysis and fault attacks typically do not prevent straightforward SIFA attacks, which require only very limited knowledge about the concrete implementation. Therefore, the exploration of countermeasures against SIFA that do not rely on protocols or physical protection mechanisms is of great interest. In this paper, we describe different countermeasure strategies against SIFA. First, we introduce an abstraction layer between the algorithmic specification of a cipher and its implementation in hardware or software to study and describe resistance against SIFA. We then show that by basing the masked implementation on permutations as building blocks, we can build circuits that withstand single-fault SIFA and DPA attacks. We show how this approach can be applied to 3-bit, 4-bit, and 5-bit S-boxes and the AES S-box. Additionally, we present a strategy based on fine-grained fault detection suitable for protecting any circuit against SIFA attacks. Although this approach may lead to a higher implementation cost due to the fine-grained detection needed, it can be used to protect arbitrary circuits and can be generalized to cover multi-fault SIFA. For single-fault SIFA protection, our countermeasures only have a small computational overhead compared to a simple combination of masking and duplication.
2020
TCHES
Ranking Loss: Maximizing the Success Rate in Deep Learning Side-Channel Analysis 📺
The side-channel community recently investigated a new approach, based on deep learning, to significantly improve profiled attacks against embedded systems. Compared to template attacks, deep learning techniques can deal with protected implementations, such as masking or desynchronization, without substantial preprocessing. However, important issues are still open. One challenging problem is to adapt the methods classically used in the machine learning field (e.g. loss function, performance metrics) to the specific side-channel context in order to obtain optimal results. We propose a new loss function derived from the learning to rank approach that helps preventing approximation and estimation errors, induced by the classical cross-entropy loss. We theoretically demonstrate that this new function, called Ranking Loss (RkL), maximizes the success rate by minimizing the ranking error of the secret key in comparison with all other hypotheses. The resulting model converges towards the optimal distinguisher when considering the mutual information between the secret and the leakage. Consequently, the approximation error is prevented. Furthermore, the estimation error, induced by the cross-entropy, is reduced by up to 23%. When the ranking loss is used, the convergence towards the best solution is up to 23% faster than a model using the cross-entropy loss function. We validate our theoretical propositions on public datasets.
2020
TCHES
Rapidly Verifiable XMSS Signatures 📺
This work presents new speed records for XMSS (RFC 8391) signature verification on embedded devices. For this we make use of a probabilistic method recently proposed by Perin, Zambonin, Martins, Custódio, and Martina (PZMCM) at ISCC 2018, that changes the XMSS signing algorithm to search for rapidly verifiable signatures. We improve the method, ensuring that the added signing cost for the search is independent of the message length. We provide a statistical analysis of the resulting verification speed and support it by experiments. We present a record setting RFC compatible implementation of XMSS verification on the ARM Cortex-M4. At a signing time of about one minute on a general purpose CPU, we create signatures that are verified about 1.44 times faster than traditionally generated signatures. Adding further well-known implementation optimizations to the verification algorithm we reduce verification time by over a factor two from 13.85 million to 6.56 million cycles. In contrast to previous works, we provide a detailed security analysis of the resulting signature scheme under classical and quantum attacks that justifies our selection of parameters. On the way, we fill a gap in the security analysis of XMSS as described in RFC 8391 proving that the modified message hashing in the RFC does indeed mitigate multi-target attacks. This was not shown before and might be of independent interest.
2020
TCHES
Re-Consolidating First-Order Masking Schemes: Nullifying Fresh Randomness 📺
Application of masking, known as the most robust and reliable countermeasure to side-channel analysis attacks, on various cryptographic algorithms has dedicated a lion’s share of research to itself. The difficulty originates from the fact that the overhead of application of such an algorithmic-level countermeasure might not be affordable. This includes the area- and latency overheads and the amount of fresh randomness required to fulfill the resulting design’s security properties. There are already techniques applicable in hardware platforms that consider glitches into account. Among them, classical threshold implementations force the designers to use at least three shares in the underlying masking. The other schemes, which can deal with two shares, often necessitates the use of fresh randomness.Here, in this work, we present a technique allowing us to use two shares to realize the first-order glitch-extended probing secure masked realization of several functions, including the S-box of Midori, PRESENT, PRINCE, and AES ciphers without any fresh randomness.
2020
TCHES
Redundant Code-based Masking Revisited 📺
Masking schemes are a popular countermeasure against side-channel attacks. To mask bytes, the two classical options are Boolean masking and polynomial masking. The latter lends itself to redundant masking, where leakage emanates from more shares than are strictly necessary to reconstruct, raising the obvious question how well such “redundant” leakage can be exploited by a side-channel adversary. We revisit the recent work by Chabanne et al. (CHES’18) and show that, contrary to their conclusions, said leakage can—in theory—always be exploited. For the Hamming weight scenario in the low-noise regime, we heuristically determine how security degrades in terms of the number of redundant shares for first and second order secure polynomial masking schemes.Furthermore, we leverage a well-established link between linear secret sharing schemes and coding theory to determine when different masking schemes will end up with essentially equivalent leakage profiles. Surprisingly, we conclude that for typical field sizes and security orders, Boolean masking is a special case of polynomial masking. We also identify quasi-Boolean masking schemes as a special class of redundant polynomial masking and point out that the popular “Frobenius-stable” sets of interpolations points typically lead to such quasi-Boolean masking schemes, with subsequent degraded leakage performance.
2020
TCHES
Rejection Sampling Schemes for Extracting Uniform Distribution from Biased PUFs 📺
This paper presents an efficient fuzzy extractor (FE) construction for secure cryptographic key generation from physically unclonable functions (PUFs). The proposed FE, named acceptance-or-rejection (AR)-based FE, utilizes a new debiasing scheme to extract a uniform distribution from a biased PUF response. The proposed debiasing scheme employs the principle of rejection sampling, and can extract a longer debiased bit string compared to those of conventional debiasing schemes. In addition, the proposed AR-based FE is extended to ternary PUF responses (i.e., ternary encoding of a PUF response). These responses can be derived according to cell-wise reliability of the PUF and are promising for extraction of stable and high-entropy responses from common PUFs. The performance of the AR-based Fes is evaluated through an experimental simulation of PUF-based key generation and compared with conventional FEs. We confirm that the proposed AR-based FE can achieve the highest efficiency in terms of PUF and nonvolatile memory (NVM) sizes for various PUF conditions among the conventional counterparts. More precisely, the AR-based FE can realize a 128-bit key generation with up-to 55% smaller PUF size or up-to 72% smaller NVM size than other conventional FEs. In addition, the ternary AR-based FE is up to 55% more efficient than the binary version, and can also achieve up-to 63% higher efficiency than conventional counterparts. Furthermore, we show that the AR-based FE can be applied to PUFs with local biases (e.g., biases depending on cell location in SRAM PUFs), unlike all the conventional schemes, for which only global (or identical) biases are assumed.
2020
TCHES
Remove Some Noise: On Pre-processing of Side-channel Measurements with Autoencoders 📺
In the profiled side-channel analysis, deep learning-based techniques proved to be very successful even when attacking targets protected with countermeasures. Still, there is no guarantee that deep learning attacks will always succeed. Various countermeasures make attacks significantly more complex, and such countermeasures can be further combined to make the attacks even more challenging. An intuitive solution to improve the performance of attacks would be to reduce the effect of countermeasures.This paper investigates whether we can consider certain types of hiding countermeasures as noise and then use a deep learning technique called the denoising autoencoder to remove that noise. We conduct a detailed analysis of six different types of noise and countermeasures separately or combined and show that denoising autoencoder improves the attack performance significantly.
2020
TCHES
Retrofitting Leakage Resilient Authenticated Encryption to Microcontrollers 📺
The security of Internet of Things (IoT) devices relies on fundamental concepts such as cryptographically protected firmware updates. In this context attackers usually have physical access to a device and therefore side-channel attacks have to be considered. This makes the protection of required cryptographic keys and implementations challenging, especially for commercial off-the-shelf (COTS) microcontrollers that typically have no hardware countermeasures. In this work, we demonstrate how unprotected hardware AES engines of COTS microcontrollers can be efficiently protected against side-channel attacks by constructing a leakage resilient pseudo random function (LR-PRF). Using this side-channel protected building block, we implement a leakage resilient authenticated encryption with associated data (AEAD) scheme that enables secured firmware updates. We use concepts from leakage resilience to retrofit side-channel protection on unprotected hardware AES engines by means of software-only modifications. The LR-PRF construction leverages frequent key changes and low data complexity together with key dependent noise from parallel hardware to protect against side-channel attacks. Contrary to most other protection mechanisms such as time-based hiding, no additional true randomness is required. Our concept relies on parallel S-boxes in the AES hardware implementation, a feature that is fortunately present in many microcontrollers as a measure to increase performance. In a case study, we implement the protected AEAD scheme for two popular ARM Cortex-M microcontrollers with differing parallelism. We evaluate the protection capabilities in realistic IoT attack scenarios, where non-invasive EM probes or power consumption measurements are employed by the attacker. We show that the concept provides the side-channel hardening that is required for the long-term security of IoT devices.
2020
TCHES
Revisiting a Methodology for Efficient CNN Architectures in Profiling Attacks 📺
This work provides a critical review of the paper by Zaid et al. titled “Methodology for Efficient CNN Architectures in Profiling attacks”, which was published in TCHES Volume 2020, Issue 1. This work studies the design of CNN networks to perform side-channel analysis of multiple implementations of the AES for embedded devices. Based on the authors’ code and public data sets, we were able to cross-check their results and perform a thorough analysis. We correct multiple misconceptions by carefully inspecting different elements of the model architectures proposed by Zaid et al. First, by providing a better understanding on the internal workings of these models, we can trivially reduce their number of parameters on average by 52%, while maintaining a similar performance. Second, we demonstrate that the convolutional filter’s size is not strictly related to the amount of misalignment in the traces. Third, we show that increasing the filter size and the number of convolutions actually improves the performance of a network. Our work demonstrates once again that reproducibility and review are important pillars of academic research. Therefore, we provide the reader with an online Python notebook which allows to reproduce some of our experiments1 and additional example code is made available on Github.2
2020
TCHES
RISQ-V: Tightly Coupled RISC-V Accelerators for Post-Quantum Cryptography 📺
Empowering electronic devices to support Post-Quantum Cryptography (PQC) is a challenging task. PQC introduces new mathematical elements and operations which are usually not easy to implement on standard processors. Especially for low cost and resource constraint devices, hardware acceleration is usually required. In addition, as the standardization process of PQC is still ongoing, a focus on maintaining flexibility is mandatory. To cope with such requirements, hardware/software co-design techniques have been recently used for developing complex and highly customized PQC solutions. However, while most of the previous works have developed loosely coupled PQC accelerators, the design of tightly coupled accelerators and Instruction Set Architecture (ISA) extensions for PQC have been barely explored. To this end, we present RISQ-V, an enhanced RISC-V architecture that integrates a set of powerful tightly coupled accelerators to speed up lattice-based PQC. RISQ-V efficiently reuses processor resources and reduces the amount of memory accesses. This significantly increases the performance while keeping the silicon area overhead low. We present three contributions. First, we propose a set of powerful hardware accelerators deeply integrated into the RISC-V pipeline. Second, we extended the RISC-V ISA with 29 new instructions to efficiently perform operations for lattice-based cryptography. Third, we implemented our RISQ-V in ASIC technology and on FPGA. We evaluated the performance of NewHope, Kyber, and Saber on RISQ-V. Compared to the pure software implementation on RISC-V, our co-design implementations show a speedup factor of up to 11.4 for NewHope, 9.6 for Kyber, and 2.7 for Saber. For the ASIC implementation, the energy consumption was reduced by factors of up to 9.5 for NewHope, 7.7 for Kyber, and 2.1 for Saber. The cell count of the CPU was increased by a factor of 1.6 compared to the original RISC-V design, which can be considered as a moderate increase for the achieved performance gain.
2020
TCHES
Second-Order Masked Lookup Table Compression Scheme 📺
Masking by lookup table randomisation is a well-known technique used to achieve side-channel attack resistance for software implementations, particularly, against DPA attacks. The randomised table technique for first- and second-order security requires about m•2n bits of RAM to store an (n,m)-bit masked S-box lookup table. Table compression helps in reducing the amount of memory required, and this is useful for highly resource-constrained IoT devices. Recently, Vadnala (CT-RSA 2017) proposed a randomised table compression scheme for first- and second-order security in the probing leakage model. This scheme reduces the RAM memory required by about a factor of 2l, where l is a compression parameter. Vivek (Indocrypt 2017) demonstrated an attack against the second-order scheme of Vadnala. Hence achieving table compression at second and higher orders is an open problem.In this work, we propose a second-order secure randomised table compression scheme which works for any (n,m)-bit S-box. Our proposal is a variant of Vadnala’s scheme that is not only secure but also significantly improves the time-memory trade-off. Specifically, we improve the online execution time by a factor of 2n−l. Our proposed scheme is proved 2-SNI secure in the probing leakage model. We have implemented our method for AES-128 on a 32-bit ARM Cortex processor. We are able to reduce the memory required to store a randomised S-box table for second-order AES-128 implementation to 59 bytes.
2020
TCHES
Side-Channel Analysis of the Xilinx Zynq UltraScale+ Encryption Engine 📺
The Xilinx Zynq UltraScale+ (ZU+) is a powerful and flexible System-on- Chip (SoC) computing platform for next generation applications such as autonomous driving or industrial Internet-of-Things (IoT) based on 16 nm production technology. The devices are equipped with a secure boot mechanism in order to provide confidentiality, integrity, and authenticity of the configuration files that are loaded during power-up. This includes a dedicated encryption engine which features a protocol-based countermeasure against passive Side-Channel Attacks (SCAs) called key rolling. The mechanism ensures that the same key is used only for a certain number of data blocks that has to be defined by the user. However, a suitable choice for the key rolling parameter depends on the power leakage behavior of the chip and is not published by the manufacturer. To close this gap, this paper presents the first publicly known side-channel analysis of the ZU+ encryption unit. We conduct a black-box reverse engineering of the internal hardware architecture of the encryption engine using Electromagnetic (EM) measurements from a decoupling capacitor of the power supply. Then, we illustrate a sophisticated methodology that involves the first five rounds of an AES encryption to attack the 256-bit secret key. We apply the elaborated attack strategy using several new Deep Learning (DL)-based evaluation methods for cryptographic implementations. Even though we are unable to recover all bytes of the secret key, the experimental results still allow us to provide concrete recommendations for the key rolling parameter under realistic conditions. This eventually helps to configure the secure boot mechanism of the ZU+ and similar devices appropriately.
2020
TCHES
Side-Channel Countermeasures’ Dissection and the Limits of Closed Source Security Evaluations 📺
We take advantage of a recently published open source implementation of the AES protected with a mix of countermeasures against side-channel attacks to discuss both the challenges in protecting COTS devices against such attacks and the limitations of closed source security evaluations. The target implementation has been proposed by the French ANSSI (Agence Nationale de la Sécurité des Systèmes d’Information) to stimulate research on the design and evaluation of side-channel secure implementations. It combines additive and multiplicative secret sharings into an affine masking scheme that is additionally mixed with a shuffled execution. Its preliminary leakage assessment did not detect data dependencies with up to 100,000 measurements. We first exhibit the gap between such a preliminary leakage assessment and advanced attacks by demonstrating how a countermeasures’ dissection exploiting a mix of dimensionality reduction, multivariate information extraction and key enumeration can recover the full key with less than 2,000 measurements. We then discuss the relevance of open source evaluations to analyze such implementations efficiently, by pointing out that certain steps of the attack are hard to automate without implementation knowledge (even with machine learning tools), while performing them manually is straightforward. Our findings are not due to design flaws but from the general difficulty to prevent side-channel attacks in COTS devices with limited noise. We anticipate that high security on such devices requires significantly more shares.
2020
TCHES
Single-Trace Attacks on Keccak 📺
Since its selection as the winner of the SHA-3 competition, Keccak, with all its variants, has found a large number of applications. It is, for instance, a common building block in schemes submitted to NIST’s post-quantum cryptography project. In many of these applications, Keccak processes ephemeral secrets. In such a setting, side-channel adversaries are limited to a single observation, meaning that differential attacks are inherently prevented. If, however, such a single trace of Keccak can already be sufficient for key recovery has so far been unknown. In this paper, we change the above by presenting the first single-trace attack targeting Keccak. Our method is based on soft-analytical side-channel attacks and, thus, combines template matching with message passing in a graphical model of the attacked algorithm. As a straight-forward model of Keccak does not yield satisfactory results, we describe several optimizations for the modeling and the message-passing algorithm. Their combination allows attaining high attack performance in terms of both success rate as well as computational runtime. We evaluate our attack assuming generic software (microcontroller) targets and thus use simulations in the generic noisy Hamming-weight leakage model. Hence, we assume relatively modest profiling capabilities of the adversary. Nonetheless, the attack can reliably recover secrets in a large number of evaluated scenarios at realistic noise levels. Consequently, we demonstrate the need for countermeasures even in settings where DPA is not a threat.
2020
TCHES
Splitting the Interpose PUF: A Novel Modeling Attack Strategy 📺
We demonstrate that the Interpose PUF proposed at CHES 2019, an Arbiter PUF-based design for so-called Strong Physical Unclonable Functions (PUFs), can be modeled by novel machine learning strategies up to very substantial sizes and complexities. Our attacks require in the most difficult cases considerable, but realistic, numbers of CRPs, while consuming only moderate computation times, ranging from few seconds to few days. The attacks build on a new divide-and-conquer approach that allows us to model the two building blocks of the Interpose PUF separately. For non-reliability based Machine Learning (ML) attacks, this eventually leads to attack times on (kup, kdown)-Interpose PUFs that are comparable to the ones against max{kup, kdown}-XOR Arbiter PUFs, refuting the original claim that Interpose PUFs could provide security similar to (kdown + kup/2)-XOR Arbiter PUFs (CHES 2019). On the technical side, our novel divide-and-conquer technique might also be useful in analyzing other designs, where XOR Arbiter PUF challenge bits are unknown to the attacker.
2020
TCHES
Strength in Numbers: Improving Generalization with Ensembles in Machine Learning-based Profiled Side-channel Analysis 📺
The adoption of deep neural networks for profiled side-channel attacks provides powerful options for leakage detection and key retrieval of secure products. When training a neural network for side-channel analysis, it is expected that the trained model can implement an approximation function that can detect leaking side-channel samples and, at the same time, be insensible to noisy (or non-leaking) samples. This outlines a generalization situation where the model can identify the main representations learned from the training set in a separate test set.This paper discusses how output class probabilities represent a strong metric when conducting the side-channel analysis. Further, we observe that these output probabilities are sensitive to small changes, like selecting specific test traces or weight initialization for a neural network. Next, we discuss the hyperparameter tuning, where one commonly uses only a single out of dozens of trained models, where each of those models will result in different output probabilities. We show how ensembles of machine learning models based on averaged class probabilities can improve generalization. Our results emphasize that ensembles increase a profiled side-channel attack’s performance and reduce the variance of results stemming from different hyperparameters, regardless of the selected dataset or leakage model.
2020
TCHES
Strengthening Sequential Side-Channel Attacks Through Change Detection 📺
The sequential structure of some side-channel attacks makes them subject to error propagation, i.e. when an error occurs during the recovery of some part of a secret key, all the following guesses might as well be chosen randomly. We propose a methodology that strengthens sequential attacks by automatically identifying and correcting errors. The core ingredient of our methodology is a change-detection test that monitors the distribution of the distinguisher values used to reconstruct the secret key. Our methodology includes an error-correction procedure that can cope both with false positives of the change-detection test, and inaccuracies of the estimated location of the wrong key guess. The proposed methodology is general and can be included in several attacks. As meaningful examples, we conduct two different side-channel attacks against RSA-2048: an horizontal power-analysis attack based on correlation and a vertical timing attack. Our experiments show that, in all the considered cases, strengthened attacks outperforms their original counterparts and alternative solutions that are based on thresholds. In particular, strengthened attacks achieve high success rates even when the side-channel measurements are noisy or limited in number, without prohibitively increasing the computing time.
2020
TCHES
The Area-Latency Symbiosis: Towards Improved Serial Encryption Circuits 📺
The bit-sliding paper of Jean et al. (CHES 2017) showed that the smallest-size circuit for SPN based block ciphers such as AES, SKINNY and PRESENT can be achieved via bit-serial implementations. Their technique decreases the bit size of the datapath and naturally leads to a significant loss in latency (as well as the maximum throughput). Their designs complete a single round of the encryption in 168 (resp. 68) clock cycles for 128 (resp. 64) bit blocks. A follow-up work by Banik et al. (FSE 2020) introduced the swap-and-rotate technique that both eliminates this loss in latency and achieves even smaller footprints.In this paper, we extend these results on bit-serial implementations all the way to four authenticated encryption schemes from NIST LWC. Our first focus is to decrease latency and improve throughput with the use of the swap-and-rotate technique. Our block cipher implementations have the most efficient round operations in the sense that a round function of an n-bit block cipher is computed in exactly n clock cycles. This leads to implementations that are similar in size to the state of the art, but have much lower latency (savings up to 20 percent). We then extend our technique to 4- and 8-bit implementations. Although these results are promising, block ciphers themselves are not end-user primitives, as they need to be used in conjunction with a mode of operation. Hence, in the second part of the paper, we use our serial block ciphers to bootstrap four active NIST authenticated encryption candidates: SUNDAE-GIFT, Romulus, SAEAES and SKINNY-AEAD. In the wake of this effort, we provide the smallest block-cipher-based authenticated encryption circuits known in the literature so far.
2020
TCHES
The design of scalar AES Instruction Set Extensions for RISC-V 📺
Secure, efficient execution of AES is an essential requirement on most computing platforms. Dedicated Instruction Set Extensions (ISEs) are often included for this purpose. RISC-V is a (relatively) new ISA that lacks such a standardized ISE. We survey the state-of-the-art industrial and academic ISEs for AES, implement and evaluate five different ISEs, one of which is novel. We recommend separate ISEs for 32 and 64-bit base architectures, with measured performance improvements for an AES-128 block encryption of 4x and 10x with a hardware cost of 1.1K and 8.2K gates respectively, when compared to a software-only implementation based on use of T-tables. We also explore how the proposed standard bit-manipulation extension to RISC-V can be harnessed for efficient implementation of AES-GCM. Our work supports the ongoing RISC-V cryptography extension standardisation process.
2020
TCHES
The Long and Winding Path to Secure Implementation of GlobalPlatform SCP10 📺
GlobalPlatform (GP) card specifications are defined for smart cards regarding rigorous security requirements. The increasingly more powerful cards within an open ecosystem of multiple players stipulate that asymmetric-key protocols become necessary. In this paper, we analyze SCP10, which is the Secure Channel Protocol (SCP) that relies on RSA for key exchange and authentication. Our findings are twofold. First, we demonstrate several flaws in the design of SCP10. We discuss the scope of the identified flaws by presenting several attack scenarios in which a malicious attacker can recover all the messages protected by SCP10. We provide a full implementation of these attacks. For instance, an attacker can get the freshly generated session keys in less than three hours. Second, we propose a secure implementation of SCP10 and discuss how it can mitigate the discovered flaws. Finally, we measure the overhead incurred by the implemented countermeasures.
2020
TCHES
Time-memory trade-off in Toom-Cook multiplication: an application to module-lattice based cryptography 📺
Since the introduction of the ring-learning with errors problem, the number theoretic transform (NTT) based polynomial multiplication algorithm has been studied extensively. Due to its faster quasilinear time complexity, it has been the preferred choice of cryptographers to realize ring-learning with errors cryptographic schemes. Compared to NTT, Toom-Cook or Karatsuba based polynomial multiplication algorithms, though being known for a long time, still have a fledgling presence in the context of post-quantum cryptography.In this work, we observe that the pre- and post-processing steps in Toom-Cook based multiplications can be expressed as linear transformations. Based on this observation we propose two novel techniques that can increase the efficiency of Toom-Cook based polynomial multiplications. Evaluation is reduced by a factor of 2, and we call this method precomputation, and interpolation is reduced from quadratic to linear, and we call this method lazy interpolation.As a practical application, we applied our algorithms to the Saber post-quantum key-encapsulation mechanism. We discuss in detail the various implementation aspects of applying our algorithms to Saber. We show that our algorithm can improve the efficiency of the computationally costly matrix-vector multiplication by 12−37% compared to previous methods on their respective platforms. Secondly, we propose different methods to reduce the memory footprint of Saber for Cortex-M4 microcontrollers. Our implementation shows between 2.6 and 5.7 KB reduction in the memory usage with respect to the smallest implementation in the literature.
2020
TCHES
Understanding Screaming Channels: From a Detailed Analysis to Improved Attacks 📺
Recently, some wireless devices have been found vulnerable to a novel class of side-channel attacks, called Screaming Channels. These leaks might appear if the sensitive leaks from the processor are unintentionally broadcast by a radio transmitter placed on the same chip. Previous work focuses on identifying the root causes, and on mounting an attack at a distance considerably larger than the one achievable with conventional electromagnetic side channels, which was demonstrated in the low-noise environment of an anechoic chamber. However, a detailed understanding of the leak, attacks that take full advantage of the novel vector, and security evaluations in more practical scenarios are still missing. In this paper, we conduct a thorough experimental analysis of the peculiar properties of Screaming Channels. For example, we learn about the coexistence of intended and unintended data, the role of distance and other parameters on the strength of the leak, the distortion of the leakmodel, and the portability of the profiles. With such insights, we build better attacks. We profile a device connected via cable with 10000·500 traces. Then, 5 months later, we attack a different instance at 15m in an office environment. We recover the AES-128 key with 5000·1000 traces and key enumeration up to 223. Leveraging spatial diversity, we mount some attacks in the presence of obstacles. As a first example of application to a real system, we show a proof-of-concept attack against the authentication method of Google Eddystone beacons. On the one side, this work lowers the bar for more realistic attacks, highlighting the importance of the novel attack vector. On the other side, it provides a broader security evaluation of the leaks, helping the defender and radio designers to evaluate risk, and the need of countermeasures.
2020
TCHES
Unrolled Cryptography on Silicon: A Physical Security Analysis 📺
Cryptographic primitives with low-latency performance have gained momentum lately due to an increased demand for real-time applications. Block ciphers such as PRINCE enable data encryption (resp. decryption) within a single clock cycle at a moderately high operating frequency when implemented in a fully-unrolled fashion. Unsurprisingly, many typical environments for unrolled ciphers require protection against physical adversaries as well. Yet, recent works suggest that most common SCA countermeasures are hard to apply to low-latency circuits. Hardware masking, for example, requires register stages to offer resistance, thus adding delay and defeating the purpose of unrolling. On another note, it has been indicated that unrolled primitives without any additional means of protection offer an intrinsic resistance to SCA attacks due to their parallelism, asynchronicity and speed of execution. In this work, we take a closer look at the physical security properties provided by unrolled cryptographic IC implementations. We are able to confirm that the nature of unrolling indeed bears the potential to decrease the susceptibility of cipher implementations significantly when reset methods are applied. With respect to certain adversarial models, e.g., ciphertext-only access, an amazingly high level of protection can be achieved. While this seems to be a great result for cryptographic hardware engineers, there is an attack vector hidden in plain sight which still threatens the security of unrolled implementations remarkably – namely the static power consumption of CMOS-based circuits. We point out that essentially all reasons which make it hard to extract meaningful information from the dynamic behavior of unrolled primitives are not an issue when exploiting the static currents for key recovery. Our evaluation is based on real-silicon measurements of an unrolled PRINCE core in a custom 40nm ASIC. The presented results serve as a neat educational case study to demonstrate the broad differences between dynamic and static power information leakage in the light of technological advancement.
2020
TCHES
When one vulnerable primitive turns viral: Novel single-trace attacks on ECDSA and RSA 📺
Microarchitecture based side-channel attacks are common threats nowadays. Intel SGX technology provides a strong isolation from an adversarial OS, however, does not guarantee protection against side-channel attacks. In this paper, we analyze the security of the mbedTLS binary GCD algorithm, an implementation that offers interesting challenges when compared for example with OpenSSL, due to the usage of very tight loops in the former. Using practical experiments we demonstrate the mbedTLS binary GCD implementation is vulnerable to side-channel analysis using the SGX-Step framework against mbedTLS based SGX enclaves.We analyze the security of some use cases of this algorithm in this library, resulting in the discovery of a new vulnerability in the ECDSA code path that allows a single-trace attack against this implementation. This vulnerability is three-fold interesting: It resides in the implementation of a countermeasure which makes it more dangerous due to the false state of security the countermeasure currently offers. It reduces mbedTLS ECDSA security to an integer factorization problem. An unexpected GCD call inside the ECDSA code path compromises the countermeasure. We also cover an orthogonal use case, this time inside the mbedTLS RSA code path during the computation of a CRT parameter when loading a private key. The attack also exploits the binary GCD implementation threat, showing how a single vulnerable primitive leads to multiple vulnerabilities. We demonstrate both security threats with end-to-end attacks using 1000 trials each, showing in both cases single-trace attacks can be achieved with success rates very close to 100%.