International Association for Cryptologic Research

International Association
for Cryptologic Research

CryptoDB

Papers from Transactions on Cryptographic Hardware and Embedded Systems 2018

Year
Venue
Title
2018
TCHES
A Cautionary Note When Looking for a Truly Reconfigurable Resistive RAM PUF 📺
The reconfigurable physically unclonable function (PUF) is an advanced security hardware primitive, suitable for applications requiring key renewal or similar refresh functions. The Oxygen vacancies-based resistive RAM (RRAM), has been claimed to be a physically reconfigurable PUF due to its intrinsic switching variability. This paper first analyzes and compares various previously published RRAM-based PUFs with a physics-based RRAM model. We next discuss their possible reconfigurability assuming an ideal configuration-to-configuration behavior. The RRAM-to-RRAM variability, which mainly originates from a variable number of unremovable vacancies inside the RRAM filament, however, has been observed to have significant impact on the reconfigurability. We show by quantitative analysis on the clear uniqueness degradation from the ideal situation in all the discussed implementations. Thus we conclude that true reconfigurability with RRAM PUFs might be unachievable due to this physical phenomena.
2018
TCHES
Attacking GlobalPlatform SCP02-compliant Smart Cards Using a Padding Oracle Attack 📺
We describe in this paper how to perform a padding oracle attack against the GlobalPlatform SCP02 protocol. SCP02 is implemented in smart cards and used by transport companies, in the banking world and by mobile network operators (UICC/SIM cards). The attack allows an adversary to efficiently retrieve plaintext bytes from an encrypted data field. We provide results of our experiments done with 10 smart cards from six different card manufacturers, and show that, in our experimental setting, the attack is fully practical. Given that billions SIM cards are produced every year, the number of affected cards, although difficult to estimate, is potentially high. To the best of our knowledge, this is the first successful attack against SCP02.
2018
TCHES
Beetle Family of Lightweight and Secure Authenticated Encryption Ciphers 📺
This paper presents a lightweight, sponge-based authenticated encryption (AE) family called Beetle. When instantiated with the PHOTON permutation from CRYPTO 2011, Beetle achieves the smallest footprint—consuming only a few more than 600 LUTs on FPGA while maintaining 64-bit security. This figure is significantly smaller than all known lightweight AE candidates which consume more than 1,000 LUTs, including the latest COFB-AES from CHES 2017. In order to realize such small hardware implementation, we equip Beetle with an “extremely tight” bound of security. The trick is to use combined feedback to create a difference between the cipher text block and the rate part of the next feedback (in traditional sponge these two values are the same). Then we are able to show that Beetle is provably secure up to min{c − log r, b/2, r} bits, where b is the permutation size and r and c are parameters called rate and capacity, respectively. The tight security bound allows us to select the smallest security parameters, which in turn result in the smallest footprint.
2018
TCHES
CacheQuote: Efficiently Recovering Long-term Secrets of SGX EPID via Cache Attacks 📺
Intel Software Guard Extensions (SGX) allows users to perform secure computation on platforms that run untrusted software. To validate that the computation is correctly initialized and that it executes on trusted hardware, SGX supports attestation providers that can vouch for the user’s computation. Communication with these attestation providers is based on the Extended Privacy ID (EPID) protocol, which not only validates the computation but is also designed to maintain the user’s privacy. In particular, EPID is designed to ensure that the attestation provider is unable to identify the host on which the computation executes. In this work we investigate the security of the Intel implementation of the EPID protocol. We identify an implementation weakness that leaks information via a cache side channel. We show that a malicious attestation provider can use the leaked information to break the unlinkability guarantees of EPID. We analyze the leaked information using a lattice-based approach for solving the hidden number problem, which we adapt to the zero-knowledge proof in the EPID scheme, extending prior attacks on signature schemes.
2018
TCHES
Cold Boot Attacks on Ring and Module LWE Keys Under the NTT
Best Paper at CHES 2019
In this work, we consider the ring- and module- variants of the LWE problem and investigate cold boot attacks on cryptographic schemes based on these problems, wherein an attacker is faced with the problem of recovering a scheme’s secret key from a noisy version of that key. The leakage resilience of cryptography based on the learning with errors (LWE) problem has been studied before, but there are only limited results considering the parameters observed in cold boot attack scenarios. There are two main encodings for storing ring- and module-LWE keys, and, as we show, the performance of cold boot attacks can be highly sensitive to the exact encoding used. The first encoding stores polynomial coefficients directly in memory. The second encoding performs a number theoretic transform (NTT) before storing the key, a commonly used method leading to more efficient implementations. We first give estimates for a cold boot attack complexity on the first encoding method based on standard algorithms; this analysis confirms that this encoding method is vulnerable to cold boot attacks only at very low bit-flip rates. We then show that, for the second encoding method, the structure introduced by using an NTT is exploitable in the cold boot setting: we develop a bespoke attack strategy that is much cheaper than our estimates for the first encoding when considering module-LWE keys. For example, at a 1% bit-flip rate (which corresponds roughly to what can be achieved in practice for cold boot attacks when applying cooling), a cold boot attack on Kyber KEM parameters has a cost of 243 operations when the second, NTT-based encoding is used for key storage, compared to 270 operations with the first encoding. On the other hand, in the case of the ring-LWE-based KEM, New Hope, the cold boot attack complexities are similar for both encoding methods.
2018
TCHES
Composable Masking Schemes in the Presence of Physical Defaults & the Robust Probing Model
Composability and robustness against physical defaults (e.g., glitches) are two highly desirable properties for secure implementations of masking schemes. While tools exist to guarantee them separately, no current formalism enables their joint investigation. In this paper, we solve this issue by introducing a new model, the robust probing model, that is naturally suited to capture the combination of these properties. We first motivate this formalism by analyzing the excellent robustness and low randomness requirements of first-order threshold implementations, and highlighting the difficulty to extend them to higher orders. Next, and most importantly, we use our theory to design and prove the first higher-order secure, robust and composable multiplication gadgets. While admittedly inspired by existing approaches to masking (e.g., Ishai-Sahai-Wagner-like, threshold, domain-oriented), these gadgets exhibit subtle implementation differences with these state-of-the-art solutions (none of which being provably composable and robust). Hence, our results illustrate how sound theoretical models can guide practically-relevant implementations.
2018
TCHES
CRYSTALS-Dilithium: A Lattice-Based Digital Signature Scheme
In this paper, we present the lattice-based signature scheme Dilithium, which is a component of the CRYSTALS (Cryptographic Suite for Algebraic Lattices) suite that was submitted to NIST’s call for post-quantum cryptographic standards. The design of the scheme avoids all uses of discrete Gaussian sampling and is easily implementable in constant-time. For the same security levels, our scheme has a public key that is 2.5X smaller than the previously most efficient lattice-based schemes that did not use Gaussians, while having essentially the same signature size. In addition to the new design, we significantly improve the running time of the main component of many lattice-based constructions – the number theoretic transform. Our AVX2-based implementation results in a speed-up of roughly a factor of 2 over the previously best algorithms that appear in the literature. The techniques for obtaining this speed-up also have applications to other lattice-based schemes.
2018
TCHES
Data Flow Oriented Hardware Design of RNS-based Polynomial Multiplication for SHE Acceleration
This paper presents a hardware implementation of a Residue Polynomial Multiplier (RPM), designed to accelerate the full Residue Number System (RNS) variant of the Fan-Vercauteren scheme proposed by Bajard et al. [BEHZ16]. Our design speeds up polynomial multiplication via a Negative Wrapped Convolution (NWC) which locally computes the required RNS channel dependent twiddle factors. Compared to related works, this design is more versatile regarding the addressable parameter sets for the BFV scheme. This is mainly brought by our proposed twiddle factor generator that makes the design BRAM utilization independent of the RNS basis size, with a negligible communication bandwidth usage for non-payload data. Furthermore, the generalization of a DFT hardware generator is explored in order to generate RNS friendly NTT architectures. This approach helps us to validate our RPM design over parameter sets from the work of Halevi et al. [HPS18]. For the depth-20 setting, we achieve an estimated speed up for the residue polynomial multiplications greater than 76 during ciphertexts multiplication, and greater than 16 during relinearization. It thus results in a single-threaded Mult&Relin ciphertext operation in 109.4 ms (×3.19 faster than [HPS18]) with RPM counting for less than 15% of the new computation time. Our RPM design scales up with reasonable use of hardware resources and realistic bandwidth requirements. It can also be exploited for other RNS based implementations of RLWE cryptosystems.
2018
TCHES
Differential Fault Attacks on Deterministic Lattice Signatures
In this paper, we extend the applicability of differential fault attacks to lattice-based cryptography. We show how two deterministic lattice-based signature schemes, Dilithium and qTESLA, are vulnerable to such attacks. In particular, we demonstrate that single random faults can result in a nonce-reuse scenario which allows key recovery. We also expand this to fault-induced partial nonce-reuse attacks, which do not corrupt the validity of the computed signatures and thus are harder to detect.Using linear algebra and lattice-basis reduction techniques, an attacker can extract one of the secret key elements after a successful fault injection. Some other parts of the key cannot be recovered, but we show that a tweaked signature algorithm can still successfully sign any message. We provide experimental verification of our attacks by performing clock glitching on an ARM Cortex-M4 microcontroller. In particular, we show that up to 65.2% of the execution time of Dilithium is vulnerable to an unprofiled attack, where a random fault is injected anywhere during the signing procedure and still leads to a successful key-recovery.
2018
TCHES
Dismantling the AUT64 Automotive Cipher 📺
AUT64 is a 64-bit automotive block cipher with a 120-bit secret key used in a number of security sensitive applications such as vehicle immobilization and remote keyless entry systems. In this paper, we present for the first time full details of AUT64 including a complete specification and analysis of the block cipher, the associated authentication protocol, and its implementation in a widely-used vehicle immobiliser system that we have reverse engineered. Secondly, we reveal a number of cryptographic weaknesses in the block cipher design. Finally, we study the concrete use of AUT64 in a real immobiliser system, and pinpoint severe weaknesses in the key diversification scheme employed by the vehicle manufacturer. We present two key-recovery attacks based on the cryptographic weaknesses that, combined with the implementation flaws, break both the 8 and 24 round configurations of AUT64. Our attack on eight rounds requires only 512 plaintext-ciphertext pairs and, in the worst case, just 237.3 offline encryptions. In most cases, the attack can be executed within milliseconds on a standard laptop. Our attack on 24 rounds requires 2 plaintext-ciphertext pairs and 248.3 encryptions to recover the 120-bit secret key in the worst case. We have strong indications that a large part of the key is kept constant across vehicles, which would enable an attack using a single communication with the transponder and negligible offline computation.
2018
TCHES
Efficient Side-Channel Protections of ARX Ciphers
The current state of the art of Boolean masking for the modular addition operation in software has a very high performance overhead. Firstly, the instruction count is very high compared to a normal addition operation. Secondly, until recently, the entropy consumed by such protections was also quite high. Our paper significantly improves both aspects, by applying the Threshold Implementation (TI) methodology with two shares and by reusing internal values as randomness source in such a way that the uniformity is always preserved. Our approach performs considerably faster compared to the previously known masked addition and subtraction algorithms by Coron et al. and Biryukov et al. improving the state of the art by 36%, if we only consider the number of ARM assembly instructions. Furthermore, similar to the masked adder from Biryukov et al. we reduce the amount of randomness and only require one bit additional entroy per addition, which is a good trade-off for the improved performance. We applied our improved masked adder to ChaCha20, for which we provide two new first-order protected implementations and achieve a 36% improvement over the best published result for ChaCha20 using an ARM Cortex-M4 microprocessor.
2018
TCHES
EM Analysis in the IoT Context: Lessons Learned from an Attack on Thread 📺
The distinguishing feature of the Internet of Things is that many devices get interconnected. The threat of side-channel attacks in this setting is less understood than the threat of traditional network and software exploitation attacks that are perceived to be more powerful.This work is a case study of Thread, an emerging network and transport level stack designed to facilitate secure communication between heterogeneous IoT devices. We perform the first side-channel vulnerability analysis of the Thread networking stack. We leverage various network mechanisms to trigger manipulations of the security material or to get access to the network credentials. We choose the most feasible attack vector to build a complete attack that combines network specific mechanisms and Differential Electromagnetic Analysis. When successfully applied on a Thread network, the attack gives full network access to the adversary. We evaluate the feasibility of our attack in a TI CC2538 setup running OpenThread, a certified open-source implementation of the stack.The full attack does not succeed in our setting. The root cause for this failure is not any particular security feature of the protocol or the implementation, but a side-effect of a feature not related to security. We summarize the problems that we find in the protocol with respect to side-channel analysis, and suggest a range of countermeasures to prevent our attack and the other attack vectors we identified during the vulnerability analysis.In general, we demonstrate that elaborate security mechanisms of Thread make a side-channel attack not trivial to mount. Similar to a modern software exploit, it requires chaining multiple vulnerabilities. Nevertheless, such attacks are feasible. Being perhaps too expensive for settings like smart homes, they pose a relatively higher threat to the commercial setting. We believe our experience provides a useful lesson to designers of IoT protocols and devices.
2018
TCHES
ES-TRNG: A High-throughput, Low-area True Random Number Generator based on Edge Sampling
In this paper we present a novel true random number generator based on high-precision edge sampling. We use two novel techniques to increase the throughput and reduce the area of the proposed randomness source: variable-precision phase encoding and repetitive sampling. The first technique consists of encoding the oscillator phase with high precision in the regions around the signal edges and with low precision everywhere else. This technique results in a compact implementation at the expense of reduced entropy in some samples. The second technique consists of repeating the sampling at high frequency until the phase region encoded with high precision is captured. This technique ensures that only the high-entropy bits are sent to the output. The combination of the two proposed techniques results in a secure TRNG, which suits both ASIC and FPGA implementations. The core part of the proposed generator is implemented with 10 look-up tables (LUTs) and 5 flip-flops (FFs) of a Xilinx Spartan-6 FPGA, and achieves a throughput of 1.15 Mbps with 0.997 bits of Shannon entropy. On Intel Cyclone V FPGAs, this implementation uses 10 LUTs and 6 FFs, and achieves a throughput of 1.07 Mbps. This TRNG design is supported by a stochastic model and a formal security evaluation.
2018
TCHES
Evaluation and Monitoring of Free Running Oscillators Serving as Source of Randomness
In this paper, we evaluate clock signals generated in ring oscillators and self-timed rings and the way their jitter can be transformed into random numbers. We show that counting the periods of the jittery clock signal produces random numbers of significantly better quality than the methods in which the jittery signal is simply sampled (the case in almost all current methods). Moreover, we use the counter values to characterize and continuously monitor the source of randomness. However, instead of using the widely used statistical variance, we propose to use Allan variance to do so. There are two main advantages: Allan variance is insensitive to low frequency noises such as flicker noise that are known to be autocorrelated and significantly less circuitry is required for its computation than that used to compute commonly used variance. We also show that it is essential to use a differential principle of randomness extraction from the jitter based on the use of two identical oscillators to avoid autocorrelations originating from external and internal global jitter sources and that this fact is valid for both kinds of rings. Last but not least, we propose a method of statistical testing based on high order Markov model to show the reduced dependencies when the proposed randomness extraction is applied.
2018
TCHES
ExpFault: An Automated Framework for Exploitable Fault Characterization in Block Ciphers 📺
Malicious exploitation of faults for extracting secrets is one of the most practical and potent threats to modern cryptographic primitives. Interestingly, not every possible fault for a cryptosystem is maliciously exploitable, and evaluation of the exploitability of a fault is nontrivial. In order to devise precise defense mechanisms against such rogue faults, a comprehensive knowledge is required about the exploitable part of the fault space of a cryptosystem. Unfortunately, the fault space is diversified and of formidable size even while a single cryptoprimitive is considered and traditional manual fault analysis techniques may often fall short to practically cover such a fault space within reasonable time. An automation for analyzing individual fault instances for their exploitability is thus inevitable. Such an automation is supposed to work as the core engine for analyzing the fault spaces of cryptographic primitives. In this paper, we propose an automation for evaluating the exploitability status of fault instances from block ciphers, mainly in the context of Differential Fault Analysis (DFA) attacks. The proposed framework is generic and scalable, which are perhaps the two most important features for covering diversified fault spaces of formidable size originating from different ciphers. As a proof-of-concept, we reconstruct some known attack examples on AES and PRESENT using the framework and finally analyze a recently proposed cipher GIFT [BPP+17] for the first time. It is found that the secret key of GIFT can be uniquely determined with 1 nibble fault instance injected at the beginning of the 25th round with a reasonable computational complexity of 214.
2018
TCHES
Extending Glitch-Free Multiparty Protocols to Resist Fault Injection Attacks
Side channel analysis and fault attacks are two powerful methods to analyze and break cryptographic implementations. At CHES 2011, Roche and Prouff applied secure multiparty computation to prevent side-channel attacks. While multiparty computation is known to be fault-resistant as well, the particular scheme used for side-channel protection does not currently offer this feature. This work introduces a new secure multiparty circuit to prevent both fault injection attacks and sidechannel analysis. The new scheme extends the Roche and Prouff scheme to make faults detectable. Arithmetic operations have been redesigned to propagate fault information until a new secrecy-preserving fault detection can be performed. A new recombination operation ensures randomization of the output in the case of a fault, ensuring that nothing can be learned from the faulty output. The security of the new scheme is proved in the ISW probing model, using the reformulated t-SNI security notion. Besides the new scheme and its security proof, we also present an extensive performance analysis, including a proof-of-concept, software-based AES implementation featuring the masking technique to resist both fault and side-channel attacks at the same time. The performance analysis for different security levels are given for the ARM-M0+ MCU with its memory requirements. A comprehensive leakage analysis shows that a careful implementation of the scheme achieves the expected security level.
2018
TCHES
FACE: Fast AES CTR mode Encryption Techniques based on the Reuse of Repetitive Data
The Advanced Encryption Standard (AES) algorithm and Counter (CTR) mode are used for numerous services as an encryption technique that provides confidentiality. Even though the AES with counter (AES CTR) mode has an advantage in that it can process multiple data blocks in parallel, its implementation should also be observed to reduce the computational burden of current services.In this paper, we propose an implementation method called FACE that can improve the performance of the AES CTR mode. The proposed method is based on five caches of frequently occurring intermediate values, so that it reduces the number of unnecessary computations. Our method can be employed in any AES CTR implementation, regardless of the platform, environment, or implementation method. There are two known AES implementation techniques, namely, counter-mode caching and bitslicing. FACE extends counter-mode caching in order to optimize the previous result and to maximize the scope of caching. We show that FACE can be applied efficiently to various implementations (table-based, bitsliced, and AES-NI-based). In particular, this is the first attempt to combine our extended counter-mode caching with bitsliced implementations of AES, and is also the first to apply counter-mode caching up to the round transformations of AES-NI implementation. To prove the efficiency of our proposed method, we conduct a performance evaluation in various environments, which we then compare with the previous fastest results. Our bitsliced FACE needs 6.41 cycles/byte on an Intel Core 2, and AES-NI-based FACE records 0.44 cycles/byte on an Intel Core i7.
2018
TCHES
Fast FPGA Implementations of Diffie-Hellman on the Kummer Surface of a Genus-2 Curve 📺
We present the first hardware implementations of Diffie-Hellman key exchange based on the Kummer surface of Gaudry and Schost’s genus-2 curve targeting a 128-bit security level. We describe a single-core architecture for lowlatency applications and a multi-core architecture for high-throughput applications. Synthesized on a Xilinx Zynq-7020 FPGA, our architectures perform a key exchange with lower latency and higher throughput than any other reported implementation using prime-field elliptic curves at the same security level. Our single-core architecture performs a scalar multiplication with a latency of 82 microseconds while our multicore architecture achieves a throughput of 91,226 scalar multiplications per second. When compared to similar implementations of Microsoft’s Fourℚ on the same FPGA, this translates to an improvement of 48% in latency and 40% in throughput for the single-core and multi-core architecture, respectively. Both our designs exhibit constant-time execution to thwart timing attacks, use the Montgomery ladder for improved resistance against SPA, and support a countermeasure against fault attacks.
2018
TCHES
Fault Attacks Made Easy: Differential Fault Analysis Automation on Assembly Code 📺
Over the past decades, fault injection attacks have been extensively studied due to their capability to efficiently break cryptographic implementations. Fault injection attack models are normally determined by analyzing the cipher structure and finding exploitable spots in non-linear and permutation layers. However, this level of abstraction is often too high to distinguish vulnerable parts of software implementations, due to specific operations and optimizations. On the other hand, manually analyzing the assembly code requires non-negligible amount of time and expertise. In this paper, we propose an automated approach for analyzing cipher implementations in assembly. We represent the whole assembly program as a data flow graph so that the vulnerable spots can be found efficiently. Fault propagation is analyzed in a subgraph constructed from each vulnerable spot, allowing equations for Differential Fault Analysis (DFA) to be automatically generated. We have created a tool that implements our approach: DATAC – DFA Automation Tool for Assembly Code. We have successfully used this tool for attacking PRESENT- 80, being able to find implementation-specific vulnerabilities that can be exploited in order to recover the last round key with 16 faults. Our results show that DATAC is useful in finding attack spots that are not visible from the cipher structure, but can be easily exploited when dealing with real-world implementations.
2018
TCHES
FPGA-based Accelerator for Post-Quantum Signature Scheme SPHINCS-256 📺
In recent years, a substantial amount of research has been conducted and progress made in the area of quantum computers. Small functional prototypes have already been reported. If they scale as expected, they will eventually be able to break current public-key cryptosystems. The goal of post-quantum cryptography is to develop cryptographic systems that are secure against attacks originating from both quantum and classical computers. Frequently referred post-quantum signature schemes are based on the security of hash functions. A promising candidate in this group is SPHINCS-256. This paper presents the first FPGA-based hardware accelerator for SPHINCS-256. It can be implemented on an entry-level FPGA, occupying roughly 19,000 LUTs, 38,000 FFs and 36 BRAMs. On a Kintex-7 Xilinx FPGA, signing takes 1.53 milliseconds, and verification needs only 65 microseconds. Area and throughput of the accelerator are in a range that outperform today’s widely used RSA signature scheme. The performance can even keep up with ECDSA accelerators. Hence, SPHINCS-256 is a hot candidate to replace RSA and ECDSA in a post-quantum world.
2018
TCHES
FPGAhammer: Remote Voltage Fault Attacks on Shared FPGAs, suitable for DFA on AES
With each new technology generation, the available resources on Field Programmable Gate Arrays increase, making them more attractive for partial access from multiple users. They get increasingly adopted as accelerators in various application domains, embedded in shared Systems on Chip or remote cloud services. Thus, some recent works have already explored Denial-of-Service and side-channel attacks, where an FPGA fabric is shared among multiple users. In this work, we show how fault attacks can be launched within an FPGA, through software-provided bitstreams alone. Excessive voltage drops can be generated from legitimate logic mapped into the FPGA to cause timing faults, reaching from spatially and logically isolated partitions of one to another user of the FPGA fabric. To cause this voltage drop, we first show how specific patterns to activate Ring Oscillators can cause timing failures in simple test designs on various FPGA boards. Subsequently, we analyze and adapt an existing fault model for the Advanced Encryption Standard to match the accuracy of our fault attack. In the same multi-user scenario, we show as a proof-of-concept how a successful Differential Fault Analysis attack on an AES module can be launched. We perform experiments on three FPGA boards of the same model and confirm that the attack adapts to all systems and is successful under process variation, but with different susceptibility to faults. The paper is concluded by validating the attack on another platform, and analyzing the vulnerability based on a timing analysis, proving the applicability to different devices.
2018
TCHES
Generic Low-Latency Masking in Hardware 📺
In this work, we introduce a generalized concept for low-latency masking that is applicable to any implementation and protection order, and (in its most extreme form) does not require on-the-fly randomness. The main idea of our approach is to avoid collisions of shared variables in nonlinear circuit parts and to skip the share compression. We show the feasibility of our approach on a full implementation of a one-round unrolled Ascon variant and on an AES S-box case study. Additionally, we discuss possible trade-offs to make our approach interesting for practical implementations. As a result, we obtain a first-order masked AES S-box that is calculated in a single clock cycle with rather high implementation costs (60.7 kGE), and a two-cycle variant with much less implementation costs (6.7 kGE). The side-channel resistance of our Ascon S-box designs up to order three are then verified using the formal analysis tool of [BGI+18]. Furthermore, we introduce a taint checking based verification approach that works specifically for our low-latency approach and allows us to verify large circuits like our low-latency AES S-box design in reasonable time.
2018
TCHES
Hardware Masking, Revisited 📺
MaskingHardware masking schemes have shown many advances in the past few years. Through a series of publications their implementation cost has dropped significantly and flaws have been fixed where present. Despite these advancements it seems that a limit has been reached when implementing masking schemes on FPGA platforms. Indeed, even with a correct transition from the masking scheme to the masking realization (i.e., when the implementation is not buggy) it has been shown that the implementation can still exhibit unexpected leakage, e.g., through variations in placement and routing. In this work, we show that the reason for such unexpected leakages is the violation of an underlying assumption made by all masking schemes, i.e., that the leakage of the circuit is a linear sum of leakages associated to each share. In addition to the theory of VLSI which supports our claim, we perform a wide range of experiments based on an FPGA) to find out under what circumstances this causes a masked hardware implementation to show undesirable leakage. We further illustrate case studies, where publicly-known secure designs exhibit first-order leakage when being operated at certain conditions.
2018
TCHES
High Order Masking of Look-up Tables with Common Shares 📺
Masking is an effective countermeasure against side-channel attacks. In this paper, we improve the efficiency of the high-order masking of look-up tables countermeasure introduced at Eurocrypt 2014, based on a combination of three techniques, and still with a proof of security in the Ishai-Sahai-Wagner (ISW) probing model. The first technique consists in proving security under the stronger t-SNI definition, which enables to use n = t+1 shares instead of n = 2t+1 against t-th order attacks. The second technique consists in progressively incrementing the number of shares within the countermeasure, from a single share to n, thereby reducing the complexity of the countermeasure. The third technique consists in adapting the common shares approach introduced by Coron et al. at CHES 2016, so that half of a randomized look-up table can be pre-computed for multiple SBoxes. We show that our techniques perform well in practice. In theory, the combination of the three techniques should lead to a factor 10.7 improvement in efficiency, for a large number of shares. For a practical implementation with a reasonable number of shares, we get a 4.8 speed-up factor for AES.
2018
TCHES
High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA 📺
Homomorphic encryption (HE) offers great capabilities that can solve a wide range of privacy-preserving computing problems. This tool allows anyone to process encrypted data producing encrypted results that only the decryption key’s owner can decrypt. Although HE has been realized in several public implementations, its performance is quite demanding. The reason for this is attributed to the huge amount of computation required by secure HE schemes. In this work, we present a CUDAbased implementation of the Fan and Vercauteren (FV) Somewhat HomomorphicEncryption (SHE) scheme. We demonstrate several algebraic tools such as the Chinese Remainder Theorem (CRT), Residual Number System (RNS) and Discrete Galois Transform (DGT) to accelerate and facilitate FV computation on GPUs. We also show how the entire FV computation can be done on GPU without multi-precision arithmetic. We compare our GPU implementation with two mature state-of-the-art implementations: 1) Microsoft SEAL v2.3.0-4 and 2) NFLlib-FV. Our implementation outperforms them and achieves on average 5.37x, 7.37x, 22.22x, 5.11x and 13.18x (resp. 2.03x, 2.94x, 27.86x, 8.53x and 18.69x) for key generation, encryption, decryption, homomorphic addition and homomorphic multiplication against SEAL-FVRNS (resp. NFLlib-FV).
2018
TCHES
Improved High-Order Conversion From Boolean to Arithmetic Masking 📺
Masking is a very common countermeasure against side channel attacks. When combining Boolean and arithmetic masking, one must be able to convert between the two types of masking, and the conversion algorithm itself must be secure against side-channel attacks. An efficient high-order Boolean to arithmetic conversion scheme was recently described at CHES 2017, with complexity independent of the register size. In this paper we describe a simplified variant with fewer mask refreshing, and still with a proof of security in the ISW probing model. In practical implementations, our variant is roughly 25% faster.
2018
TCHES
Key Extraction Using Thermal Laser Stimulation A Case Study on Xilinx Ultrascale FPGAs
Thermal laser stimulation (TLS) is a failure analysis technique, which can be deployed by an adversary to localize and read out stored secrets in the SRAM of a chip. To this date, a few proof-of-concept experiments based on TLS or similar approaches have been reported in the literature, which do not reflect a real attack scenario. Therefore, it is still questionable whether this attack technique is applicable to modern ICs equipped with side-channel countermeasures. The primary aim of this work is to assess the feasibility of launching a TLS attack against a device with robust security features. To this end, we select a modern FPGA, and more specifically, its key memory, the so-called battery-backed SRAM (BBRAM), as a target. We demonstrate that an attacker is able to extract the stored 256-bit AES key used for the decryption of the FPGA’s bitstream, by conducting just a single non-invasive measurement. Moreover, it becomes evident that conventional countermeasures are incapable of preventing our attack since the FPGA is turned off during key recovery. Based on our time measurements, the required effort to develop the attack is shown to be less than 7 hours. To avert this powerful attack, we propose a low-cost and CMOS compatible countermeasure circuit, which is capable of protecting the BBRAM from TLS attempts even when the FPGA is powered off. Using a proof-of-concept prototype of our countermeasure, we demonstrate its effectiveness against TLS key extraction attempts.
2018
TCHES
Leakage Detection with the x2-Test 📺
We describe how Pearson’s χ2-test can be used as a natural complement to Welch’s t-test for black box leakage detection. In particular, we show that by using these two tests in combination, we can mitigate some of the limitations due to the moment-based nature of existing detection techniques based on Welch’s t-test (e.g., for the evaluation of higher-order masked implementations with insufficient noise). We also show that Pearson’s χ2-test is naturally suited to analyze threshold implementations with information lying in multiple statistical moments, and can be easily extended to a distinguisher for key recovery attacks. As a result, we believe the proposed test and methodology are interesting complementary ingredients of the side-channel evaluation toolbox, for black box leakage detection and non-profiled attacks, and as a preliminary before more demanding advanced analyses.
2018
TCHES
Linear Repairing Codes and Side-Channel Attacks 📺
To strengthen the resistance of countermeasures based on secret sharing,several works have suggested to use the scheme introduced by Shamir in 1978, which proposes to use the evaluation of a random d-degree polynomial into n ≥ d + 1 public points to share the sensitive data. Applying the same principles used against the classical Boolean sharing, all these works have assumed that the most efficient attack strategy was to exploit the minimum number of shares required to rebuild the sensitive value; which is d + 1 if the reconstruction is made with Lagrange’s interpolation. In this paper, we highlight first an important difference between Boolean and Shamir’s sharings which implies that, for some signal-to-noise ratio, it is more advantageous for the adversary to observe strictly more than d + 1 shares. We argue that this difference is related to the existence of so-called linear exact repairing codes, which themselves come with reconstruction formulae that need (much) less information (counted in bits) than Lagrange’s interpolation. In particular, this result implies that the choice of the public points in Shamir’s sharing has an impact on the countermeasure strength, which confirms previous observations made by Wang et al. at CARDIS 2016 for the so-called inner product sharing which is a generalization of Shamir’s scheme. As another contribution, we exhibit a positive impact of the existence of linear exact repairing schemes; we indeed propose to use them to improve the state-of-the-art multiplication algorithms dedicated to Shamir’s sharing. We argue that the improvement can be effective when the multiplication operation in the sub-fields is at least two times smaller than that of the base field.
2018
TCHES
Low Randomness Masking and Shuffling: An Evaluation Using Mutual Information
Side-channel countermeasure designers often face severe performance overheads when trying to protect a device. Widely applied countermeasures such as masking and shuffling entail generating a large amount of random numbers, which can result in a computational bottleneck. To mitigate the randomness cost, this work evaluates low-randomness versions of both masking and shuffling, namely Recycled Randomness Masking (RRM) and Reduced Randomness Shuffling (RRS). These countermeasures employ memory units to store generated random numbers and reuse them in subsequent computations,making them primarily suitable for implementation on devices with sufficient memory. Both RRM and RRS are evaluated using the MI-based framework in the context of horizontal attacks. The evaluation exhibits the tradeoff between the randomness cost and the noisy leakage security level offered by the countermeasures, enabling the designer to fine-tune a masking or shuffling scheme and maximize the security level achieved for a certain cost.
2018
TCHES
Mixing Additive and Multiplicative Masking for Probing Secure Polynomial Evaluation Methods 📺
Masking is a sound countermeasure to protect implementations of block- cipher algorithms against Side Channel Analysis (SCA). Currently, the most efficient masking schemes use Lagrange’s Interpolation Theorem in order to represent any S- box by a polynomial function over a binary finite field. Masking the processing of an S-box is then achieved by masking every operation involved in the evaluation of its polynomial representation. While the common approach requires to use the well- known Ishai-Sahai-Wagner (ISW) scheme in order to secure this processing, there exist alternatives. In the particular case of power functions, Genelle, Prouff and Quisquater proposed an efficient masking scheme (GPQ). However, no generalization has been suggested for polynomial functions so far. In this paper, we solve the open problem of extending GPQ for polynomials, and we also solve the open problem of proving that both the original scheme and its variants for polynomials satisfy the t-SNI security definition. Our approach to extend GPQ is based on the cyclotomic method and results in an alternate cyclotomic method which is three times faster in practice than the original proposal in almost all scenarios we address. The best- known method for polynomial evaluation is currently CRV which requires to use the cyclotomic method for one of its step. We also show how to plug our alternate cyclo- tomic approach into CRV and again provide an alternate approach that outperforms the original in almost all scenarios. We consider the masking of n-bit S-boxes for n ∈ [4;8] and we get in practice 35% improvement of efficiency for S-boxes with dimension n ∈ {5,7,8} and 25% for 6-bit S-boxes.
2018
TCHES
Multiplicative Masking for AES in Hardware
Hardware masked AES designs usually rely on Boolean masking and perform the computation of the S-box using the tower-field decomposition. On the other hand, splitting sensitive variables in a multiplicative way is more amenable for the computation of the AES S-box, as noted by Akkar and Giraud. However, multiplicative masking needs to be implemented carefully not to be vulnerable to first-order DPA with a zero-value power model. Up to now, sound higher-order multiplicative masking schemes have been implemented only in software. In this work, we demonstrate the first hardware implementation of AES using multiplicative masks. The method is tailored to be secure even if the underlying gates are not ideal and glitches occur in the circuit. We detail the design process of first- and second-order secure AES-128 cores, which result in the smallest die area to date among previous state-of-the-art masked AES implementations with comparable randomness cost and latency. The first- and second-order masked implementations improve resp. 29% and 18% over these designs. We deploy our construction on a Spartan-6 FPGA and perform a side-channel evaluation. No leakage is detected with up to 50 million traces for both our first- and second-order implementation. For the latter, this holds both for univariate and bivariate analysis.
2018
TCHES
New Bleichenbacher Records: Fault Attacks on qDSA Signatures
In this paper, we optimize Bleichenbacher’s statistical attack technique against (EC)DSA and other Schnorr-like signature schemes with biased or partially exposed nonces. Previous approaches to Bleichenbacher’s attack suffered from very large memory consumption during the so-called “range reduction” phase. Using a carefully analyzed and highly parallelizable approach to this range reduction based on the Schroeppel–Shamir algorithm for knapsacks, we manage to overcome the memory barrier of previous work while maintaining a practical level of efficiency in terms of time complexity.As a separate contribution, we present new fault attacks against the qDSA signature scheme of Renes and Smith (ASIACRYPT 2017) when instantiated over the Curve25519 Montgomery curve, and we validate some of them on the AVR microcontroller implementation of qDSA using actual fault experiments on the ChipWhisperer-Lite evaluation board. These fault attacks enable an adversary to generate signatures with 2 or 3 bits of the nonces known.Combining our two contributions, we are able to achieve a full secret key recovery on qDSA by applying our version of Bleichenbacher’s attack to these faulty signatures. Using a hybrid parallelization model relying on both shared and distributed memory, we achieve a very efficient implementation of our highly scalable range reduction algorithm. This allows us to complete Bleichenbacher’s attack in the 252-bit prime order subgroup of Curve25519 within a reasonable time frame and using relatively modest computational resources both for 3-bit nonce exposure and for the much harder case of 2-bit nonce exposure. Both of these computations, and particularly the latter, set new records in the implementation of Bleichenbacher’s attack.
2018
TCHES
On Recovering Affine Encodings in White-Box Implementations
Ever since the first candidate white-box implementations by Chow et al. in 2002, producing a secure white-box implementation of AES has remained an enduring challenge. Following the footsteps of the original proposal by Chow et al., other constructions were later built around the same framework. In this framework, the round function of the cipher is “encoded” by composing it with non-linear and affine layers known as encodings. However, all such attempts were broken by a series of increasingly efficient attacks that are able to peel off these encodings, eventually uncovering the underlying round function, and with it the secret key.These attacks, however, were generally ad-hoc and did not enjoy a wide applicability. As our main contribution, we propose a generic and efficient algorithm to recover affine encodings, for any Substitution-Permutation-Network (SPN) cipher, such as AES, and any form of affine encoding. For AES parameters, namely 128-bit blocks split into 16 parallel 8-bit S-boxes, affine encodings are recovered with a time complexity estimated at 232 basic operations, independently of how the encodings are built. This algorithm is directly applicable to a large class of schemes. We illustrate this on a recent proposal due to Baek, Cheon and Hong, which was not previously analyzed. While Baek et al. evaluate the security of their scheme to 110 bits, a direct application of our generic algorithm is able to break the scheme with an estimated time complexity of only 235 basic operations.As a second contribution, we show a different approach to cryptanalyzing the Baek et al. scheme, which reduces the analysis to a standalone combinatorial problem, ultimately achieving key recovery in time complexity 231. We also provide an implementation of the attack, which is able to recover the secret key in about 12 seconds on a standard desktop computer.
2018
TCHES
On the Difficulty of FSM-based Hardware Obfuscation
In today’s Integrated Circuit (IC) production chains, a designer’s valuable Intellectual Property (IP) is transparent to diverse stakeholders and thus inevitably prone to piracy. To protect against this threat, numerous defenses based on the obfuscation of a circuit’s control path, i.e. Finite State Machine (FSM), have been proposed and are commonly believed to be secure. However, the security of these sequential obfuscation schemes is doubtful since realistic capabilities of reverse engineering and subsequent manipulation are commonly neglected in the security analysis. The contribution of our work is threefold: First, we demonstrate how high-level control path information can be automatically extracted from third-party, gate-level netlists. To this end, we extend state-of-the-art reverse engineering algorithms to deal with Field Programmable Gate Array (FPGA) gate-level netlists equipped with FSM obfuscation. Second, on the basis of realistic reverse engineering capabilities we carefully review the security of state-of-the-art FSM obfuscation schemes. We reveal several generic strategies that bypass allegedly secure FSM obfuscation schemes and we practically demonstrate our attacks for a several of hardware designs, including cryptographic IP cores. Third, we present the design and implementation of Hardware Nanomites, a novel obfuscation scheme based on partial dynamic reconfiguration that generically mitigates existing algorithmic reverse engineering.
2018
TCHES
Persistent Fault Analysis on Block Ciphers
Persistence is an intrinsic nature for many errors yet has not been caught enough attractions for years. In this paper, the feature of persistence is applied to fault attacks, and the persistent fault attack is proposed. Different from traditional fault attacks, adversaries can prepare the fault injection stage before the encryption stage, which relaxes the constraint of the tight-coupled time synchronization. The persistent fault analysis (PFA) is elaborated on different implementations of AES-128, specially fault hardened implementations based on Dual Modular Redundancy (DMR). Our experimental results show that PFA is quite simple and efficient in breaking these typical implementations. To show the feasibility and practicability of our attack, a case study is illustrated on the shared library Libgcrypt with rowhammer technique. Approximately 8200 ciphertexts are enough to extract the master key of AES-128 when PFA is applied to Libgcrypt1.6.3 with redundant encryption based DMR. This work puts forward a new direction of fault attacks and can be extended to attack other implementations under more interesting scenarios.
2018
TCHES
Practical CCA2-Secure and Masked Ring-LWE Implementation 📺
During the last years public-key encryption schemes based on the hardness of ring-LWE have gained significant popularity. For real-world security applications assuming strong adversary models, a number of practical issues still need to be addressed. In this work we thus present an instance of ring-LWE encryption that is protected against active attacks (i.e., adaptive chosen-ciphertext attacks) and equipped with countermeasures against side-channel analysis. Our solution is based on a postquantum variant of the Fujisaki-Okamoto (FO) transform combined with provably secure first-order masking. To protect the key and message during decryption, we developed a masked binomial sampler that secures the re-encryption process required by FO. Our work shows that CCA2-secured RLWE-based encryption can be achieved with reasonable performance on constrained devices but also stresses that the required transformation and handling of decryption errors implies a performance overhead that has been overlooked by the community so far. With parameters providing 233 bits of quantum security, our implementation requires 4,176,684 cycles for encryption and 25,640,380 cycles for decryption with masking and hiding countermeasures on a Cortex-M4F. The first-order security of our masked implementation is also practically verified using the non-specific t-test evaluation methodology.
2018
TCHES
Preface to TCHES 2018
DOI: 10.13154/tches.v2018.i1.I-IV
2018
TCHES
2018
TCHES
Rhythmic Keccak: SCA Security and Low Latency in HW 📺
Glitches entail a great issue when securing a cryptographic implementation in hardware. Several masking schemes have been proposed in the literature that provide security even in the presence of glitches. The key property that allows this protection was introduced in threshold implementations as non-completeness. We address crucial points to ensure the right compliance of this property especially for low-latency implementations. Specifically, we first discuss the existence of a flaw in DSD 2017 implementation of Keccak by Gross et al. in violation of the non-completeness property and propose a solution. We perform a side-channel evaluation on the first-order and second-order implementations of the proposed design where no leakage is detected with up to 55 million traces. Then, we present a method to ensure a non-complete scheme of an unrolled implementation applicable to any order of security or algebraic degree of the shared function. By using this method we design a two-rounds unrolled first-order Keccak-
2018
TCHES
Saber on ARM CCA-secure module lattice-based key encapsulation on ARM
The CCA-secure lattice-based post-quantum key encapsulation scheme Saber is a candidate in the NIST’s post-quantum cryptography standardization process. In this paper, we study the implementation aspects of Saber in resourceconstrained microcontrollers from the ARM Cortex-M series which are very popular for realizing IoT applications. In this work, we carefully optimize various parts of Saber for speed and memory. We exploit digital signal processing instructions and efficient memory access for a fast implementation of polynomial multiplication. We also use memory efficient Karatsuba and just-in-time strategy for generating the public matrix of the module lattice to reduce the memory footprint. We also show that our optimizations can be combined with each other seamlessly to provide various speed-memory trade-offs. Our speed optimized software takes just 1,147K, 1,444K, and 1,543K clock cycles on a Cortex-M4 platform for key generation, encapsulation and decapsulation respectively. Our memory efficient software takes 4,786K, 6,328K, and 7,509K clock cycles on an ultra resource-constrained Cortex-M0 platform for key generation, encapsulation, and decapsulation respectively while consuming only 6.2 KB of memory at most. These results show that lattice-based key encapsulation schemes are perfectly practical for securing IoT devices from quantum computing attacks.
2018
TCHES
SAEB: A Lightweight Blockcipher-Based AEAD Mode of Operation 📺
Lightweight cryptography in computationally constrained devices is actively studied. In contrast to advances of lightweight blockcipher in the last decade, lightweight mode of operation is seemingly not so mature, yet it has large impact in performance. Therefore, there is a great demand for lightweight mode of operation, especially that for authenticated encryption with associated data (AEAD). Among many known properties of conventional modes of operation, the following four properties are essential for constrained devices: Minimum State Size: the state size equals to a block size of a blockcipher. Inverse Free: no need for a blockcipher decryption. XOR Only: only XOR is needed in addition to a blockcipher encryption. Online: a data block is processed only once. The properties 1 and 4 contribute to small memory usage, and the properties 2 and 3 contribute to small program/circuit footprint. On top of the above properties, the fifth property regarding associated data (AD) is also important for performance: Efficient Handling of Static AD: static AD can be precomputed. We design a lightweight blockcipher-based AEAD mode of operation called SAEB: the first mode of operation that satisfies all the five properties to the best of our knowledge. Performance of SAEB is evaluated in various software and hardware platforms. The evaluation results show that SAEB outperforms conventional blockcipher-based AEAD modes of operation in various performance metrics for lightweight cryptography.
2018
TCHES
Side-Channel Attacks on Post-Quantum Signature Schemes based on Multivariate Quadratic Equations - Rainbow and UOV -
In this paper, we investigate the security of Rainbow and Unbalanced Oil-and-Vinegar (UOV) signature schemes based on multivariate quadratic equations, which is one of the most promising alternatives for post-quantum signature schemes, against side-channel attacks. We describe correlation power analysis (CPA) on the schemes that yield full secret key recoveries. First, we identify a secret leakage of secret affine maps S and T during matrix-vector products in signing when Rainbow is implemented with equivalent keys rather than random affine maps for optimal implementations. In this case, the simple structure of the equivalent keys leads to the retrieval of the entire secret affine map T. Next, we extend the full secret key recovery to the general case using random affine maps via a hybrid attack: after recovering S by performing CPA, we recover T by mounting algebraic key recovery attacks. We demonstrate how this leakage on Rainbow can be practically exploited on an 8-bit AVR microcontroller using CPA. Consequently, our CPA can be applied to Rainbow-like multi-layered schemes regardless of the use of the simple-structured equivalent keys and UOV-like single layer schemes with the implementations using the equivalent keys of the simple structure. This is the first result on the security of multivariate quadratic equations-based signature schemes using only CPA. Our result can be applied to Rainbow-like multi-layered schemes and UOV-like single layer schemes submitted to NIST for Post-Quantum Cryptography Standardization.
2018
TCHES
SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange
We present high-speed implementations of the post-quantum supersingular isogeny Diffie-Hellman key exchange (SIDH) and the supersingular isogeny key encapsulation (SIKE) protocols for 32-bit ARMv7-A processors with NEON support. The high performance of our implementations is mainly due to carefully optimized multiprecision and modular arithmetic that finely integrates both ARM and NEON instructions in order to reduce the number of pipeline stalls and memory accesses, and a new Montgomery reduction technique that combines the use of the UMAAL instruction with a variant of the hybrid-scanning approach. In addition, we present efficient implementations of SIDH and SIKE for 64-bit ARMv8-A processors, based on a high-speed Montgomery multiplication that leverages the power of 64-bit instructions. Our experimental results consolidate the practicality of supersingular isogeny-based protocols for many real-world applications. For example, a full key-exchange execution of SIDHp503 is performed in about 176 million cycles on an ARM Cortex-A15 from the ARMv7-A family (i.e., 88 milliseconds @2.0GHz). On an ARM Cortex-A72 from the ARMv8-A family, the same operation can be carried out in about 90 million cycles (i.e., 45 milliseconds @1.992GHz). All our software is protected against timing and cache attacks. The techniques for modular multiplication presented in this work have broad applications to other cryptographic schemes.
2018
TCHES
SIFA: Exploiting Ineffective Fault Inductions on Symmetric Cryptography
Since the seminal work of Boneh et al., the threat of fault attacks has been widely known and techniques for fault attacks and countermeasures have been studied extensively. The vast majority of the literature on fault attacks focuses on the ability of fault attacks to change an intermediate value to a faulty one, such as differential fault analysis (DFA), collision fault analysis, statistical fault attack (SFA), fault sensitivity analysis, or differential fault intensity analysis (DFIA). The other aspect of faults—that faults can be induced and do not change a value—has been researched far less. In case of symmetric ciphers, ineffective fault attacks (IFA) exploit this aspect. However, IFA relies on the ability of an attacker to reliably induce reproducible deterministic faults like stuck-at faults on parts of small values (e.g., one bit or byte), which is often considered to be impracticable.As a consequence, most countermeasures against fault attacks do not focus on such attacks, but on attacks exploiting changes of intermediate values and usually try to detect such a change (detection-based), or to destroy the exploitable information if a fault happens (infective countermeasures). Such countermeasures implicitly assume that the release of “fault-free” ciphertexts in the presence of a fault-inducing attacker does not reveal any exploitable information. In this work, we show that this assumption is not valid and we present novel fault attacks that work in the presence of detection-based and infective countermeasures. The attacks exploit the fact that intermediate values leading to “fault-free” ciphertexts show a non-uniform distribution, while they should be distributed uniformly. The presented attacks are entirely practical and are demonstrated to work for software implementations of AES and for a hardware co-processor. These practical attacks rely on fault induction by means of clock glitches and hence, are achieved using only low-cost equipment. This is feasible because our attack is very robust under noisy fault induction attempts and does not require the attacker to model or profile the exact fault effect. We target two types of countermeasures as examples: simple time redundancy with comparison and several infective countermeasures. However, our attacks can be applied to a wider range of countermeasures and are not restricted to these two countermeasures.
2018
TCHES
Smashing the Implementation Records of AES S-box 📺
Canright S-box has been known as the most compact S-box design since its introduction back in CHES’05. Boyar-Peralta proposed logic-minimization heuristics that could reduce the gate count of Canright S-box from 120 gates to 113 gates, however synthesis results did not reflect much improvement. In CHES’15, Ueno et al. proposed an S-box that has a slightly higher area, but significantly faster than the previous designs, hence it was the most efficient (measured by area×delay) S-box implementation to date. In this paper, we propose two new designs for the AES S-box. One design has a smaller implementation area than both Canright and the 113-gate S-boxes. Hence, our first design is the smallest AES S-box to date, breaking the 13 years implementation record of Canright. The second design is faster and smaller than the Ueno S-box. Hence, our second design is both the fastest and the most efficient S-box design to date. While doing so, we also propose new logicminimization heuristics that outperform the previous algorithms of Boyar-Peralta. Finally, we conduct an exhaustive evaluation of each and every block in the S-box circuit, using both structural and behavioral HDL modeling, to reach the optimum synergy between theoretical algorithms and technology-supported optimization tools. We show that involving the technology-supported CAD tools in the analysis results in several counter-intuitive results.
2018
TCHES
Spin Me Right Round Rotational Symmetry for FPGA-Specific AES
The effort in reducing the area of AES implementations has largely been focused on Application-Specific Integrated Circuits (ASICs) in which a tower field construction leads to a small design of the AES S-box. In contrast, a naïve implementation of the AES S-box has been the status-quo on Field-Programmable Gate Arrays (FPGAs). A similar discrepancy holds for masking schemes – a wellknown side-channel analysis countermeasure – which are commonly optimized to achieve minimal area in ASICs.In this paper we demonstrate a representation of the AES S-box exploiting rotational symmetry which leads to a 50% reduction of the area footprint on FPGA devices. We present new AES implementations which improve on the state of the art and explore various trade-offs between area and latency. For instance, at the cost of increasing 4.5 times the latency, one of our design variants requires 25% less look-up tables (LUTs) than the smallest known AES on Xilinx FPGAs by Sasdrich and Güneysu at ASAP 2016. We further explore the protection of such implementations against first-order side-channel analysis attacks. Targeting the small area footprint on FPGAs, we introduce a heuristic-based algorithm to find a masking of a given function with d + 1 shares. Its application to our new construction of the AES S-box allows us to introduce the smallest masked AES implementation on Xilinx FPGAs, to-date.
2018
TCHES
Standard Lattice-Based Key Encapsulation on Embedded Devices
Lattice-based cryptography is one of the most promising candidates being considered to replace current public-key systems in the era of quantum computing. In 2016, Bos et al. proposed the key exchange scheme FrodoCCS, that is also a submission to the NIST post-quantum standardization process, modified as a key encapsulation mechanism (FrodoKEM). The security of the scheme is based on standard lattices and the learning with errors problem. Due to the large parameters, standard latticebased schemes have long been considered impractical on embedded devices. The FrodoKEM proposal actually comes with parameters that bring standard lattice-based cryptography within reach of being feasible on constrained devices. In this work, we take the final step of efficiently implementing the scheme on a low-cost FPGA and microcontroller devices and thus making conservative post-quantum cryptography practical on small devices. Our FPGA implementation of the decapsulation (the computationally most expensive operation) needs 7,220 look-up tables (LUTs), 3,549 flip-flops (FFs), a single DSP, and only 16 block RAM modules. The maximum clock frequency is 162 MHz and it takes 20.7 ms for the execution of the decapsulation. Our microcontroller implementation has a 66% reduced peak stack usage in comparison to the reference implementation and needs 266 ms for key pair generation, 284 ms for encapsulation, and 286 ms for decapsulation. Our results contribute to the practical evaluation of a post-quantum standardization candidate.
2018
TCHES
Stealthy Opaque Predicates in Hardware - Obfuscating Constant Expressions at Negligible Overhead 📺
Opaque predicates are a well-established fundamental building block for software obfuscation. Simplified, an opaque predicate implements an expression that provides constant Boolean output, but appears to have dynamic behavior for static analysis. Even though there has been extensive research regarding opaque predicates in software, techniques for opaque predicates in hardware are barely explored. In this work, we propose a novel technique to instantiate opaque predicates in hardware, such that they (1) are resource-efficient, and (2) are challenging to reverse engineer even with dynamic analysis capabilities. We demonstrate the applicability of opaque predicates in hardware for both, protection of intellectual property and obfuscation of cryptographic hardware Trojans. Our results show that we are able to implement stealthy opaque predicates in hardware with minimal overhead in area and no impact on latency.