Johann Großschädl

Publications and invited talks

Year

Venue

Title

2025

TCHES

Fast AVX-512 Implementation of the Optimal Ate Pairing on BLS12-381

Hao Cheng, Georgios Fotiadis, Johann Großschädl, Daniel Page

Non-degenerate bilinear maps on elliptic curves, commonly referred to as pairings, have many applications including short signature schemes, zero-knowledge proofs and remote attestation protocols. Computing a state-of-the-art pairing at the 128-bit security level, such as the optimal ate pairing over the curve BLS12-381, is very costly due to the high complexity of some of its sub-operations: most notable are the Miller loop and final exponentiation. In the past ten years, a few optimized pairing implementations have been introduced in the literature, but none of those took advantage of the vector (SIMD) extensions of state-of-the-art Intel and AMD CPUs, especially AVX-512; this is surprising, because doing so offers the potential to reach significant speed-ups. Consequently, the questions of 1) how computation of the optimal ate pairing can be effectively vectorized, and 2) what execution time such a vectorized implementation can achieve are still open. This paper addresses said questions by introducing a carefully-optimized AVX-512 implementation of the optimal ate pairing on BLS12-381. A central feature of the implementation is the use of 8-way Integer Fused Multiply-Add (IFMA) instructions, which are capable of executing eight 52 × 52-bit multiplications in a SIMD-parallel fashion. We introduce new vectorization strategies and describe optimizations of existing ones to speed up arithmetic operations in the extension fields F_{p^4}, F_{p^6}, and F_{p^{12}} as well as certain higher-level functions. Furthermore, we discuss some parallelization bottlenecks and how they impact execution time. We benchmarked our pairing software, which we call avxbls, on an Intel Core i3-1005G1 (“Ice Lake”) CPU and found that it needs 1,265,314 clock cycles (resp. 1,195,236 clock cycles) for the full pairing, with the Granger-Scott cyclotomic squaring (resp. compressed cyclotomic squaring) being used in the final exponentiation.
For comparison, the non-vectorized (i.e., scalar) x64 assembly implementation from the widely-used blst library has an execution time of 2,351,615 cycles, which is 1.86 times (resp. 1.97 times) slower. avxbls also outperforms Longa’s implementation (CHES 2023) by almost the same factor. The practical importance of these results is amplified by Intel’s recent announcement to support AVX10, which includes IFMA instructions, in all future CPUs.
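
The 8-way IFMA multiplication mentioned in the abstract can be illustrated with a small lane-wise model. This is only a sketch in plain Python and is not taken from avxbls: the helper names `madd52lo` and `madd52hi` are hypothetical and merely mirror the "low half" and "high half" multiply-add behaviour of the AVX-512 IFMA instructions, where each 64-bit lane contributes its low 52 bits to a 104-bit product.

```python
MASK52 = (1 << 52) - 1   # IFMA multiplies only the low 52 bits of each lane
MASK64 = (1 << 64) - 1   # each lane is a 64-bit unsigned integer
LANES = 8                # a 512-bit vector holds eight 64-bit lanes

def madd52lo(acc, b, c):
    """Per lane: acc + (low 52 bits of the 104-bit product b*c), mod 2^64."""
    return [(a + (((x & MASK52) * (y & MASK52)) & MASK52)) & MASK64
            for a, x, y in zip(acc, b, c)]

def madd52hi(acc, b, c):
    """Per lane: acc + (high 52 bits of the 104-bit product b*c), mod 2^64."""
    return [(a + (((x & MASK52) * (y & MASK52)) >> 52)) & MASK64
            for a, x, y in zip(acc, b, c)]

# Eight independent 52-bit limb products, one per lane.
b = [MASK52 - i for i in range(LANES)]
c = [(1 << 51) + 7 * i for i in range(LANES)]
zero = [0] * LANES
lo = madd52lo(zero, b, c)
hi = madd52hi(zero, b, c)

# The two halves together reconstruct the full product in every lane.
assert all(l + (h << 52) == x * y for l, h, x, y in zip(lo, hi, b, c))
```

Because each product half is at most 52 bits wide while the accumulator lane is 64 bits, several partial products can be accumulated before a carry propagation is needed, which is one reason a radix-2^52 limb representation suits this instruction pair.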

2022

TCHES

Highly Vectorized SIKE for AVX-512

Hao Cheng, Georgios Fotiadis, Johann Großschädl, Peter Y. A. Ryan

It is generally accepted that a large-scale quantum computer would be capable of breaking any public-key cryptosystem used today, thereby posing a serious threat to the security of the Internet’s public-key infrastructure. The US National Institute of Standards and Technology (NIST) addresses this threat with an open process for the standardization of quantum-safe key establishment and signature schemes, which is now in the final phase of the evaluation of candidates. SIKE (an abbreviation of Supersingular Isogeny Key Encapsulation) is one of the alternate candidates under evaluation and distinguishes itself from other candidates due to relatively short key lengths and relatively high computing costs. In this paper, we analyze how the latest generation of Intel’s Advanced Vector Extensions (AVX), in particular AVX-512IFMA, can be used to minimize the latency (resp. maximize the throughput) of the SIKE key encapsulation mechanism when executed on Ice Lake CPUs based on the Sunny Cove microarchitecture. We present various techniques to parallelize and speed up the base/extension field arithmetic, point arithmetic, and isogeny computations performed by SIKE. All these parallel processing techniques are combined in AvxSike, a highly optimized implementation of SIKE using Intel AVX-512IFMA instructions. Our experiments indicate that AvxSike instantiated with the SIKEp503 parameter set is approximately 1.5 times faster than the best AVX-512IFMA-based SIKE software in the literature to date. When executed on an Intel Core i3-1005G1 CPU, AvxSike outperforms the x64 assembly implementation of SIKE contained in Microsoft’s SIDHv3.4 library by a factor of about 2.5 for key generation and decapsulation, while the encapsulation is even 3.2 times faster.

2022

TCHES

RISC-V Instruction Set Extensions for Lightweight Symmetric Cryptography

Hao Cheng, Johann Großschädl, Ben Marshall, Dan Page, Thinh Pham

The NIST LightWeight Cryptography (LWC) selection process aims to standardise cryptographic functionality which is suitable for resource-constrained devices. Since the outcome is likely to have significant, long-lived impact, careful evaluation of each submission with respect to metrics explicitly outlined in the call is imperative. Beyond the robustness of submissions against cryptanalytic attack, metrics related to their implementation (e.g., execution latency and memory footprint) form an important example. Aiming to provide evidence allowing richer evaluation with respect to such metrics, this paper presents the design, implementation, and evaluation of one separate Instruction Set Extension (ISE) for each of the 10 LWC final round submissions, namely Ascon, Elephant, GIFT-COFB, Grain-128AEADv2, ISAP, PHOTON-Beetle, Romulus, Sparkle, TinyJAMBU, and Xoodyak; although we base the work on use of RISC-V, we argue that it provides more general insight.

2021

TCHES

An Instruction Set Extension to Support Software-Based Masking

Si Gao, Johann Großschädl, Ben Marshall, Daniel Page, Thinh Pham, Francesco Regazzoni

In both hardware and software, masking can represent an effective means of hardening an implementation against side-channel attack vectors such as Differential Power Analysis (DPA). Focusing on software, however, the use of masking can present various challenges: specifically, it often 1) requires significant effort to translate any theoretical security properties into practice, and, even then, 2) imposes a significant overhead in terms of efficiency. To address both challenges, this paper explores the use of an Instruction Set Extension (ISE) to support masking in software-based implementations of a range of (symmetric) cryptographic kernels including AES: we design, implement, and evaluate such an ISE, using RISC-V as the base ISA. Our ISE-supported first-order masked implementation of AES, for example, is an order of magnitude more efficient than a software-only alternative with respect to both execution latency and memory footprint; this renders it comparable to an unmasked implementation using the same metrics, but also first-order secure.
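
As background for the abstract above, first-order Boolean masking can be sketched as follows. This is a generic textbook-style illustration and not the ISE or the AES implementation from the paper: the function names are hypothetical, and a plain Python model only demonstrates functional correctness, not actual side-channel resistance.

```python
import secrets

BITS = 32  # assumed word size for this illustration

def mask(x):
    """Split x into two shares (x0, x1) with x0 ^ x1 == x."""
    r = secrets.randbits(BITS)
    return (x ^ r, r)

def unmask(shares):
    """Recombine the two shares into the plain value."""
    return shares[0] ^ shares[1]

def masked_xor(a, b):
    """XOR is linear, so it can be applied to each share independently."""
    return (a[0] ^ b[0], a[1] ^ b[1])

def masked_and(a, b):
    """AND is non-linear: cross terms are needed, refreshed with randomness r."""
    r = secrets.randbits(BITS)
    c0 = (a[0] & b[0]) ^ r
    c1 = (a[1] & b[1]) ^ (r ^ (a[0] & b[1]) ^ (a[1] & b[0]))
    return (c0, c1)

x, y = 0xDEADBEEF, 0x0F0F0F0F
sx, sy = mask(x), mask(y)
assert unmask(masked_xor(sx, sy)) == x ^ y
assert unmask(masked_and(sx, sy)) == x & y
```

The extra randomness and cross terms in `masked_and` hint at where the efficiency overhead of software masking comes from, which is the cost the abstract says an ISE aims to reduce.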

2021

TCHES

Batching CSIDH Group Actions using AVX-512

Hao Cheng, Georgios Fotiadis, Johann Großschädl, Peter Y. A. Ryan, Peter B. Rønne

Commutative Supersingular Isogeny Diffie-Hellman (or CSIDH for short) is a recently-proposed post-quantum key establishment scheme that belongs to the family of isogeny-based cryptosystems. The CSIDH protocol is based on the action of an ideal class group on a set of supersingular elliptic curves and comes with some very attractive features, e.g. the ability to serve as a “drop-in” replacement for the standard elliptic curve Diffie-Hellman protocol. Unfortunately, the execution time of CSIDH is prohibitively high for many real-world applications, mainly due to the enormous computational cost of the underlying group action. Consequently, there is a strong demand for optimizations that increase the efficiency of the class group action evaluation, which is not only important for CSIDH, but also for related cryptosystems like the signature schemes CSI-FiSh and SeaSign. In this paper, we explore how the AVX-512 vector extensions (incl. AVX-512F and AVX-512IFMA) can be utilized to optimize constant-time evaluation of the CSIDH-512 class group action with the goal of, respectively, maximizing throughput and minimizing latency. We introduce different approaches for batching group actions and computing them in SIMD fashion on modern Intel processors. In particular, we present a hybrid batching technique that, when combined with optimized (8 × 1)-way prime-field arithmetic, increases the throughput by a factor of 3.64 compared to a state-of-the-art (non-vectorized) x64 implementation. On the other hand, vectorization in a 2-way fashion aimed at reducing latency makes our AVX-512 implementation of the group action evaluation about 1.54 times faster than the state of the art. To the best of our knowledge, this paper is the first to demonstrate the high potential of using vector instructions to increase the throughput (resp. decrease the latency) of constant-time CSIDH.

2020

TOSC

Lightweight AEAD and Hashing using the Sparkle Permutation Family

Christof Beierle, Alex Biryukov, Luan Cardoso dos Santos, Johann Großschädl, Léo Perrin, Aleksei Udovenko, Vesselin Velichkov, Qingju Wang

We introduce the Sparkle family of permutations operating on 256, 384 and 512 bits. These are combined with the Beetle mode to construct a family of authenticated ciphers, Schwaemm, with security levels ranging from 120 to 250 bits. We also use them to build new sponge-based hash functions, Esch256 and Esch384. Our permutations are among those with the lowest footprint in software, without sacrificing throughput. These properties are allowed by our use of an ARX component (the Alzette S-box) as well as a carefully chosen number of rounds. The corresponding analysis is enabled by the long trail strategy which gives us the tools we need to efficiently bound the probability of all the differential and linear trails for an arbitrary number of rounds. We also present a new application of this approach where the only trails considered are those mapping the rate to the outer part of the internal state, such trails being the only relevant trails for instance in a differential collision attack. To further decrease the number of rounds without compromising security, we modify the message injection in the classical sponge construction to break the alignment between the rate and our S-box layer.

2020

CRYPTO

Alzette: a 64-bit ARX-box (feat. CRAX and TRAX)

Christof Beierle, Alex Biryukov, Luan Cardoso dos Santos, Johann Großschädl, Léo Perrin, Aleksei Udovenko, Vesselin Velichkov, Qingju Wang

S-boxes are the only source of non-linearity in many symmetric cryptographic primitives. While they are often defined as being functions operating on a small space, some recent designs propose the use of much larger ones (e.g., 32 bits). In this context, an S-box is then defined as a subfunction whose cryptographic properties can be estimated precisely. In this paper, we present a 64-bit ARX-based S-box called Alzette which can be evaluated in constant time using only 12 instructions on modern CPUs. Its parallel application can also leverage vector (SIMD) instructions. One iteration of Alzette has differential and linear properties comparable to those of the AES S-box, while two iterations are at least as secure as the AES super S-box. Since the state size is much larger than the typical 4 or 8 bits, the study of the relevant cryptographic properties of Alzette is not trivial. We further discuss how such wide S-boxes could be used to construct round functions of 64-, 128- and 256-bit (tweakable) block ciphers with good cryptographic properties that are guaranteed even in the related-tweak setting. We use these structures to design a very lightweight 64-bit block cipher (CRAX) which outperforms SPECK-64/128 for short messages on micro-controllers, and a 256-bit tweakable block cipher (TRAX) which can be used to obtain strong security guarantees against powerful adversaries (nonce misuse, quantum attacks).

2016

ASIACRYPT

Design Strategies for ARX with Provable Bounds: Sparx and LAX

Daniel Dinu, Léo Perrin, Aleksei Udovenko, Vesselin Velichkov, Johann Großschädl, Alex Biryukov

2015

FSE

Conversion from Arithmetic to Boolean Masking with Logarithmic Complexity

Jean-Sébastien Coron, Johann Großschädl, Mehdi Tibouchi, Praveen Kumar Vadnala

2015

CHES

Efficient Ring-LWE Encryption on 8-Bit AVR Processors

Zhe Liu, Hwajeong Seo, Sujoy Sinha Roy, Johann Großschädl, Howon Kim, Ingrid Verbauwhede