Georgios Fotiadis

CryptoDB

Georgios Fotiadis

Publications and invited talks

Year

Venue

Title

2025

TCHES

Fast AVX-512 Implementation of the Optimal Ate Pairing on BLS12-381 Abstract

Hao Cheng Georgios Fotiadis Johann Großschädl Daniel Page

Non-degenerate bilinear maps on elliptic curves, commonly referred to as pairings, have many applications including short signature schemes, zero-knowledge proofs and remote attestation protocols. Computing a state-of-the-art pairing at the 128-bit security level, such as the optimal ate pairing over the curve BLS12-381, is very costly due to the high complexity of some of its sub-operations: most notable are the Miller loop and final exponentiation. In the past ten years, a few optimized pairing implementations have been introduced in the literature, but none of those took advantage of the vector (SIMD) extensions of state-of-the-art Intel and AMD CPUs, especially AVX-512; this is surprising, because doing so offers the potential to reach significant speed-ups. Consequently, the questions of 1) how computation of the optimal ate pairing can be effectively vectorized, and 2) what execution time such a vectorized implementation can achieve are still open. This paper addresses said questions by introducing a carefully-optimized AVX-512 implementation of the optimal ate pairing on BLS12-381. A central feature of the implementation is the use of 8-way Integer Fused Multiply-Add (IFMA) instructions, which are capable to execute eight 52 x 52-bit multiplications in a SIMD-parallel fashion. We introduce new vectorization strategies and describe optimizations of existing ones to speed up arithmetic operations in the extension fields Fp4 , Fp6 , and Fp12 as well as certain higher-level functions. Furthermore, we discuss some parallelization bottlenecks and how they impact execution time. We benchmarked our pairing software, which we call avxbls, on an Intel Core i3-1005G1 (“Ice Lake”) CPU and found that it needs 1, 265, 314 clock cycles (resp. 1, 195, 236 clock cycles) for the full pairing, with the Granger-Scott cyclotomic squaring (resp. compressed cyclotomic squaring) being used in the final exponentiation. For comparison, the non-vectorized (i.e., scalar) x64 assembly implementation from the widely-used blst library has an execution time of 2, 351, 615 cycles, which is 1.86 times (resp. 1.97 times) slower. avxbls also outperforms Longa’s implementation (CHES 2023) by almost the same factor. The practical importance of these results is amplified by Intel’s recent announcement to support AVX10, which includes IFMA instructions, in all future CPUs.

2024

CIC

A short-list of pairing-friendly curves resistant to the Special TNFS algorithm at the 192-bit security level Abstract

Diego F. Aranha Georgios Fotiadis Aurore Guillevic

<p>For more than two decades, pairings have been a fundamental tool for designing elegant cryptosystems, varying from digital signature schemes to more complex privacy-preserving constructions. However, the advancement of quantum computing threatens to undermine public-key cryptography. Concretely, it is widely accepted that a future large-scale quantum computer would be capable to break any public-key cryptosystem used today, rendering today's public-key cryptography obsolete and mandating the transition to quantum-safe cryptographic solutions. This necessity is enforced by numerous recognized government bodies around the world, including NIST which initiated the first open competition in standardizing post-quantum (PQ) cryptographic schemes, focusing primarily on digital signatures and key encapsulation/public-key encryption schemes. Despite the current efforts in standardizing PQ primitives, the landscape of complex, privacy-preserving cryptographic protocols, e.g., zkSNARKs/zkSTARKs, is at an early stage. Existing solutions suffer from various disadvantages in terms of efficiency and compactness and in addition, they need to undergo the required scrutiny to gain the necessary trust in the academic and industrial domains. Therefore, it is believed that the migration to purely quantum-safe cryptography would require an intermediate step where current classically secure protocols and quantum-safe solutions will co-exist. This is enforced by the report of the Commercial National Security Algorithm Suite version 2.0, mandating transition to quantum-safe cryptographic algorithms by 2033 and suggesting to incorporate ECC at 192-bit security in the meantime. To this end, the present paper aims at providing a comprehensive study on pairings at 192-bit security level. We start with an exhaustive review in the literature to search for all possible recommendations of such pairing constructions, from which we extract the most promising candidates in terms of efficiency and security, with respect to the advanced Special TNFS attacks. Our analysis is focused, not only on the pairing computation itself, but on additional operations that are relevant in pairing-based applications, such as hashing to pairing groups, cofactor clearing and subgroup membership testing. We implement all functionalities of the most promising candidates within the RELIC cryptographic toolkit in order to identify the most efficient pairing implementation at 192-bit security and provide extensive experimental results. </p>

2022

TCHES

Highly Vectorized SIKE for AVX-512 Abstract

Hao Cheng Georgios Fotiadis Johann Großschädl Peter Y. A. Ryan

It is generally accepted that a large-scale quantum computer would be capable to break any public-key cryptosystem used today, thereby posing a serious threat to the security of the Internet’s public-key infrastructure. The US National Institute of Standards and Technology (NIST) addresses this threat with an open process for the standardization of quantum-safe key establishment and signature schemes, which is now in the final phase of the evaluation of candidates. SIKE (an abbreviation of Supersingular Isogeny Key Encapsulation) is one of the alternate candidates under evaluation and distinguishes itself from other candidates due to relatively short key lengths and relatively high computing costs. In this paper, we analyze how the latest generation of Intel’s Advanced Vector Extensions (AVX), in particular AVX-512IFMA, can be used to minimize the latency (resp. maximize the hroughput) of the SIKE key encapsulation mechanism when executed on Ice Lake CPUs based on the Sunny Cove microarchitecture. We present various techniques to parallelize and speed up the base/extension field arithmetic, point arithmetic, and isogeny computations performed by SIKE. All these parallel processing techniques are combined in AvxSike, a highly optimized implementation of SIKE using Intel AVX-512IFMA instructions. Our experiments indicate that AvxSike instantiated with the SIKEp503 parameter set is approximately 1.5 times faster than the to-date best AVX-512IFMA-based SIKE software from the literature. When executed on an Intel Core i3-1005G1 CPU, AvxSike outperforms the x64 assembly implementation of SIKE contained in Microsoft’s SIDHv3.4 library by a factor of about 2.5 for key generation and decapsulation, while the encapsulation is even 3.2 times faster.

2021

TCHES

Batching CSIDH Group Actions using AVX-512 📺 Abstract

Hao Cheng Georgios Fotiadis Johann Großschädl Peter Y. A. Ryan Peter B. Rønne

Commutative Supersingular Isogeny Diffie-Hellman (or CSIDH for short) is a recently-proposed post-quantum key establishment scheme that belongs to the family of isogeny-based cryptosystems. The CSIDH protocol is based on the action of an ideal class group on a set of supersingular elliptic curves and comes with some very attractive features, e.g. the ability to serve as a “drop-in” replacement for the standard elliptic curve Diffie-Hellman protocol. Unfortunately, the execution time of CSIDH is prohibitively high for many real-world applications, mainly due to the enormous computational cost of the underlying group action. Consequently, there is a strong demand for optimizations that increase the efficiency of the class group action evaluation, which is not only important for CSIDH, but also for related cryptosystems like the signature schemes CSI-FiSh and SeaSign. In this paper, we explore how the AVX-512 vector extensions (incl. AVX-512F and AVX-512IFMA) can be utilized to optimize constant-time evaluation of the CSIDH-512 class group action with the goal of, respectively, maximizing throughput and minimizing latency. We introduce different approaches for batching group actions and computing them in SIMD fashion on modern Intel processors. In particular, we present a hybrid batching technique that, when combined with optimized (8 × 1)-way prime-field arithmetic, increases the throughput by a factor of 3.64 compared to a state-of-the-art (non-vectorized) x64 implementation. On the other hand, vectorization in a 2-way fashion aimed to reduce latency makes our AVX-512 implementation of the group action evaluation about 1.54 times faster than the state-of-the-art. To the best of our knowledge, this paper is the first to demonstrate the high potential of using vector instructions to increase the throughput (resp. decrease the latency) of constant-time CSIDH.