Hao Cheng

CryptoDB

Hao Cheng

Publications and invited talks

Year

Venue

Title

2025

TCHES

A Code-Based ISE to Protect Boolean Masking in Software Abstract

Qi Tian Hao Cheng Chun Guo Daniel Page Meiqin Wang Weija Wang

Side-Channel Attacks (SCAs) pose a significant threat to data security in embedded environments. To counteract the power-based SCAs, masking is a widely used defense technique, that introduces randomness to obscure the sidechannel information generated during the processing of secret data. However, in practice, some challenges exist when implementing masking schemes. For example, in the implementation of Boolean masking, they may refer to low noise level and implementation flaws. To address the said implementation challenges, we present an effective and efficient solution that incorporates the code-based masking technique: We mask the shares of Boolean masking with code-based masking and then use a selfdesigned Instruction Set Extension (ISE) to perform efficient private computations within this masked domain. Based on a 32-bit RISC-V Ibex core, we develop a prototype implementation of our ISE, whereby it mainly wraps the ALU with three code-based encoders/decoders and integrates a leakage-resilient pseudo-random generator (PRG). Compared to the base core (vanilla Ibex), the hardware overhead of the ISE implementation is only 8%. The security evaluation based on formal verification and practical evaluation demonstrates that our ISE can provide a more robust practical security guarantee. Furthermore, our approach significantly reduces the signal-to-noise ratio (SNR) of each share, decreasing it to just 2% of the original SNR on the base core.

2025

TCHES

Fast AVX-512 Implementation of the Optimal Ate Pairing on BLS12-381 Abstract

Hao Cheng Georgios Fotiadis Johann Großschädl Daniel Page

Non-degenerate bilinear maps on elliptic curves, commonly referred to as pairings, have many applications including short signature schemes, zero-knowledge proofs and remote attestation protocols. Computing a state-of-the-art pairing at the 128-bit security level, such as the optimal ate pairing over the curve BLS12-381, is very costly due to the high complexity of some of its sub-operations: most notable are the Miller loop and final exponentiation. In the past ten years, a few optimized pairing implementations have been introduced in the literature, but none of those took advantage of the vector (SIMD) extensions of state-of-the-art Intel and AMD CPUs, especially AVX-512; this is surprising, because doing so offers the potential to reach significant speed-ups. Consequently, the questions of 1) how computation of the optimal ate pairing can be effectively vectorized, and 2) what execution time such a vectorized implementation can achieve are still open. This paper addresses said questions by introducing a carefully-optimized AVX-512 implementation of the optimal ate pairing on BLS12-381. A central feature of the implementation is the use of 8-way Integer Fused Multiply-Add (IFMA) instructions, which are capable to execute eight 52 x 52-bit multiplications in a SIMD-parallel fashion. We introduce new vectorization strategies and describe optimizations of existing ones to speed up arithmetic operations in the extension fields Fp4 , Fp6 , and Fp12 as well as certain higher-level functions. Furthermore, we discuss some parallelization bottlenecks and how they impact execution time. We benchmarked our pairing software, which we call avxbls, on an Intel Core i3-1005G1 (“Ice Lake”) CPU and found that it needs 1, 265, 314 clock cycles (resp. 1, 195, 236 clock cycles) for the full pairing, with the Granger-Scott cyclotomic squaring (resp. compressed cyclotomic squaring) being used in the final exponentiation. For comparison, the non-vectorized (i.e., scalar) x64 assembly implementation from the widely-used blst library has an execution time of 2, 351, 615 cycles, which is 1.86 times (resp. 1.97 times) slower. avxbls also outperforms Longa’s implementation (CHES 2023) by almost the same factor. The practical importance of these results is amplified by Intel’s recent announcement to support AVX10, which includes IFMA instructions, in all future CPUs.

2024

TCHES

eLIMInate: a Leakage-focused ISE for Masked Implementation Abstract

Hao Cheng Daniel Page Weijia Wang

Even given a state-of-the-art masking scheme, masked software implementation of some cryptography functionality can pose significant challenges stemming, e.g., from simultaneous requirements for efficiency and security. In this paper we design an Instruction Set Extension (ISE) to address a specific element of said challenge, namely the elimination of leakage stemming from architectural and microarchitectural overwriting. Conceptually, the ISE allows a leakage-focused behavioural hint to be communicated from software to the micro-architecture: using it informs how computation is realised when applied to masking-specific data, which then offers an opportunity to eliminate associated leakage. We develop prototype, latencyand area-optimised implementations of the ISE design based on the RISC-V Ibex core. Using them, we demonstrate that use of the ISE can close the gap between assumptions about and actual behaviour of a device and thereby deliver an improved security guarantee.

2022

TCHES

Highly Vectorized SIKE for AVX-512 Abstract

Hao Cheng Georgios Fotiadis Johann Großschädl Peter Y. A. Ryan

It is generally accepted that a large-scale quantum computer would be capable to break any public-key cryptosystem used today, thereby posing a serious threat to the security of the Internet’s public-key infrastructure. The US National Institute of Standards and Technology (NIST) addresses this threat with an open process for the standardization of quantum-safe key establishment and signature schemes, which is now in the final phase of the evaluation of candidates. SIKE (an abbreviation of Supersingular Isogeny Key Encapsulation) is one of the alternate candidates under evaluation and distinguishes itself from other candidates due to relatively short key lengths and relatively high computing costs. In this paper, we analyze how the latest generation of Intel’s Advanced Vector Extensions (AVX), in particular AVX-512IFMA, can be used to minimize the latency (resp. maximize the hroughput) of the SIKE key encapsulation mechanism when executed on Ice Lake CPUs based on the Sunny Cove microarchitecture. We present various techniques to parallelize and speed up the base/extension field arithmetic, point arithmetic, and isogeny computations performed by SIKE. All these parallel processing techniques are combined in AvxSike, a highly optimized implementation of SIKE using Intel AVX-512IFMA instructions. Our experiments indicate that AvxSike instantiated with the SIKEp503 parameter set is approximately 1.5 times faster than the to-date best AVX-512IFMA-based SIKE software from the literature. When executed on an Intel Core i3-1005G1 CPU, AvxSike outperforms the x64 assembly implementation of SIKE contained in Microsoft’s SIDHv3.4 library by a factor of about 2.5 for key generation and decapsulation, while the encapsulation is even 3.2 times faster.

2022

TCHES

RISC-V Instruction Set Extensions for Lightweight Symmetric Cryptography Abstract

Hao Cheng Johann Großschädl Ben Marshall Dan Page Thinh Pham

The NIST LightWeight Cryptography (LWC) selection process aims to standardise cryptographic functionality which is suitable for resource-constrained devices. Since the outcome is likely to have significant, long-lived impact, careful evaluation of each submission with respect to metrics explicitly outlined in the call is imperative. Beyond the robustness of submissions against cryptanalytic attack, metrics related to their implementation (e.g., execution latency and memory footprint) form an important example. Aiming to provide evidence allowing richer evaluation with respect to such metrics, this paper presents the design, implementation, and evaluation of one separate Instruction Set Extension (ISE) for each of the 10 LWC final round submissions, namely Ascon, Elephant, GIFT-COFB, Grain-128AEADv2, ISAP, PHOTON-Beetle, Romulus, Sparkle, TinyJAMBU, and Xoodyak; although we base the work on use of RISC-V, we argue that it provides more general insight.

2021

TCHES

Batching CSIDH Group Actions using AVX-512 📺 Abstract

Hao Cheng Georgios Fotiadis Johann Großschädl Peter Y. A. Ryan Peter B. Rønne

Commutative Supersingular Isogeny Diffie-Hellman (or CSIDH for short) is a recently-proposed post-quantum key establishment scheme that belongs to the family of isogeny-based cryptosystems. The CSIDH protocol is based on the action of an ideal class group on a set of supersingular elliptic curves and comes with some very attractive features, e.g. the ability to serve as a “drop-in” replacement for the standard elliptic curve Diffie-Hellman protocol. Unfortunately, the execution time of CSIDH is prohibitively high for many real-world applications, mainly due to the enormous computational cost of the underlying group action. Consequently, there is a strong demand for optimizations that increase the efficiency of the class group action evaluation, which is not only important for CSIDH, but also for related cryptosystems like the signature schemes CSI-FiSh and SeaSign. In this paper, we explore how the AVX-512 vector extensions (incl. AVX-512F and AVX-512IFMA) can be utilized to optimize constant-time evaluation of the CSIDH-512 class group action with the goal of, respectively, maximizing throughput and minimizing latency. We introduce different approaches for batching group actions and computing them in SIMD fashion on modern Intel processors. In particular, we present a hybrid batching technique that, when combined with optimized (8 × 1)-way prime-field arithmetic, increases the throughput by a factor of 3.64 compared to a state-of-the-art (non-vectorized) x64 implementation. On the other hand, vectorization in a 2-way fashion aimed to reduce latency makes our AVX-512 implementation of the group action evaluation about 1.54 times faster than the state-of-the-art. To the best of our knowledge, this paper is the first to demonstrate the high potential of using vector instructions to increase the throughput (resp. decrease the latency) of constant-time CSIDH.