International Association for Cryptologic Research

International Association
for Cryptologic Research

CryptoDB

Papers from Transactions on Cryptographic Hardware and Embedded Systems 2025

Year
Venue
Title
2025
TCHES
A Code-Based ISE to Protect Boolean Masking in Software
Side-Channel Attacks (SCAs) pose a significant threat to data security in embedded environments. To counteract the power-based SCAs, masking is a widely used defense technique, that introduces randomness to obscure the sidechannel information generated during the processing of secret data. However, in practice, some challenges exist when implementing masking schemes. For example, in the implementation of Boolean masking, they may refer to low noise level and implementation flaws. To address the said implementation challenges, we present an effective and efficient solution that incorporates the code-based masking technique: We mask the shares of Boolean masking with code-based masking and then use a selfdesigned Instruction Set Extension (ISE) to perform efficient private computations within this masked domain. Based on a 32-bit RISC-V Ibex core, we develop a prototype implementation of our ISE, whereby it mainly wraps the ALU with three code-based encoders/decoders and integrates a leakage-resilient pseudo-random generator (PRG). Compared to the base core (vanilla Ibex), the hardware overhead of the ISE implementation is only 8%. The security evaluation based on formal verification and practical evaluation demonstrates that our ISE can provide a more robust practical security guarantee. Furthermore, our approach significantly reduces the signal-to-noise ratio (SNR) of each share, decreasing it to just 2% of the original SNR on the base core.
2025
TCHES
A TRAP for SAT: On the Imperviousness of a Transistor-Level Programmable Fabric to Satisfiability-Based Attacks
Locking-based intellectual property (IP) protection for integrated circuits (ICs) being manufactured at untrusted facilities has been largely defeated by the satisfiability (SAT) attack, which can retrieve the secret key needed for instantiating proprietary functionality on locked circuits. As a result, redaction-based methods have gained popularity as a more secure way of protecting hardware IP. Among these methods, transistor-level programming (TRAP) prohibits the outright use of SAT attacks due to the mismatch between the logic-level at which SAT attack operates and the switch-level at which the TRAP fabric is programmed. Herein, we discuss the challenges involved in launching SAT attacks on TRAP and we propose solutions which enable expression of TRAP in propositional logic modeling in a way that accurately reflects switch-level circuit capabilities. Results obtained using a transistor-level SAT attack tool-set that we developed and are releasing corroborate that SAT attacks can be launched against TRAP. However, the increased complexity of switch-level circuit modeling prevents the attack from realistically compromising all but the most trivial IP-protected designs.
2025
TCHES
A5/3 make or break: A massively parallel FPGA architecture for exhaustive key search
In this paper, we have designed and implemented a massively parallel FPGA architecture for exhaustive key search on the A5/3 encryption algorithm. A5/3 is based on KASUMI, it has an effective key of 64 bits, and it is used in GSM (2G) mobile telephony systems. Despite the widespread adoption of more advanced technologies (4G, 5G), 2G networks remain as fallback options. In our novel hardware architecture, we use an AMD-Xilinx Alveo U250 card, with its FPGA configured to operate with 104 cores clocked at 496.7 M Hz, that can evaluate 235.59 keys/sec. Our results show that the $1 million attack can be achieved with 128 Alveo U250 cards, on average, in 16 days.
2025
TCHES
Accelerating EdDSA Signature Verification with Faster Scalar Size Halving
This paper establishes that the extended Euclidean algorithm (EEA) implemented in a division-free manner is faster than the Lagrange algorithm with a similar level of optimization when it comes to halving the size of scalars found in the equations of elliptic curve signature verification. Our implementation results show that our EEA based method achieves roughly 4x speed-up for generating half- size scalars used in EdDSA. For the first time ever, EEA generated half-size scalars are used for verification of individual Ed25519 signatures yielding timing results that outperform ed25519-donna, a highly optimized open source implementation, by 16.12%. We also propose a new randomization method applied with half-size scalars to batch verification of Ed25519 signatures for which we report speed-ups compared to the well-known Bernstein et al. method for batch sizes larger than six, specifically, our method achieves 11.60% improvement for batch size 64.
2025
TCHES
Adaptive Template Attacks on the Kyber Binomial Sampler
Template attacks build a Gaussian multivariate model of the side-channel leakage signal generated by each value of a targeted intermediate variable. Combined with additional steps, such as dimensionality reduction, such models can help to infer a value with nearly 100% accuracy from just a single attack trace. We demonstrate this here by reconstructing the output of the binomial sampler of a Cortex-M4 imple- mentation of the Kyber768 post-quantum key-encapsulation mechanism. However, this performance is usually significantly diminished if the device, or even just the ad- dress space, used for profiling differs from the attacked one. Here we introduce a new technique for adapting templates generated from profiling devices in order to attack another device where we are also able to record many traces, but without knowledge of the random value held by the targeted variable. We interpret the model from the profiling devices as a Gaussian mixture and use the Expectation–Maximization (EM) algorithm to adapt its means and covariances to better match the unlabelled leakage distribution observed from the attacked setting. The Kyber binomial sampler turned out to be a particularly suitable target, for two reasons. Firstly, it generates a long sequence of values drawn from a small set, limiting the number of Gaussian components that need to be adjusted. Secondly, the length of this sequence requires particularly well-adapted templates to achieve a high key-recovery success rate from a single trace. We also introduce an extended point-of-interest selection method to improve linear discriminant analysis (LDA).
2025
TCHES
AETHER: An Ultra-High Throughput and Low Energy Authenticated Encryption Scheme
In this paper, we introduce AETHER, an authenticated encryption scheme that achieves ultra-high throughput and low energy consumption, supporting a 256- bit key and a 128-bit tag. While inspired by an AEGIS-like structure, AETHER stands out with a completely redesigned round-update function. We replace the AES round function with a new inner function optimized for ultra-low latency and energy consumption. This function incorporates Orthros’s S-box and a 16x16 binary matrix from Akleylek et al., leading to a 1.56 times reduction in energy consumption and a 1.25 times reduction in delay compared to the AES round function. To further optimize hardware performance, we design the general construction of the roundupdate function to be more hardware-friendly, allowing parallel execution of the inner function on all 128-bit words, thereby enhancing both throughput and security against collision-based forgery attacks. AETHER achieves a throughput of 2.1 Tbit/s and an energy consumption of only 204.31 nJ, in the Nangate 15 nm standard cell library and a throughput of 5.23 Tbit/s and energy consumption of 1.83 nJ using the CNFET-OCL 5nm library, outperforming all existing AEADs.
2025
TCHES
Algebraic Linear Analysis for Number Theoretic Transform in Lattice-Based Cryptography
The topic of verifying postquantum cryptographic software has never been more pressing than today between the new NIST postquantum cryptosystem standards being finalized and various countries issuing directives to switch to postquantum or at least hybrid cryptography in a decade. One critical issue in verifying lattice-based cryptographic software is range-checking in the finite-field arithmetic assembly code which occurs frequently in highly optimized cryptographic software. For the most part these have been handled by Satisfiability Modulo Theory (SMT) but so far they mostly are restricted to Montgomery arithmetic and 16-bit precision. We add semi-automatic range-check reasoning capability to the CryptoLine toolkit via the Integer Set Library (wrapped via the python package islpy) which makes it easier and faster to verify more arithmetic crypto code, including Barrett and Plantard finite-field arithmetic, and show experimentally that this is viable on production code.
2025
TCHES
All You Need is XOR-Convolution: A Generalized Higher-Order Side-Channel Attack with Application to XEX/XE-based Encryptions
The XEX/XE scheme has been widely used to realize authenticated encryptions (AEs), message authentication codes (MACs), and storage encryptions, such as OCB, PMAC, and XTS. Although these schemes have been extensively deployed in the real world, limited studies have evaluated side-channel attacks (SCAs) on them. In this study, we propose an efficient SCA that can be applied to the XEX/XE scheme. Despite the fact that the offset generated in these modes is guaranteed to have no full offset collision with an overwhelming probability, we analyze their offset-generating routines to exploit the partial offset collisions. Then, we propose a new profiled SCA named XOR-convoluting collision analysis (XCCA), which estimates the sum of keys from two leakages by XOR-convoluting probability distributions that model the leakages. The proposed collision SCA effectively erases the effect of random offsets by using XOR-convolution, whereas conventional collision SCAs are ineffective in this scenario. We validated the proposed SCA through simulations and experimental attacks using real traces. The results confirmed that the proposed SCA reduces the number of traces by up to 90% to achieve a success rate identical to that of a state-of-the-art SCA on OCB in TCHES 2022. Furthermore, we show that the proposed SCA distinguisher (XCCA distinguisher) is a generalization of higher-order SCAs, including non-collision SCAs on masked implementations. The profiled higher-order SCAs on masked implementations can be written in the form of an XCCA distinguisher using XOR-convolution with the new concept of leaking and target selection functions. The generalized representation clarifies how and why a higher-order SCA has better or worse performance from the theoretical viewpoint of noise amplification, which is also demonstrated through experiments and a spectrum analysis based on Walsh–Hadamard transform (WHT). Our analysis reveals that the random offsets of XEX/XE would work as masking from an SCA perspective, and XEX/XE-based encryption would have an inherent first-order SCA resilience under certain conditions.
2025
TCHES
All-You-Can-Compute: Packed Secret Sharing for Combined Resilience
Unprotected cryptographic implementations are vulnerable to implementation attacks, such as passive side-channel attacks and active fault injection attacks. Recently, countermeasures like polynomial masking and duplicated masking have been introduced to protect implementations against combined attacks that exploit leakage and faults simultaneously. While duplicated masking requires O (t · e) shares to resist an adversary capable of probing t values and faulting e values, polynomial masking requires only O (t · e) shares, which is particularly beneficial for affine computation. At CHES’24, Arnold et al. showed how to further improve the efficiency of polynomial masking in the presence of combined attacks by embedding two secrets into one polynomial sharing. This essentially reduces the complexity of previous constructions by half. The authors also observed that using techniques from packed secret sharing (Grosso et al., CHES’13) cannot easily achieve combined resilience to encode an arbitrary number of secrets in one polynomial encoding. In this work, we resolve these challenges and show that it is possible to embed an arbitrary number of secrets in one encoding and propose gadgets that are secure against combined attacks. We present two constructions that are generic and significantly improve the computational and randomness complexity of existing compilers, such as the laOla compiler presented by Berndt et al. at CRYPTO’23 and its improvement by Arnold et al. For example, for an AES evaluation that protects against t probes and e faults, we improve the randomness complexity of the state-of-the-art construction when t + e > 3, leading to an improvement of up to a factor of 2.41.
2025
TCHES
Chameleon: A Dataset for Segmenting and Attacking Obfuscated Power Traces in Side-Channel Analysis
Side-channel attacks exploit unintended information leakage emitted by cryptographic devices to extract sensitive data. Hiding techniques are a cost-effective countermeasure designed to obfuscate the side-channel leakage and hinder these attacks. Available open datasets rely on artificial models to simulate hiding effects, preventing a realistic assessment of these countermeasures and, thus, leaving a pressing need for datasets offering real-world, obfuscated side-channel measurements. Chameleon introduces the first comprehensive dataset of real-world, obfuscated power traces collected from a RISC-V-based System-on-Chip. The traces are obfuscated using four state-of-the-art hiding techniques: dynamic frequency scaling, random delay, morphing, and chaffing. Chameleon captures real leakage deformations introduced by actual hardware implementations, making it a realistic and valuable tool for evaluating side-channel countermeasures. A key feature of Chameleon is its dual focus on the segmentation and attack stages of the side-channel analysis process. It is the first dataset designed to facilitate the challenging task of segmenting cryptographic operations from obfuscated traces, offering precise metadata that pinpoints the start and end of each operation. The high-quality metadata enables systematic research into segmentation techniques, a critical step often overlooked in previous datasets. Chameleon provides an essential platform for researchers to develop and test new side-channel attacks, highlighting the vulnerabilities of current hiding techniques. By offering a more realistic assessment of countermeasure effectiveness, Chameleon is an invaluable tool for advancing the state-of-the-art in the side-channel evaluation.
2025
TCHES
CHERI-Crypt: Transparent Memory Encryption on Capability Architectures
Capability architectures such as CHERI (Capability Hardware Enhanced RISC Instructions) are an emerging technology designed to provide memory safety protection at the hardware level and are equipped to eradicate approximately 70% of the current software vulnerability attack surface. CHERI is an instruction set architecture extension and has been applied to a small number of processors, including various versions of RISC-V. One of the benefits of CHERI is that it inherently provides segregation or compartmentalisation of software, making it suitable for supporting other types of applications such as Trusted Execution Environments, where sensitive data and computation is conducted inside a secure enclave, away from the rest of the untrusted operating system and services. To prevent untrusted software from accessing these compartments or secure regions of memory CHERI uses the mechanism of sealed capabilities. Trusted execution environments however, have been proven vulnerable to not just software-based attacks, but hardware attacks as well. In this paper we present our CHERI-Crypt design, an encryption engine extension to a CHERI-RISC-V 32-bit processor, for transparent memory encryption of sealed CHERI capabilities to additionally protect sensitive data in memory against physical hardware attacks. We show that our CHERI-Crypt design can run an enclave test program within an encrypted CHERI seal and invoke process, requiring 626 additional clock cycles with a batch size of 32 bytes. Adding CHERI-Crypt reduces the maximum frequency of the base CPU by only 6 MHz, and requires approximately 3.5x more flip flops and LUTs.
2025
TCHES
Code-based Masking: From Fields to Bits Bitsliced Higher-Order Masked SKINNY
Masking is one of the most prevalent and investigated countermeasures against side-channel analysis. As an alternative to the simple (e.g., additive) encoding function of Boolean masking, a collection of more algebraically complex masking types has emerged. Recently, inner product masking and the more generic codebased masking have proven to enable higher theoretical security properties than Boolean masking. In CARDIS 2017, Poussier et al. connected this “security order amplification” effect to the bit-probing model, demonstrating that for the same shared size, sharings from more complex encoding functions exhibit greater resistance to higher-order attacks. Despite these advantages, masked gadgets designed for code-based implementations face significant overhead compared to Boolean masking. Furthermore, existing code-based masked gadgets are not designed for efficient bitslice representation, which is highly beneficial for software implementations. Thus, current code-based masked gadgets are constrained to operate over words (e.g., elements in F2k ), limiting their applicability to ciphers where the S-box can be efficiently computed via power functions, such as AES. In this paper, we address the aforementioned limitations. We first introduce foundational masked linear and non-linear circuits that operate over bits of code-based sharings, ensuring composability and preserving bit-probing security, specifically achieving t-Probe Isolating Non-Interference (t-PINI). Utilizing these circuits, we construct masked ciphers that operate over bits, preserving the security order amplification effect during computation. Additionally, we present an optimized bitsliced masked assembly implementation of the SKINNY cipher, which outperforms Boolean masking in terms of randomness and gate count. The third-order security of this implementation is formally proven and validated through practical side-channel leakage evaluations on a Cortex-M4 core, confirming its robustness against leakages up to one million traces.
2025
TCHES
Constant time lattice reduction in dimension 4 with application to SQIsign
In this paper we propose a constant time lattice reduction algorithm for integral dimension-4 lattices. Motivated by its application in the SQIsign postquantum signature scheme, we provide for the first time a constant time LLLlike algorithm with guarantees on the length of the shortest output vector. We implemented our algorithm and ensured through various tools that it indeed operates in constant time. Our experiments suggest that in practice our implementation outputs a Minkowski reduced basis and thus can replace a non constant time lattice reduction subroutine in SQIsign.
2025
TCHES
Cymric: Short-tailed but Mighty: Beyond-birthday-bound Secure Authenticated Encryption for Short Inputs
Authenticated encryption (AE) is a fundamental tool in today’s secure communication. Numerous designs have been proposed, including well-known standards such as GCM. While their performance for long inputs is excellent, that for short inputs is often problematic due to high overhead in computation, showing a gap between the real need for IoT-like protocols where packets are often very short. Existing dedicated short-input AEs are very scarce, the classical Encode-then-encipher (Bellare and Rogaway, Asiacrypt 2000) and Manx (Adomnicăi et al., CT-RSA 2023), using up to two block cipher calls. They have superior performance for (very) short inputs, however, security is up to n/2 bits, where n is the block size of the underlying block cipher. This paper proposes a new family of short-input AEs, dubbed Cymric, which ensure beyond-birthday-bound (BBB) security. It supports a wider range of input space than EtE and Manx with the help of one additional block cipher call (thus three calls). In terms of the number of block cipher calls, Cymric is the known minimum construction of BBB-secure AEs, and we also prove this is indeed minimal by presenting an impossibility result on BBB-secure AE with two calls. Finally, we show a comprehensive benchmark on microcontrollers to show performance advantage over existing schemes.
2025
TCHES
dCTIDH: Fast & Deterministic CTIDH
This paper presents dCTIDH, a CSIDH implementation that combines two recent developments into a novel state-of-the-art deterministic implementation. We combine the approach of deterministic variants of CSIDH with the batching strategy of CTIDH, which shows that the full potential of this key space has not yet been explored. This high-level adjustment in itself leads to a significant speed-up. To achieve an effective deterministic evaluation in constant time, we introduce WOMBats, a new approach to performing isogenies in batches, specifically tailored to the behavior required for deterministic CSIDH using CTIDH batching. Furthermore, we explore the two-dimensional space of optimal primes for dCTIDH, with regard to both the performance of dCTIDH in terms of finite-field operations per prime and the efficiency of finite-field operations, determined by the prime shape, in terms of cycles. This allows us to optimize both for choice of prime and scheme parameters simultaneously. Lastly, we implement and benchmark constant-time, deterministic dCTIDH. Our results show that dCTIDH not only outperforms state-of-the-art deterministic CSIDH, but even non-deterministic CTIDH: dCTIDH-2048 is faster than CTIDH-2048 by 17%, and is almost five times faster than dCSIDH-2048.
2025
TCHES
Design and Implementation of a Physically Secure Open-Source FPGA and Toolchain
The increasing prevalence of security breaches highlights the importanceof robust hardware security measures. Among these breaches, physical attacks– such as Side-Channel Analysis ( SCA) and Fault Injection (FI ) attacks – posea significant challenge for security-sensitive applications. To ensure robust systemsecurity throughout its lifecycle, hardware security updates are indispensable alongsidesoftware security patches. Programmable hardware plays a pivotal role in establishinga robust hardware root-of-trust, serving to effectively mitigate various hardwaresecurity threats. In this paper, we propose a methodology for the design of areconfigurable fabric and the corresponding mapping toolchain, specifically tailoredto hardware security. This approach offers resistance to various malicious physicalattacks, including SCA and FI , addressing each threat individually. As a case study,we propose a resulting fabric that implements a combination of first-order BooleanMasking and hiding countermeasures to provide strong protection against SCA attacksand enables the detection of fault injection attempts. In particular, we present howreconfigurable secure gadgets can be realized employing a reformed variant of theLUT-based Masked Dual-Rail with Pre-charge Logic (LMDPL) hardware maskingscheme and a modified version of Wave Dynamic Differential Logic ( WDDL) tobe composed into a fabric. We also show how any basic Hardware DescriptionLanguage ( HDL) design is automatically mapped to the primitives of our fabric,embedding provable hardware security, and bypassing the necessity for hardwaresecurity proficiency in this process. It is worth mentioning that our fabric requiresapproximately 85% less area to map a secure design compared to conventional FieldProgrammable Gate Arrays ( FPGAs). A practical security evaluation of our securefabric implementation on a real FPGA target board, using Test Vector LeakageAssessment (TVLA), demonstrated no SCA leakage over 100 million traces.
2025
TCHES
Designing a General-Purpose 8-bit (T)FHE Processor Abstraction
Making the most of TFHE programmable bootstrapping to evaluate functions or operators otherwise challenging to perform with only the native addition and multiplication of the scheme is a very active line of research. In this paper, we systematize this approach and apply it to build an 8-bit FHE processor abstraction, i.e., a software entity that works over FHE-encrypted 8-bit data and presents itself to the programmer by means of a conventional-looking assembly instruction set. In doing so, we provide several homomorphic LookUp Table (LUT) dereferencing operators based on variants of the tree-based method and show that they are the most efficient option for manipulating encryptions of 8-bit data (optimally represented as two basis 16 digits). We then systematically apply this approach over a set of around 50 instructions, including, notably, conditional assignments, divisions, or fixed-point arithmetic operations. We further test the approach on several simple algorithms, including the execution of a neuron with a sigmoid activation function with 16-bit precision. We conclude the paper by comparing our work to the FHE compilers available in the state of the art. Finally, this work reveals that a very limited set of functional bootstrapping patterns is versatile and efficient enough to achieve general-purpose FHE computations beyond the boolean circuit approach. As such, these patterns may be an appropriate target for further works on advanced software optimizations or hardware implementations.
2025
TCHES
FINAL bootstrap acceleration on FPGA using DSP-free constant-multiplier NTTs
This work showcases Quatorze-bis, a state-of-the-art Number Theoretic Transform circuit for TFHE-like cryptosystems on FPGAs. It contains a novel modular multiplication design for modular multiplication with a constant for a constant modulus. This modular multiplication design does not require any DSP units or any dedicated multiplier unit, nor does it require extra logic when compared to the state-of-the-art modular multipliers. Furthermore, we present an implementation of a constant multiplier Number Theoretic Transform design for TFHE-like schemes. Lastly, we use this Number Theoretic Transform design to implement a FINAL hardware accelerator for the AMD Alveo U55c which improves the Throughput metric of TFHE-like cryptosystems on FPGAs by a factor 9.28x over Li et al.’s NFP CHES 2024 accelerator and by 10-25% over the absolute state-of-the-art design FPT [vBDTV23] while using one third of FPTs DSPs.
2025
TCHES
Higher-Order Time Sharing Masking
At CHES 2024, Time Sharing Masking (TSM) was introduced as a novel low-latency masking technique for hardware circuits. TSM offers area and randomness efficiency, as well as glitch-extended PINI security, but it is limited to first-order security. We address this limitation and generalize TSM to higher-order security while maintaining all of TSM’s advantages. Additionally, we propose an area-latency tradeoff. We prove HO-TSM glitch-extended PINI security and successfully evaluate our circuits using formal verification tools. Furthermore, we demonstrate area- and latency-efficient implementations of the AES S-box, which do not exhibit leakage in TVLA on FPGA. Our proposed tradeoff enables a first-order secure implementation of a complete AES-128 encryption core with 92 kGE, 920 random bits per round, and 20 cycles of latency, which does not exhibit leakage in TVLA on FPGA.
2025
TCHES
HIPR: Hardware IP Protection through Low-Overhead Fine-Grain Redaction
Hardware intellectual property (IP) blocks have been subjected to various forms of confidentiality and integrity attacks in recent years due to the globalization of the semiconductor industry. System-on-chip (SoC) designers are now considering a zero-trust model for security, where an IP can be attacked at any stage of the manufacturing process for piracy, cloning, overproduction, or malicious alterations. Hardware redaction has emerged as a promising countermeasure to thwart confidentiality and integrity attacks by untrusted entities in the globally distributed supply chain. However, existing redaction techniques provide this security at high overhead costs, making them unsuitable for real-world implementation. In this paper, we propose HIPR, a fine-grain redaction methodology that is robust, scalable, and incurs significantly lower overhead compared to existing redaction techniques. HIPR redacts security-critical Boolean and sequential logic from the hardware design, performs interconnect randomization, and employs multiple overhead optimization steps to reduce overhead costs. We evaluate HIPR on open-source benchmarks and reduce area overheads by 1 to 2 orders of magnitude compared to state-of-the-art redaction techniques without compromising security. We also demonstrate that the redaction performed by HIPR is resilient against conventional functional and structural attacks on hardware IPs. The redacted test IPs used to evaluate HIPR are available at: https://github.com/UF-Nelms-IoT-Git-Projects/HIPR.
2025
TCHES
HRaccoon: A High-performance Configurable SCA Resilient Raccoon Hardware Accelerator
The lattice-based Raccoon scheme is one of the candidates in Round 1 of the National Institute of Standards and Technology (NIST) post-quantum cryptography (PQC) additional digital signatures standardization process. As a scheme with built-in masking features, Raccoon is also a viable candidate for NIST’s Masking Circuit and Threshold Cryptography project. Current Raccoon implementations are limited to software or software-hardware co-designs only and consequently lacking in terms of high throughput performance that hardware implementations can generally promise. To achieve this, we are the first to propose a configurable and high-performance pure hardware architecture for Raccoon. The proposed FPGA architecture features extensive optimizations in key modules for Raccoon such as the modular reduction, polynomial operations, and sampling. The segmentation and loop-based scheduling scheme interacts with the defined BRAM-based memory access pattern to ensure efficient and coherent data flow under the three security levels and two masking modes (non- and first-order masking). Implementation results of Raccoon on an AMD Artix- 7 FPGA device show that our proposed architecture achieves a 1.4–2.1x speedup compared to software implementations and a 20–42x speedup compared to softwarehardware co-designs for the three security levels, despite its hardware area being comparable to that of the lightweight CRYSTALS-Dilithium architecture. Finally, a TVLA test is demonstrated on Raccoon-128 with non-masking and first-order masking to evaluate its resilience to side-channel attacks.
2025
TCHES
Improving MPCitH with Preprocessing: Mask Is All You Need
The MPC-in-the-head with preprocessing (MPCitH-PP) paradigm presents a novel approach for constructing post-quantum digital signatures like Picnic3. This paper revisits the MPCitH-PP construction, analyzing both its offline and online phases and proposing a reformulation of the protocol. By identifying redundant computations in these phases, we optimize them into a single phase, thereby enhancing the efficiency of MPCitH-PP. Furthermore, we explore the independence of the mask, demonstrating that it can be calculated in parallel, which also enables the optimization of the masked witness calculation.Our optimized implementation of Picnic3 shows significant improvements. At the L1 security level, the optimal software implementation reduces MPCitH-PP calculation time to about 30% of the previous implementation. The optimal signature implementation costs about 78% of the previous implementation time. At the L5 security level, MPCitH-PP with parallelism optimal is reduced to about 26% of the previous solution’s time, and the optimal signature implementation runs at about 53% of the previous solution’s time. For the hardware implementation, our optimizations reduce the clock cycles of MPCitH-PP from r sequential rounds to a single parallel round, where r denotes the number of rounds in the LowMC algorithm, with little change in hardware usage, and perform better in AT product, especially for parallel computing.
2025
TCHES
Information Theoretic Analysis of PUF-Based Tamper Protection
PUFs enable physical tamper protection for high-assurance devices without needing a continuous power supply that is active over the entire lifetime of the device. Several methods for PUF-based tamper protection have been proposed together with practical quantization and error correction schemes. In this work we take a step back from the implementation to analyze theoretical properties and limits. We apply zero leakage output quantization to existing quantization schemes and minimize the reconstruction error probability under zero leakage. We apply wiretap coding within a helper data algorithm to enable a reliable key reconstruction for the legitimate user while guaranteeing a selectable reconstruction complexity for an attacker, analogously to the security level for a cryptographic algorithm for the attacker models considered in this work. We present lower bounds on the achievable key rates depending on the attacker’s capabilities in the asymptotic and finite blocklength regime to give fundamental security guarantees even if the attacker gets partial information about the PUF response and the helper data. Furthermore, we present converse bounds on the number of PUF cells. Our results show for example that for a practical scenario one needs at least 459 PUF cells using 3 bit quantization to achieve a security level of 128 bit.
2025
TCHES
KeyVisor – A Lightweight ISA Extension for Protected Key Handles with CPU-enforced Usage Policies
The confidentiality of cryptographic keys is essential for the security of protection schemes used for communication, file encryption, and outsourced compu- tation. Beyond cryptanalytic attacks, adversaries can steal keys from memory via software exploits or side channels, enabling them to, e.g., manipulate confidential information or impersonate key owners. Therefore, existing defenses protect keys in dedicated devices or isolated memory, or store them only in encrypted form. However, these designs often provide unfavorable tradeoffs, sacrificing performance, fine-grained access control, or deployability.In this paper, we present KeyVisor, a lightweight Instruction Set Architecture ( ISA) extension that securely offloads the handling of symmetric crypto keys to the CPU. KeyVisor provides CPU instructions that enable applications to request protected key handles and perform AEAD cipher operations on them. The underlying keys are accessible only by KeyVisor, and thus never leak to memory. KeyVisor’s direct CPU integration enables fast crypto operations and hardware-enforced key usage restrictions, e.g., keys usable only for de-/encryption, with a limited lifetime, or with a process binding. Furthermore, privileged software, e.g., the monitor firmware of TEEs, can revoke keys or bind them to a specific process/TEE. We implement KeyVisor for RISC-V based on RocketChip, evaluate its performance, and demonstrate real-world use cases, including key-value databases, automotive feature licensing, and a read-only network middlebox.
2025
TCHES
KyberSlash: Exploiting secret-dependent division timings in Kyber implementations
This paper presents KyberSlash1 and KyberSlash2 – two timing vulnerabilities in several implementations (including the official reference code) of the Kyber Post-Quantum Key Encapsulation Mechanism, recently standardized as ML-KEM. We demonstrate the exploitability of both KyberSlash1 and KyberSlash2 on two popular platforms: the Raspberry Pi 2 (Arm Cortex-A7) and the Arm Cortex-M4 microprocessor. Kyber secret keys are reliably recovered within minutes for KyberSlash2 and a few hours for KyberSlash1. We responsibly disclosed these vulnerabilities to maintainers of various libraries and they have swiftly been patched. We present two approaches for detecting and avoiding similar vulnerabilities. First, we patch the dynamic analysis tool Valgrind to allow detection of variable-time instructions operating on secret data, and apply it to more than 1000 implementations of cryptographic primitives in SUPERCOP. We report multiple findings. Second, we propose a more rigid approach to guarantee the absence of variable-time instructions in cryptographic software using formal methods.
2025
TCHES
Leading Degree: A Metric for Model Performance Evaluation and Hyperparameter Tuning in Deep Learning-Based Side-Channel Analysis
Side-channel analysis benefits a lot from deep learning techniques, which assist attackers in recovering the secret key with fewer attack traces than before, but it remains a problem to precisely measure deep learning model performance, so as to obtain a high-performance model. Commonly used evaluation metrics for deep learning like accuracy and precision cannot well meet the demand due to their deviation in side-channel analysis, and classical evaluation metrics for side-channel analysis like guessing entropy, success rate and TGE1 are not generic because they effectively evaluate model performance in only one of the two situations: whether models manage to recover the secret key with given attack traces or not, and not efficient because they need to be performed multiple times to counteract randomness. To attain an effective generic side-channel evaluation metric, we investigate the deterministic component of power consumption, find that the elements of score vector under a key follow a linearly transformed chi-square distribution approximately, and some wrong key hypotheses usually with top scores provide great assistance in model performance evaluation, and finally we propose a new metric called Leading Degree (LD) as well as its simplified version LD-simplified for measuring model performance, which offers similar accuracy but much better generality and efficiency compared with the classical side-channel benchmark metric TGE1, and offers similar generality and efficiency but significantly better accuracy compared with recently proposed sidechannel metrics like Label Correlation and Cross Entropy Ratio. LD/LD-simplified can be easily deployed in early stopping to avoid overfitting phenomena, and we build a bridge between LD/LD-simplified and TGE1, by observing an exponential relationship, which significantly shortens the estimating time for TGE1. At last, we apply LD as a reward function to better solve the reward function design problem in reinforcement learning-based model hyperparameter tuning of side-channel analysis, and obtain better CNN model architectures compared with the state-of-the-art models obtained by previous hyperparameter tuning methods.
2025
TCHES
Leaky McEliece: Secret Key Recovery From Highly Erroneous Side-Channel Information
The McEliece cryptosystem is a strong contender for post-quantum schemes, including key encapsulation for confidentiality of key exchanges in network protocols. A McEliece secret key is a structured parity check matrix that is transformed via Gaussian elimination into an unstructured public key. We show that this transformation is highly critical with respect to side-channel leakage. We assume leakage of the elementary row operations during Gaussian elimination, motivated by McEliece implementations in the cryptographic libraries Classic McEliece and Botan.We propose a novel decoding algorithm to reconstruct a secret key from its public key with information from a Gaussian transformation leak. Even if the obtained side-channel leakage is extremely noisy, i.e., each bit is flipped with probability as high as r ≈ 0.4, we succeed to recover the secret key in a matter of minutes for all proposed (Classic) McEliece instantiations. Remarkably, for high-security McEliece parameters, our attack is more powerful in the sense that it can tolerate even larger r . We demonstrate our attack on the constant-time reference implementation of Classic McEliece in a single-trace setting, using an STM32L592 ARM processor.Our result stresses the necessity of properly protecting highly structured code-based schemes such as McEliece against side-channel leakage.
2025
TCHES
Let’s DOIT: Using Intel’s Extended HW/SW Contract for Secure Compilation of Crypto Code
It is a widely accepted standard practice to implement cryptographic software so that secret inputs do not influence the cycle count. Software following this paradigm is often referred to as “constant-time” software and typically involves following three rules: 1) never branch on a secret-dependent condition, 2) never access memory at a secret-dependent location, and 3) avoid variable-time arithmetic operations on secret data. The third rule requires knowledge about such variable-time arithmetic instructions, or vice versa, which operations are safe to use on secret inputs. For a long time, this knowledge was based on either documentation or microbenchmarks, but critically, there were never any guarantees for future microarchitectures. This changed with the introduction of the data-operand-independent-timing (DOIT) mode on Intel CPUs and, to some extent, the data-independent-timing (DIT) mode on Arm CPUs. Both Intel and Arm document a subset of their respective instruction sets that are intended to leak no information about their inputs through timing, even on future microarchitectures if the CPU is set to run in a dedicated DOIT (or DIT) mode.In this paper, we present a principled solution that leverages DOIT to enable cryptographic software that is future-proof constant-time, in the sense that it ensures that only instructions from the DOIT subset are used to operate on secret data, even during speculative execution after a mispredicted branch or function return location. For this solution, we build on top of existing security type systems in the Jasmin framework for high-assurance cryptography.We then use our solution to evaluate the extent to which existing cryptographic software built to be “constant-time” is already secure in this stricter paradigm implied by DOIT and what the performance impact is to move from constant-time to future-proof constant-time.
2025
TCHES
MulLeak: Exploiting Multiply Instruction Leakage to Attack the Stack-optimized Kyber Implementation on Cortex-M4
CRYSTALS-Kyber, one of the NIST PQC standardization schemes, has garnered considerable attention from researchers in recent years for its side-channel security. Various targets have been explored in previous studies; however, research on extracting secret information from stack-optimized implementations targeting the Cortex-M4 remains scarce, primarily due to the lack of memory access operations, which increases the difficulty of attacks.This paper shifts the focus to the leakage of multiply instructions and present a novel cycle-level regression-based leakage model for the following attacks. We target the polynomial multiplications in decryption process of the stack-optimized implementation targeting the Cortex-M4, and propose two regression-based profiled attacks leveraging known ciphertext and chosen ciphertext methodologies to recover the secret coefficients individually. The later one can also be extended to the protected implementation.Our practical evaluation, conducted on the stack-optimized Kyber-768 implementation from the pqm4 repository, demonstrates the effectiveness of the proposed attacks. Focusing on the leakage from the pair-pointwise multiplication, specifically the macro doublebasemul_frombytes_asm, we successfully recover all secret coefficients with a success rate exceeding 95% using a modest number of traces for each attack. This research underscores the potential vulnerabilities in PQC implementations against side-channel attacks and contributes to the ongoing discourse on the physical security of cryptographic algorithms.
2025
TCHES
New Quantum Cryptanalysis of Binary Elliptic Curves
This paper improves upon the quantum circuits required for the Shor’s attack on binary elliptic curves. We present two types of quantum point addition, taking both qubit count and circuit depth into consideration.In summary, we propose an in-place point addition that improves upon the work of Banegas et al. from CHES’21, reducing the qubit count – depth product by more than 73% – 81% depending on the variant. Furthermore, we develop an out-of-place point addition by using additional qubits. This method achieves the lowest circuit depth and offers an improvement of over 92% in the qubit count – quantum depth product (for a single step).To the best of our knowledge, our work improves from all previous works (including the CHES’21 paper by Banegas et al., the IEEE Access’22 paper by Putranto et al., and the CT-RSA’23 paper by Taguchi and Takayasu) in terms of circuit depth and qubit count – depth product.Equipped with the implementations, we discuss the post-quantum security of the binary elliptic curve cryptography. Under the MAXDEPTH metric (proposed by the US government’s NIST), the quantum circuit with the highest depth in our work is 224, which is significantly lower than the MAXDEPTH limit of 240. For the gate count – full depth product, a metric for estimating quantum attack cost (proposed by NIST), the highest complexity in our work is 260 for the curve having degree 571 (which is comparable to AES-256 in terms of classical security), considerably below the post-quantum security level 1 threshold (of the order of 2156).
2025
TCHES
On the Average Random Probing Model
Masking is one of the main countermeasures against side-channel analysis since it relies on provable security. In this context, “provable” means that a security bound can be exhibited for the masked implementation through a theoretical analysis in a given threat model. The main goal in this line of research is therefore to provide the tightest security bound, in the most realistic model, in the most generic way. Yet, all of these objectives cannot be reached together. That is why the masking literature has introduced a large spectrum of threat models and reductions between them, depending on the desired trade-off with respect to these three goals. In this paper, we focus on three threat models, namely the noisy-leakage model (realistic yet hard to work with), the random probing (unrealistic yet easy to work with), and more particularly a third intermediate model called average random probing. Average random probing has been introduced by Dziembowski et al. at Eurocrypt 2015, in order to exhibit a tight reduction between noisy-leakage and random probing models, recently proven by Brian et al. at Eurocrypt 2024. This milestone has strong practical consequences, since otherwise the reduction from the noisy leakage model to the random probing model introduces a prohibitively high constant factor in the security bound, preventing security evaluators to use it in practice. However, we exhibit a gap between the average random probing definitions of Dziembowski et al. (denoted hereafter by DFS-ARP) and Brian et al. (simply denoted by ARP). Whereas any noisy leakage can be tightly reduced to DFS-ARP, we show in this paper that it cannot be tightly reduced to ARP, unless requiring extra assumptions, e.g., if the noisy leakage is deterministic. Our proof techniques do not involve more tools than the one used so far in such reductions, namely basic probability facts, and known properties of the total variation distance. As a consequence, the reduction from the noisy leakage to the random probing — without high constant factor — remains unproven. This stresses the need to clarify the practical relevance of analyzing the security of masking in the random probing model since most of the current efforts towards improving the constructions and their security proofs in the random probing model might be hindered by potentially unavoidable loss in the reduction from more realistic but currently less investigated leakage models.
2025
TCHES
On the Characterization of Phase Noise for the Robust and Resilient PLL-TRNG Design
A true random number generator (TRNG) is a critical component in ensuring the security of cryptographic systems. Among TRNG implementations, the phase-locked loop-based TRNG (PLL-TRNG) is a widely adopted solution for FPGA platforms due to the availability of a stochastic model. In the previous study, this stochastic model was based on analog noise signals, which potentially led to an oversimplification of the PLL physical process and resulted in an overestimation of entropy. To address this limitation, we extract key platform-specific parameters of the PLL and develop a new stochastic model tailored for multi-output PLL-TRNGs. For the first time, we reveal the effect of the PLL’s bandwidth on the correlation of sampling points and introduce a method for quantitatively controlling sampling point correlations. Finally, we validate the model through on-chip jitter measurements. Experimental results show that the proposed stochastic model accurately describes the behavior of the PLL-TRNG and provides the most conservative entropy lower bound, with a 1.8-fold improvement in jitter resolution.
2025
TCHES
Optimal Dimensionality Reduction using Conditional Variational AutoEncoder
The benefits of using Deep Learning techniques to enhance side-channel attacks performances have been demonstrated over recent years. Most of the work carried out since then focuses on discriminative models. However, one of their major limitations is the lack of theoretical results. Indeed, this lack of theoretical results, especially concerning the choice of neural network architecture to consider or the loss to prioritize to build an optimal model, can be problematic for both attackers and evaluators. Recently, Zaid et al. addressed this problem by proposing a generative model that bridges conventional profiled attacks and deep learning techniques, thus providing a model that is both explicable and interpretable. Nevertheless the proposed model has several limitations. Indeed, the architecture is too complex, higher-order attacks cannot be mounted and desynchronization is not handled by this model. In this paper, we address the first limitation namely the architecture complexity, as without a simpler model, the other limitations cannot be treated properly. To do so, we propose a new generative model that relies on solid theoretical results. This model is based on conditional variational autoencoder and converges towards the optimal statistical model i.e. it performs an optimal attack. By building on and extending the state-of-the-art theoretical works on dimensionality reduction, we integrate into this neural network an optimal dimensionality reduction i.e. a dimensionality reduction that is achieved without any loss of information. This results in a gain of O(D), with D the dimension of traces, compared to Zaid et al. neural network in terms of architecture complexity, while at the same time enhancing the explainability and interpretability. In addition, we propose a new attack strategy based on our neural network, which reduces the attack complexity of generative models from O(N) to O(1), with N the number of generated traces. We validate all our theoretical results experimentally using extensive simulations and various publicly available datasets covering symmetric, asymmetric pre and post-quantum cryptography implementations.
2025
TCHES
OPTIMSM: FPGA hardware accelerator for Zero-Knowledge MSM
The Multi-Scalar Multiplication (MSM) is the main barrier to accelerating Zero-Knowledge applications. In recent years, hardware acceleration of this algorithm on both FPGA and GPU has become a popular research topic and the subject of a multi-million dollar prize competition (ZPrize). This work presents OPTIMSM: Optimized Processing Through Iterative Multi-Scalar Multiplication. This novel accelerator focuses on the acceleration of the MSM algorithm for any Elliptic Curve (EC) by improving upon the Pippenger algorithm. A new iteration technique is introduced to decouple the required buckets from the window size, resulting in fewer EC computations for the same on-chip memory resources. Furthermore, we combine known optimizations from the literature for the first time to achieve additional latency improvements. Our enhanced MSM implementation significantly reduces computation time, achieving a speedup of up to x12.77 compared to recent FPGA implementations. Specifically, for the BLS12-381 curve, we reduce the computation time for an MSM of size 224 to 914 ms using a single compute unit on the U55C FPGA or to 231 ms using four U55C devices. These results indicate a substantial improvement in efficiency, paving the way for more scalable and efficient Zero-Knowledge proof systems.
2025
TCHES
POTA: A Pipelined Oblivious Transfer Acceleration Architecture for Secure Multi-Party Computation
With the rapid development and deployment of machine learning (ML) and big data technologies, which rely heavily on sensitive user data for training and inference, ensuring privacy and data security has become a pressing challenge. Addressing this issue requires methods that safeguard sensitive information while maintaining the correctness of computational results. Secure multi-party computation (MPC), as a representative application of cryptographic techniques, offers a technical solution to this challenge by enabling privacy-preserving computations. It has been widely applied in scenarios such as cloud-based inference and other privacy-sensitive tasks. However, MPC also introduces significant performance overhead, thus limiting its further application. Our analysis reveals that the foundational element of MPC, the oblivious transfer (OT) protocol collectively account for up to 96.64% of the execution time. It is because the OT protocols are constrained by low network band- width and weak compute engines. To address these challenges, we propose POTA, a high-performance pipelined OT hardware acceleration architecture supporting the silent OT protocol. In the POTA design, we develop efficient subsystems targeting the two most compute-intensive parts: the construction of puncturable pseudoran- dom function (PPRF), and large matrix-vector multiplications under the learning parity with noise (LPN) assumption within the silent OT protocol. In addition, to address the performance overhead caused by data transfer between POTA and the host CPU, we design a host-accelerator execution pipeline to hide the considerable transmission latency. Furthermore, we design a modular multiplication module over a finite field to generate the more complex correlations required by MPC protocols. Finally, we implement a POTA prototype on Xilinx VCU129 FPGAs. Experimental results demonstrate that under various network settings, POTA achieves significant speedups, with maximum improvements of 192.57x for basic operations and 597.57x for convolutional neural networks (CNN).
2025
TCHES
Practical Opcode-based Fault Attack on AES-NI
AES New Instructions (AES-NI) is a set of hardware instructions introduced by Intel to accelerate AES encryption and decryption, significantly improving efficiency across various cryptographic applications. While AES-NI effectively mitigates certain side-channel attacks, its resilience against faults induced by active or malicious fault injection remains unclear.In this paper, we conduct a comprehensive security analysis of AES-NI. By analyzing the opcodes of AES-NI, we identify six pairs of instructions with only a single-bit difference, making them susceptible to bit-flip-type attacks. This vulnerability allows attackers to recover AES keys in both Electronic Codebook (ECB) and Cipher Block Chaining (CBC) modes. We introduce a novel Opcode-based Fault Analysis (OFA) method, employing Gaussian elimination to reduce the search space of the last round key. In particular, with one pair of correct and faulty ciphertexts, OFA can reduce the key search space to 232 for a 128-bit key length. To further reduce the key space for AES-192 and AES-256, we propose the Enhanced Opcode-based Fault Analysis (EOFA), which, compared to exhaustive search, reduces the key space by factors of 2160 and 2192, respectively.Finally, we demonstrate the feasibility of our findings by conducting physical endto- end attacks. Specifically, Rowhammer is leveraged to flip vulnerable opcodes and OFA as well as EOFA techniques are applied to recover secret keys from AES implementations. Our experimental results for AES-ECB-128, AES-ECB-192, and AES-CBC-128 demonstrate that key recovery can be efficiently achieved within 1.03 to 1.36 hours, varying with the cipher. This work highlights a critical vulnerability in AES-NI and outlines a new and novel pathway for fault-based attacks against modern cryptographic implementations.
2025
TCHES
Primitive-Level vs. Implementation-Level DPA Security: a Certified Case Study: (Pleading for Standardized Leakage-Resilient Cryptography)
Implementation-level countermeasures like masking can be applied to any cryptographic algorithm in order to mitigate Differential Power Analysis (DPA). Leveraging re-keying with a Leakage-Resilient PRF (LR-PRF) is an alternative countermeasure that requires a change of primitive. Both options rely on different security mechanisms: signal-to-noise ratio amplification for masking, signal reduction for LRPRFs. This makes their general comparison difficult and suggests the investigation of relevant case studies to identify when to use one or the other as an interesting research direction. In this paper, we provide such a case study and compare the security that can be obtained by using an unprotected hardware coprocessor, to be integrated into a leakage-resilient PRF, and a certified one, protected with implementation-level countermeasures. Both are available on “commercial off-the-shelf” devices and could be used for lightweight IoT applications. We first perform an in-depth analysis of these targets. It allows us to put forward the different evaluation challenges that they raise, and the similar to slightly better cost vs. security tradeoff that the leakage-resilient PRF offers in our experiments. We then discuss the advantages and limitations of both types of countermeasures. While there are contexts where the higher flexibility of masking is needed, we conclude that there are also applications that would strongly benefit from the simplicity of the LR-PRF’s design and evaluation. Positing that the lack of standards is the main impediment to their more widespread deployment, we therefore hope that our results can motivate such standardization efforts.
2025
TCHES
Protection of Oscillator-Based PUFs against Side Channel Analyses by Random Interruption
Oscillation-based physical unclonable functions (PUFs) are known to be sensitive to power trace side channel analyses (SCAs). Although previous work investigated on countermeasures, these required significant additional amount of hardware or were just able to obscure sign information of a frequency comparison, while the magnitude information remains available to the attacker. As recent innovation on oscillation-based PUFs also require the magnitude-information beside the sign, e.g., to increase the reliability, the need arises to protect both. We present a new protection approach to hide both sign and magnitude information of oscillation-based PUF from an attacker. By introducing random interruptions in the oscillation, the power spectrum is blurred while the quality of the PUF is maintained. In addition to concept simulations and the discussion of different implementations, we use the example of a loop-PUF to show that the presented countermeasure can withstand several attack scenarios.
2025
TCHES
REED: Chiplet-based Accelerator for Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) enables privacy-preserving computation and has many applications. However, its practical implementation faces massive computation and memory overheads. To address this bottleneck, several Application-Specific Integrated Circuit (ASIC) FHE accelerators have been proposed. All these prior works put every component needed for FHE onto one chip (monolithic), hence offering high performance. However, they encounter common challenges associated with large-scale chip design, such as inflexibility, low yield, and high manufacturing costs. In this paper, we present the first-of-its-kind multi-chiplet-based FHE accelerator ‘REED’ for overcoming the limitations of prior monolithic designs. To utilize the advantages of multi-chiplet structures while matching the performance of larger monolithic systems, we propose and implement several novel strategies in the context of FHE. These include a scalable chiplet design approach, an effective framework for workload distribution, a custom inter-chiplet communication strategy, and advanced pipelined Number Theoretic Transform and automorphism design to enhance performance.Our instruction-set and power simulations experiments with a prelayout netlist indicate that REED 2.5D microprocessor consumes 96.7mm2 chip area, 49.4Waverage power in 7nm technology. It could achieve a remarkable speedup of up to 2,991x compared to a CPU (24-core 2xIntel X5690) and offer 1.9x better performance, along with a 50% reduction in development costs when compared to state-of-the-art ASIC FHE accelerators. Furthermore, our work presents the first instance of benchmarking an encrypted deep neural network (DNN) training. Overall, the REED architecture design offers a highly effective solution for accelerating FHE, thereby significantly advancing the practicality and deployability of FHE in real-world applications.
2025
TCHES
Rudraksh: A compact and lightweight post-quantum key-encapsulation mechanism
Resource-constrained devices such as wireless sensors and Internet of Things (IoT) devices have become ubiquitous in our digital ecosystem. These devices generate and handle a major part of our digital data. However, due to the impending threat of quantum computers on our existing public-key cryptographic schemes and the limited resources available on IoT devices, it is important to design lightweight post-quantum cryptographic (PQC) schemes suitable for these devices.In this work, we explored the design space of learning with error-based PQC schemes to design a lightweight key-encapsulation mechanism (KEM) suitable for resourceconstrained devices. We have done a scrupulous and extensive analysis and evaluation of different design elements, such as polynomial size, field modulus structure, reduction algorithm, and secret and error distribution of an LWE-based KEM. Our explorations led to the proposal of a lightweight PQC-KEM, Rudraksh, without compromising security. Our scheme provides security against chosen ciphertext attacks (CCA) with more than 100 bits of Core-SVP post-quantum security and belongs to the NIST-level-I security category (provide security at least as much as AES-128). We have also shown how ASCON can be used for lightweight pseudo-random number generation and hash function in the lattice-based KEMs instead of the widely used Keccak for lightweight design. Our FPGA results show that Rudraksh currently requires the least area among the PQC KEMs of similar security. Our implementation of Rudraksh provides a ~3x improvement in terms of the area requirement compared to the state-of-the-art areaoptimized implementation of Kyber, can operate at 63%-76% higher frequency with respect to high-throughput Kyber, and improves time-area-product ~2x compared to the state-of-the-art compact implementation of Kyber published in HPEC 2022.
2025
TCHES
Scoop: An Optimization Algorithm for Profiling Attacks against Higher-Order Masking
In this paper we provide new theoretical and empirical evidences that gradient-based deep learning profiling attacks (DL-SCA) suffer from masking schemes. This occurs through an initial stall of the learning process: the so-called plateau effect. To understand why, we derive an analytical expression of a DL-SCA model targeting simulated traces which enables us to study an analytical expression of the loss. By studying the loss landscape of this model, we show that not only do the magnitudes of the gradients decrease as the order of masking increases, but the loss landscape also exhibits a prominent saddle point interfering with the optimization process. From these observations, we (1) propose the usage of a second-order optimization algorithm mitigating the impact of low-gradient areas. In addition, we show how to leverage the intrinsic sparsity of valuable information in SCA traces to better pose the DL-SCA problem. To do so, we (2) propose to use the implicit regularization properties of the sparse mirror descent. These propositions are gathered in a new publicly available optimization algorithm, Scoop. Scoop combines second-order derivative of the loss function in the optimization process, with a sparse stochastic mirror descent. We experimentally show that Scoop pushes further the current limitations of DL-SCA against simulated traces, and outperforms the state-of-theart on the ASCADv1 dataset in terms of number of traces required to retrieve the key, perceived information and plateau length. Scoop also performs the first nonworst- case attack on the ASCADv2 dataset. On simulated traces, we show that using Scoop reduces the DL-SCA time complexity by the equivalent of one masking order.
2025
TCHES
SeaFlame: Communication-Efficient Secure Aggregation for Federated Learning against Malicious Entities
Secure aggregation is a popular solution to ensuring privacy for federated learning. However, when considering malicious participants in secure aggregation, it is difficult to achieve both privacy and high efficiency. Therefore, we propose SeaFlame, a communication-efficient secure aggregation protocol against malicious participants. Inspired by the state-of-the-art work, ELSA, SeaFlame also utilizes two non-colluding servers to ensure privacy against malicious entities and provide defenses against boosted gradients. Crucially, to improve communication efficiency, SeaFlame uses arithmetic sharing together with arithmetic-to-arithmetic share conversion to reduce client communication, and uses the random linear combination to reduce server communication.Security analysis proves that our SeaFlame guarantees privacy against malicious clients colluding with one malicious server. Experimental evaluation demonstrates that, compared to ELSA, SeaFlame optimizes communication by up to 10.5, 6.00, and 8.17 times for a client, a server, and all entities, at the expense of 1.25-1.86 times additional end-to-end runtime.
2025
TCHES
Secure and efficient transciphering for FHE-based MPC
Transciphering (or Hybrid-Homomorphic Encryption, HHE) is an established technique for avoiding ciphertext expansion in HE applications, saving communication and storage resources. Recently, it has also been shown to be a fundamental component in the practical construction of HE-based multi-party computation (MPC) protocols, being used both for input data and intermediary results (Smart, IMACC 2023). In these protocols, however, ciphers are used with keys that are jointly generated by multiple (possibly malicious) parties, which may require additional security assumptions that have been so far overlooked in the HHE literature. In this paper, we formalize this issue as a security against related-key attacks (RKA) problem and provide efficient solutions for it. We start by presenting an efficient method for homomorphically evaluating Mixed-Filter-Permutator (MFP) ciphers in leveled mode, enabling speedups of up to thousands of times compared to previous literature. For the multi-party scenario, we focus specifically on the Margrethe cipher (Hoffmann et al., INDOCRYPT 2023). We show that, contrary to other commonly used HHE ciphers (e.g. FLIP), Margrethe is out-of-the-box secure for any protocols that allow malicious parties to learn up to two related key streams, enabling security for the vast majority of static MPC protocols. For other cases, we quantify the loss of security based on the number of related key streams (which often depends on the number of malicious parties and specific protocol). Performance-wise, our implementation of Margrethe takes just 3.9 ms to transcipher 4-bit messages, being significantly faster than the state of the art in terms of latency.
2025
TCHES
Shortcut2Secrets: A Table-based Differential Fault Attack Framework
Recently, Differential Fault Attacks (DFAs) have proven highly effective against stream ciphers designed for Hybrid Homomorphic Encryption (HHE). In this work, we present a table-based DFA framework called the shortcut attack, which generalizes the attack proposed by Wang and Tang on the cipher Elisabeth. The framework applies to a broad sub-family of ciphers following the Group Filter Permutator (GFP) paradigm and enhances previous DFAs by improving both the fault identification and path generation steps. Notably, the shortcut attack circumvents the issue of function representation, allowing successful attacks even when the cipher’s filter function cannot be represented over the ring it is defined on.Additionally, we provide complexity estimates for the framework and apply the shortcut attack to Elisabeth-4 and its patches. As a result, we optimize the DFA on Elisabeth-4, requiring fewer keystreams and running faster than previous methods. Specifically, we achieve a DFA that requires only 3000 keystreams, which is one-fifth of the previous best result. We also successfully mount a practical DFA on Gabriel-4 and provide a theoretical DFA for Elisabeth-b4.For the latest patch, Margrethe-18-4, which follows the more general Mixed Filter Permutator (MFP) paradigm, we present a DFA in a stronger model. To the best of our knowledge, these are the first DFA results on the patches of Elisabeth-4. Finally, we derive security margins to prevent shortcut attacks on a broad sub-family of MFP ciphers, which can serve as parameter recommendations for designers.
2025
TCHES
Sieving with Streaming Memory Access
We implement an optimized BGJ (Becker–Gama–Joux 2015) sieve and analyze its behavior in a study of RAM access overheads (and their minimization) in sieving algorithms for large lattice problems. Both experiment and theory points to BGJ’s inherent structure being much more memory-efficient than the BDGL (Becker–Ducas– Gama–Laahoven 2016) sieve, which uses asymptotically the fewest logical operations. In particular, a dimension-n BGJ sieve uses only 20.2075n+o(n) streaming (non-random) main memory accesses. A key insight: Bucket sizes decrease by orders of magnitude after each BGJ filtering layer, so that sub-buckets fit into successively much smaller (hence faster) storage areas. Our refined BGJ is competitive at cryptographic sizes and should outperform BDGL for all practically achievable dimensions.The above is corroborated by the results from our efficient CPU-based BGJ implementation in an optimized framework, which saves about 40% RAM footprint and is ≥ 24.5x more efficient gate-count-wise compared to the Ducas–Stevens–van Woerden 2021 4-GPU implementation, which like most prior sieving-based SVP computations is a HK3 (Herold–Kirshanova 2017) sieve. Notably, we solved the 183-dimensional SVP Darmstadt Challenge in 30 days on a 112-core server and 0.87 TB of RAM; similarly we also found a short vector in the 796-dimensional Ideal-SVP Challenge. Our implementation may offer further insights into the behavior of asymptotically “fast” sieving algorithms when applied to large-scale problems. Moreover, our refined cost estimation of SVP based on this implementation suggests that some NIST PQC candidates (e.g. Falcon-512), are not sure to meet NIST’s security requirements.
2025
TCHES
SimdMSM: SIMD-accelerated Multi-Scalar Multiplication Framework for zkSNARKs
Multi-scalar multiplication (MSM) is the primary building block in many pairing-based zero-knowledge proof (ZKP) systems. MSM at large scales has become the main bottleneck in ZKP implementations. Inspired by existing SIMD-accelerated work, we are focused on accelerating MSM computing efficiency using SIMD instructions in a single CPU environment. First, we propose a SIMD-accelerated MSM computing architecture with no write conflicts and constant memory overheads. This architecture utilizes multithreading to achieve task-level and loop-level parallelism and employs a three-tier buffer mechanism to maximize the utilization of the SIMD engine. Instanced with AVX512-IFMA instructions, we implement six SIMD elliptic curve arithmetic engines for different point addition in three coordinate systems and two groups. Moreover, we integrate our AVX-MSM implementation into the libsnark library, naming it AVX-ZK. In more detail, point deduplication and “Three-Stage” memory optimization are proposed to address problems existing in practical applications. Based on the RELIC library, our performance results on the BLS12-381 curve show that our AVX-MSM achieves up to 27.86x speedup over the most popular Pippenger algorithm. Compared with libsnark, our AVX-ZK implementation achieves over 11.53x (up to 20.26x) speedup under standard benchmarks.
2025
TCHES
Skyscraper: Fast Hashing on Big Primes
Arithmetic hash functions defined over prime fields have been actively developed and used in verifiable computation (VC) protocols. Among those, ellipticcurve- based SNARKs require large (256-bit and higher) primes. Such hash functions are notably slow, losing a factor of up to 1000 compared to regular constructions like SHA-2/3.In this paper, we present the hash function Skyscraper, which is aimed at large prime fields and provides major improvements compared to Reinforced Concrete and Monolith. First, the design is exactly the same for all large primes, which simplifies analysis and deployment. Secondly, it achieves a performance comparable to cryptographic hash standards by using low-degree non-invertible transformations and minimizing modulo reductions. Concretely, it hashes two 256-bit prime field (BLS12-381 curve scalar field) elements in 135 nanoseconds, whereas SHA-256 needs 42 nanoseconds on the same machine.The low circuit complexity of Skyscraper, together with its high native speed, should allow a substantial reduction in many VC scenarios, particularly in recursive proofs.
2025
TCHES
SoK: FHE-Friendly Symmetric Ciphers and Transciphering
Fully Homomorphic Encryption (FHE) enables computation on encrypted data without decryption, demonstrating significant potential for privacy-preserving applications. However, FHE faces several challenges, one of which is the significant plaintext-to-ciphertext expansion ratio, resulting in high communication overhead between client and server. The transciphering technique can effectively address this problem by first encrypting data with a space-efficient symmetric cipher, then converting symmetric ciphertext to FHE ciphertext without decryption.Numerous FHE-friendly symmetric ciphers and transciphering methods have been developed by researchers, each with unique advantages and limitations. These often require extensive knowledge of both symmetric cryptography and FHE to fully grasp, making comparison and selection among these schemes challenging. To address this, we conduct a comprehensive survey of over 20 FHE-friendly symmetric ciphers and transciphering methods, evaluating them based on criteria such as security level, efficiency, and compatibility. We have designed and executed experiments to benchmark the performance of the feasible combinations of symmetric ciphers and transciphering methods across various application scenarios. Our findings offer insights into achieving efficient transciphering tailored to different task contexts. Additionally, we make our example code available open-source, leveraging state-of-the-art FHE implementations.
2025
TCHES
Tailorable codes for lattice-based KEMs with applications to compact ML-KEM instantiations
Compared to elliptic curve cryptography, a primary drawback of latticebased schemes is the larger size of their public keys and ciphertexts. A common procedure for compressing these objects consists essentially of dropping some of their least significant bits. Albeit effective for compression, there is a limit to the number of bits to be dropped before we get a noticeable decryption failure rate (DFR), which is a security concern. To address this issue, this paper presents a family of error-correction codes that, by allowing an increased number of dropped bits while preserving a negligible DFR, can be used for both ciphertext and publickey compression in modern lattice-based schemes. To showcase the impact and practicality of our proposal, we use the highly optimized ML-KEM, a post-quantum lattice-based scheme recently standardized by NIST. We provide detailed procedures for tailoring our codes to ML-KEM’s specific noise distributions, and show how to analyze the DFR without independence assumptions on the noise coefficients. Among our results, we achieve between 4% and 8% ciphertext compression for MLKEM. Alternatively, we obtain 8% shorter public keys compared to the current standard. We also present isochronous implementations of the decoding procedure, achieving negligible performance impact in the full ML-KEM decapsulation even when considering optimized implementations for AVX2, Cortex-M4, and Cortex-A53.
2025
TCHES
TFHE Gets Real: an Efficient and Flexible Homomorphic Floating-Point Arithmetic
Floating-point arithmetic plays a central role in computer science and is used in various domains where precision and computational scale are essential. One notable application is in machine learning, where Fully Homomorphic Encryption (FHE) can play a crucial role in safeguarding user privacy. In this paper, we focus on TFHE and develop novel homomorphic operators designed to enable the construction of precise and adaptable homomorphic floating-point operations. Integrating floating-point arithmetic within the context of FHE is particularly challenging due to constraints such as small message space and the lack of information during computation. Despite these challenges, we were able to determine parameters for common precisions (e.g., 32-bit, 64-bit) and achieve remarkable computational speeds, with 32-bit floating-point additions completing in 2.5 seconds and multiplications in approximately 1 second in a multi-threaded environment. These metrics provide empirical evidence of the efficiency and practicality of our proposed methods, which significantly outperform previous efforts. Our results demonstrate a significant advancement in the practical application of FHE, making it more viable for real-world scenarios and bridging the gap between theoretical encryption techniques and practical usability.
2025
TCHES
ToFA: Towards Fault Analysis of GIFT and GIFT-like Ciphers Leveraging Truncated Impossible Differentials
In this work, we introduce ToFA, the first fault attack (FA) strategy that attempts to leverage the classically well-known idea of impossible differential cryptanalysis to mount practically verifiable attacks on bit-oriented ciphers like GIFT and BAKSHEESH. The idea stems from the fact that truncated differential paths induced due to fault injection in certain intermediate rounds of the ciphers lead to active SBox-es in subsequent rounds whose inputs admit specific truncated differences. This leads to a (multi-round) impossible differential distinguisher, which can be incrementally leveraged for key-guess elimination via partial decryption. The key-space reduction further exploits the multi-round impossibility, capitalizing on the relations due to the quotient-remainder (QR) groups of the GIFT and BAKSHEESH linear layer, which increases the filtering capability of the distinguisher. Moreover, the primary observations made in this work are independent of the actual SBox. Clock glitch based fault attacks were mounted on 8-bit implementations of GIFT- 64/GIFT-128 using a ChipWhisperer Lite board on an 8-bit ATXmega128D4-AU micro-controller. Unique key recovery was achieved for GIFT-128 with 3 random byte faults, while for GIFT-64, key space was reduced to 232, the highest achievable for GIFT-64, with a single level fault due to its key-schedule. To the best of our knowledge, this work also reports the highest fault injection penetration for any variant of GIFT and BAKSHEESH. Finally, this work reiterates the role of classical cryptanalysis strategies in fault vulnerability assessment by showcasing the most efficient fault attacks on GIFT.
2025
TCHES
TREE: Bridging the gap between reconfigurable computing and secure execution
Trusted Execution Environments (TEEs) have become a pivotal technology for securing a wide spectrum of security-sensitive applications. With modern computing systems shifting to heterogeneous architectures, integrating TEE support into these systems is paramount. One promising line of research has proposed leveraging FPGA technology to provide promising TEE solutions. Despite their potential, current implementations of FPGA-based TEEs have a set of drawbacks. Some solutions (i.e., MeetGo and ShEF) prioritize the secure loading of reconfigurable modules but lack compatibility with established legacy TEE specifications and services. On the other hand, those that aim to establish legacy compatibility (i.e., TEEOD and BYOTee) fail to fully utilize the dynamic reconfigurability and parallel processing capabilities inherent in FPGAs. In this context, we introduce Trusted Reconfigurable Execution Environments (TREE), a novel framework that fulfills the gaps existing in current FPGA-based TEE approaches. TREE enables system designers to fully leverage the reconfigurability capabilities of FPGAs without compromising compatibility with existing TEE specifications. Our reference TREE implementation ensures secure execution of user-customized hardware, legacy software trusted applications (TAs), and TAs that combine both custom hardware and software components, by fully exploiting the FPGA’s dynamic partial reconfiguration capabilities. TREE’s root of trust relies on conventional SoC-FPGA mechanisms including secure initial reconfiguration and memory protection, to ensure the initial bitstream integrity is kept after loaded and that reconfiguration access is restricted to the FPGA fabric after boot. Additionally, TREE provides essential TEE services within the FPGA fabric, including secure storage and cryptographic functions, enabling TAs to securely store sensitive data and perform critical operations in an isolated environment. Our evaluation on an entry-level FPGA, involved assessing TREE using microbenchmarks and real-world applications to compare its hardware costs and performance speedups against OP-TEE. The results showed that TREE’s hardware costs are minimal, while it achieves significant performance speedups, especially when compared to hardware TAs. For empirical demonstrations, we assess two real-world TA examples on TREE: an access control authenticator and a Bitcoin wallet.
2025
TCHES
VeloFHE: GPU Acceleration for FHEW and TFHE Bootstrapping
Bit-wise Fully Homomorphic Encryption schemes like FHEW and TFHE offer efficient functional bootstrapping, enabling concurrent function evaluation and noise reduction. While advantageous for secure computations, these schemes suffer from high data expansion, posing significant performance challenges in practical ap- plications due to massive ciphertexts. To address these issues, we propose VeloFHE, a CUDA-accelerated design to enhance the efficiency of FHEW and TFHE schemes on GPUs. We develop a novel hybrid four-step Number Theoretic Transform (NTT) approach for fast polynomial multiplication. By decomposing large-scale NTTs into highly parallelizable submodules, incorporating cyclic and negacyclic convolutions, and introducing several memory-oriented optimizations, we significantly reduce both the computational complexity and memory requirements. For blind rotation, besides the gadget decomposition approach, we also apply a recent proposed modulus raising technique to both schemes to alleviate memory pressure. We further optimize it by refining computational flow to reduce noise from scaling and maintain accumulator compatibility. For key switching, we address input-output parallelism mismatches, and offloading suitable computations to the CPU, effectively hiding latency through asynchronous execution. Additionally, we explore batching in bootstrapping, de- veloping a general framework that accommodates both schemes with either gadget decomposition or modulus raising method.Our experimental results demonstrate significant performance improvements. The proposed NTT implementation shows over 35% improvement compared to recent GPU implementations. On an RTX 4090 GPU, we achieve speedups of 371.86x and 390.44x for FHEW and TFHE gate bootstrapping, respectively, compared to OpenFHE running on a 48-thread CPU at a 128-bit security level. The corresponding throughputs are 7,007 and 11,378 operations per second. Furthermore, relative to the state-of-the-art GPU implementation [XLK+25], our approach provides speedups of 2.56x, 2.24x, and 2.33x for TFHE gate bootstrapping, homomorphic evaluation of arbitrary functions, and homomorphic flooring operation, respectively. Our VeloFHE surpasses some current hardware designs, offering an effective solution for more practical and efficient privacy-preserving computations.