Papers from Transactions on Cryptographic Hardware and Embedded Systems 2025

CryptoDB

Year

Venue

Title

2025

TCHES

A Code-Based ISE to Protect Boolean Masking in Software

Qi Tian Hao Cheng Chun Guo Daniel Page Meiqin Wang Weija Wang

Side-Channel Attacks (SCAs) pose a significant threat to data security in embedded environments. To counteract power-based SCAs, masking is a widely used defense technique that introduces randomness to obscure the side-channel information generated during the processing of secret data. However, in practice, some challenges exist when implementing masking schemes. For example, implementations of Boolean masking may suffer from low noise levels and implementation flaws. To address these implementation challenges, we present an effective and efficient solution that incorporates the code-based masking technique: We mask the shares of Boolean masking with code-based masking and then use a self-designed Instruction Set Extension (ISE) to perform efficient private computations within this masked domain. Based on a 32-bit RISC-V Ibex core, we develop a prototype implementation of our ISE, whereby it mainly wraps the ALU with three code-based encoders/decoders and integrates a leakage-resilient pseudo-random generator (PRG). Compared to the base core (vanilla Ibex), the hardware overhead of the ISE implementation is only 8%. The security evaluation based on formal verification and practical evaluation demonstrates that our ISE can provide a more robust practical security guarantee. Furthermore, our approach significantly reduces the signal-to-noise ratio (SNR) of each share, decreasing it to just 2% of the original SNR on the base core.
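For readers unfamiliar with Boolean masking, the idea the abstract builds on can be sketched in a few lines. This is a generic illustration only, not the paper's ISE or its code-based encoding; the function names `share`, `unshare`, and `masked_xor` are hypothetical, and the 32-bit word size is an assumption chosen to match the Ibex core mentioned above.

```python
import secrets

def share(x, n=2, bits=32):
    """Split x into n Boolean shares whose XOR equals x (illustrative helper)."""
    mask = (1 << bits) - 1
    shares = [secrets.randbits(bits) for _ in range(n - 1)]
    last = x & mask
    for s in shares:
        last ^= s  # fold each random share into the final share
    return shares + [last]

def unshare(shares):
    """Recombine shares by XOR-ing them together."""
    acc = 0
    for s in shares:
        acc ^= s
    return acc

def masked_xor(a, b):
    """XOR is linear over GF(2), so it can be applied share by share."""
    return [x ^ y for x, y in zip(a, b)]
```

Linear operations such as XOR never need to recombine the shares. Non-linear operations (AND, S-boxes) require dedicated gadgets and fresh randomness, which is where implementation flaws of the kind the abstract mentions tend to arise.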

2025

TCHES

A Fast Heuristic for Mapping Boolean Circuits to Functional Bootstrapping

Sergiu Carpov

Functional bootstrapping in FHE schemes such as FHEW and TFHE allows the evaluation of arbitrary functions on encrypted data, while simultaneously reducing noise. Implementing programs that directly use functional bootstrapping is challenging and error-prone. In this paper, we propose a heuristic that automatically maps Boolean circuits to functional bootstrapping instructions. Unlike prior approaches, our method does not limit the encrypted data plaintext space to a power-of-two size, allowing the instantiation of functional bootstrapping with smaller parameters. Furthermore, the negacyclic property of functional bootstrapping is exploited to extend the plaintext effective space. Despite the inherently greedy nature of the heuristic, experimental results show that the mapped circuits exhibit a significant reduction in evaluation time. Our heuristic demonstrates a 45% reduction in evaluation time when compared to hand-optimized Trivium and Kreyvium implementations.

2025

TCHES

A New Perspective on Key Switching for BGV-like Schemes

Johannes Mono Tim Güneysu

Fully homomorphic encryption is a promising approach when computing on encrypted data, especially when sensitive data is involved. For BFV, BGV, and CKKS, three state-of-the-art encryption schemes, the most costly homomorphic primitive is the so-called key switching. While a decent amount of research has been devoted to optimizing other aspects of these schemes, key switching has gone largely untouched. One exception has been a recent work [KLSS23] introducing a new double-decomposition technique. Their contributions are an important addition to the current state-of-the-art with one flaw: They take a limited perspective on key switching parameters and their asymptotic complexity, which leads to incorrect conclusions about how effective their approach really is. In our work, we take a deep dive into key switching and correct, enhance, and improve the current state-of-the-art. We provide a new perspective on key switching parameters for the single- and double-decomposition techniques, respectively, and show that the former outperforms the latter in most scenarios. Additionally, we revisit an idea by Gentry, Halevi, and Smart [GHS12b] and reduce the number of multiplications.

2025

TCHES

A New Trick for Polynomial Multiplication: A verified CRT polymul utilizing a monomial factor

Chun-Ming Chiu Bo-Yin Yang Bow-Yaw Wang

In this paper we present a novel transformation strategy for polynomial multiplications and apply it to NTRU Prime, specifically the parameter sets sntrup761 and ntrulpr761 working in the ring Z_4591[x]/⟨x^761 − x − 1⟩. To evaluate the practicality of our idea, we implemented the algorithm in C++ with ARM Neon intrinsics. By further exploiting the various optimization opportunities in the transformation process, we achieve state-of-the-art performance on Cortex-A72. Because of the aggressively lazy modular reduction strategy, overflows are of serious concern. Such errors in an optimized implementation are notoriously difficult to detect using traditional test vectors. To this end, the compiled binary file is formally verified using the tool CryptoLine. We use all the features in the current version of CryptoLine. This includes the Integer Set Library for range checking, plus the Logical Equivalence Checking to verify the correctness of the binary produced with the most optimized compiler setting by showing it as being equivalent to a binary from a less optimized compilation.

2025

TCHES

A TRAP for SAT: On the Imperviousness of a Transistor-Level Programmable Fabric to Satisfiability-Based Attacks

Aric Fowler Shayan Mohammed Mustafa Shihab Thomas Broadfoot Peter Beerel Carl Sechen Yiorgos Makris

Locking-based intellectual property (IP) protection for integrated circuits (ICs) being manufactured at untrusted facilities has been largely defeated by the satisfiability (SAT) attack, which can retrieve the secret key needed for instantiating proprietary functionality on locked circuits. As a result, redaction-based methods have gained popularity as a more secure way of protecting hardware IP. Among these methods, transistor-level programming (TRAP) prohibits the outright use of SAT attacks due to the mismatch between the logic-level at which SAT attack operates and the switch-level at which the TRAP fabric is programmed. Herein, we discuss the challenges involved in launching SAT attacks on TRAP and we propose solutions which enable expression of TRAP in propositional logic modeling in a way that accurately reflects switch-level circuit capabilities. Results obtained using a transistor-level SAT attack tool-set that we developed and are releasing corroborate that SAT attacks can be launched against TRAP. However, the increased complexity of switch-level circuit modeling prevents the attack from realistically compromising all but the most trivial IP-protected designs.

2025

TCHES

A5/3 make or break: A massively parallel FPGA architecture for exhaustive key search

Konstantina Miteloudi Lejla Batina Nele Mentens

In this paper, we have designed and implemented a massively parallel FPGA architecture for exhaustive key search on the A5/3 encryption algorithm. A5/3 is based on KASUMI, has an effective key of 64 bits, and is used in GSM (2G) mobile telephony systems. Despite the widespread adoption of more advanced technologies (4G, 5G), 2G networks remain as fallback options. In our novel hardware architecture, we use an AMD-Xilinx Alveo U250 card, with its FPGA configured to operate with 104 cores clocked at 496.7 MHz, that can evaluate 2^35.59 keys/sec. Our results show that the $1 million attack can be achieved with 128 Alveo U250 cards, on average, in 16 days.
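The figures in this abstract can be cross-checked with a short calculation. The sketch below assumes that each core tests one key per clock cycle (a fully pipelined design, which the abstract does not state explicitly) and that an average-case search covers half of the 64-bit key space; the variable names are illustrative.

```python
import math

CORES = 104          # cores per Alveo U250, from the abstract
CLOCK_HZ = 496.7e6   # clock frequency, from the abstract
CARDS = 128          # cards in the attack setup, from the abstract
KEY_BITS = 64        # effective A5/3 key length, from the abstract

# Assumption: one key tested per core per clock cycle.
keys_per_sec = CORES * CLOCK_HZ
log2_rate = math.log2(keys_per_sec)

# Average case: the key is found after searching half the key space.
avg_keys = 2 ** (KEY_BITS - 1)
avg_days = avg_keys / (CARDS * keys_per_sec) / 86400

print(f"per-card rate: 2^{log2_rate:.2f} keys/sec")
print(f"average search time on {CARDS} cards: {avg_days:.1f} days")
```

Under these assumptions the per-card rate and the average search time come out consistent with the values quoted in the abstract.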

2025

TCHES

ABE Cubed: Advanced Benchmarking Extensions for ABE Squared

Sven Argo Marloes Venema Doreen Riepel Tim Güneysu Diego F. Aranha

Since attribute-based encryption (ABE) was proposed in 2005, it has established itself as a valuable tool in the enforcement of access control. For practice, it is important that ABE satisfies many desirable properties such as multi-authority and negations support. Nowadays, we can attain these properties simultaneously, but none of these schemes have been implemented. Furthermore, although simpler schemes have been optimized extensively on a structural level, there is still much room for improvement for these more advanced schemes. However, even if we had schemes with such structural improvements, we would not have a way to benchmark and compare them fairly to measure the effect of such improvements. The only framework that aims to achieve this goal, ABE Squared (TCHES ’22), was designed with simpler schemes in mind. In this work, we propose the ABE Cubed framework, which provides advanced benchmarking extensions for ABE Squared. To motivate our framework, we first apply structural improvements to the decentralized ciphertext-policy ABE scheme supporting negations presented by Riepel, Venema and Verma (ACM CCS ’24), which results in five new schemes with the same properties. We use these schemes to uncover and bridge the gaps in the ABE Squared framework. In particular, we observe that advanced schemes depend on more “variables” that affect the schemes’ efficiency in different dimensions. Whereas ABE Squared only considered one dimension (as was sufficient for the schemes considered there), we devise a benchmarking strategy that allows us to analyze the schemes in multiple dimensions. As a result, we obtain a more complete overview of the computational efficiency of the schemes, and ultimately, this allows us to make better-founded choices about which schemes provide the best efficiency trade-offs for practice.

2025

TCHES

Accelerating EdDSA Signature Verification with Faster Scalar Size Halving

Muhammad ElSheikh İrem Keskinkurt Paksoy Murat Cenk M. Anwar Hasan

This paper establishes that the extended Euclidean algorithm (EEA) implemented in a division-free manner is faster than the Lagrange algorithm with a similar level of optimization when it comes to halving the size of scalars found in the equations of elliptic curve signature verification. Our implementation results show that our EEA-based method achieves roughly 4x speed-up for generating half-size scalars used in EdDSA. For the first time, EEA-generated half-size scalars are used for verification of individual Ed25519 signatures, yielding timing results that outperform ed25519-donna, a highly optimized open source implementation, by 16.12%. We also propose a new randomization method applied with half-size scalars to batch verification of Ed25519 signatures, for which we report speed-ups compared to the well-known Bernstein et al. method for batch sizes larger than six; specifically, our method achieves 11.60% improvement for batch size 64.

2025

TCHES

Accelerating Hash-Based Polynomial Commitment Schemes with Linear Prover Time

Florian Hirner Florian Krieger Constantin Piber Sujoy Sinha Roy

Zero-knowledge proofs (ZKPs) are cryptographic protocols that enable one party to prove the validity of a statement without revealing any information beyond its truth. Central building blocks in many ZKPs are polynomial commitment schemes (PCS) where constructions with linear-time provers are especially attractive. Two such examples are Brakedown and its extension Orion, which enable linear-time and quantum-resistant proving by leveraging linear-time encodable Spielman codes. However, these PCS operate over large datasets, creating significant computational bottlenecks. For example, committing to and proving a degree 2^28 polynomial requires around 1.1 GB of data while taking 463 seconds on a high-end server CPU. This work addresses the performance bottleneck in Orion-like PCS by optimizing their most critical operations: Spielman encoding and Merkle commitments. These operations involve gigabytes of data and suffer from random off-chip memory access patterns that drastically reduce off-chip bandwidth. We resolve this issue and introduce inverted expander graphs to eliminate random writes and reduce off-chip memory accesses by over 50%. Additionally, we propose an on-the-fly graph sampling method that avoids streaming large auxiliary data by generating expander graphs dynamically on-chip. We also provide a formal security proof for our proposed graph transformation. Beyond encoding, we accelerate Merkle Tree construction over large data sets through a scalable multi-pass SHA3 pipeline. Finally, we reutilize existing hardware components used in commitment to accelerate the so-called proximity and consistency checks during proof generation. Building upon these concepts, we present the first hardware architecture for PCS – with linear prover time – on a Xilinx Alveo U280 FPGA. In addition, we discuss the practical challenges of manually partitioning, placing, and routing our large-scale architecture to efficiently map it to the multi-SLR and HBM-equipped FPGA.
The final implementation achieves a speedup of two orders of magnitude for full proof generation, covering commitment and proving steps. When combined with Virgo as an outer CP-SNARK protocol, our accelerator reduces end-to-end latency by up to 3.85x – close to the theoretical maximum of 3.9x.

2025

TCHES

Accelerating NTT with RISC-V Vector Extension for Fully Homomorphic Encryption

Tiago B. Rodrigues Alexandre Rodrigues Manuel Goulão Pedro Tomás Leonel Sousa

Fully Homomorphic Encryption (FHE) has gained increasing importance mainly due to its potential use in privacy-preserving cloud computing. This privacy stems from the computation being directly performed on data that is encrypted by the client. However, FHE comes with a major cost regarding computational requirements. When compared to processing in the unencrypted domain, the time it takes can be up to four orders of magnitude higher, which is particularly inconvenient for applications with time constraints. Hence, accelerating FHE is a major research line, ranging from leveraging different mathematical schemes and algorithms to using specialized hardware accelerators that target the most time-consuming operations. This paper targets the optimization of FHE by leveraging vectorized implementations in RISC-V processors, using the RISC-V Vector (RVV) extension. In particular, it implements and accelerates the open-source FHE library OpenFHE, optimizing its Number Theoretic Transform (NTT) and Inverse-NTT (INTT) components. In this library, different FHE algorithms (BGV, BFV, CKKS) were analyzed, optimized, and tested. For the NTT and INTT operations, a maximum speedup of 27.05x was obtained. Furthermore, for a multiplication with bootstrapping benchmark program in OpenFHE, a speedup of 1.94x for the CKKS scheme was attained. Additionally, neural network benchmarks exhibit a speedup of over 1.69x.
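As background on the operation being accelerated, the following is a toy NTT written directly from its definition. It is a quadratic-time cyclic transform over a tiny modulus, chosen for readability; it is not OpenFHE's implementation, which uses fast butterfly-based negacyclic transforms over much larger moduli. The parameters `Q`, `OMEGA`, and all function names are assumptions made for this illustration.

```python
Q = 17      # toy prime modulus (assumption for illustration)
OMEGA = 2   # 2 has multiplicative order 8 modulo 17, so it suits length-8 inputs

def ntt(a, omega=OMEGA, q=Q):
    """Forward transform: A[k] = sum_j a[j] * omega^(j*k) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]

def intt(A, omega=OMEGA, q=Q):
    """Inverse transform: a[j] = n^-1 * sum_k A[k] * omega^(-j*k) mod q."""
    n = len(A)
    n_inv = pow(n, -1, q)          # modular inverse, Python 3.8+
    omega_inv = pow(omega, -1, q)
    return [(n_inv * sum(A[k] * pow(omega_inv, j * k, q) for k in range(n))) % q
            for j in range(n)]

def cyclic_mul(a, b, q=Q):
    """Multiply modulo x^n - 1 via pointwise products in the NTT domain."""
    return intt([(x * y) % q for x, y in zip(ntt(a), ntt(b))])

def schoolbook_cyclic(a, b, q=Q):
    """Reference cyclic convolution, used only to check cyclic_mul."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + a[i] * b[j]) % q
    return out
```

The pointwise product in `cyclic_mul` is what makes the transform attractive: once both operands are in the NTT domain, polynomial multiplication costs only n modular multiplications, so the transforms themselves dominate the runtime.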

2025

TCHES

Adaptive Template Attacks on the Kyber Binomial Sampler

Eric Chun-Yu Peng Markus G. Kuhn

Template attacks build a Gaussian multivariate model of the side-channel leakage signal generated by each value of a targeted intermediate variable. Combined with additional steps, such as dimensionality reduction, such models can help to infer a value with nearly 100% accuracy from just a single attack trace. We demonstrate this here by reconstructing the output of the binomial sampler of a Cortex-M4 implementation of the Kyber768 post-quantum key-encapsulation mechanism. However, this performance is usually significantly diminished if the device, or even just the address space, used for profiling differs from the attacked one. Here we introduce a new technique for adapting templates generated from profiling devices in order to attack another device where we are also able to record many traces, but without knowledge of the random value held by the targeted variable. We interpret the model from the profiling devices as a Gaussian mixture and use the Expectation–Maximization (EM) algorithm to adapt its means and covariances to better match the unlabelled leakage distribution observed from the attacked setting. The Kyber binomial sampler turned out to be a particularly suitable target, for two reasons. Firstly, it generates a long sequence of values drawn from a small set, limiting the number of Gaussian components that need to be adjusted. Secondly, the length of this sequence requires particularly well-adapted templates to achieve a high key-recovery success rate from a single trace. We also introduce an extended point-of-interest selection method to improve linear discriminant analysis (LDA).

2025

TCHES

AETHER: An Ultra-High Throughput and Low Energy Authenticated Encryption Scheme

Subhadeep Banik Andrea Caforio Tatsuya Ishikawa Takanori Isobe Mostafizar Rahman Kosei Sakamoto

In this paper, we introduce AETHER, an authenticated encryption scheme that achieves ultra-high throughput and low energy consumption, supporting a 256-bit key and a 128-bit tag. While inspired by an AEGIS-like structure, AETHER stands out with a completely redesigned round-update function. We replace the AES round function with a new inner function optimized for ultra-low latency and energy consumption. This function incorporates Orthros’s S-box and a 16x16 binary matrix from Akleylek et al., leading to a 1.56 times reduction in energy consumption and a 1.25 times reduction in delay compared to the AES round function. To further optimize hardware performance, we design the general construction of the round-update function to be more hardware-friendly, allowing parallel execution of the inner function on all 128-bit words, thereby enhancing both throughput and security against collision-based forgery attacks. AETHER achieves a throughput of 2.1 Tbit/s and an energy consumption of only 204.31 nJ in the Nangate 15 nm standard cell library, and a throughput of 5.23 Tbit/s and energy consumption of 1.83 nJ using the CNFET-OCL 5 nm library, outperforming all existing AEADs.

2025

TCHES

Algebraic Linear Analysis for Number Theoretic Transform in Lattice-Based Cryptography

Chun-Ming Chiu Jiaxiang Liu Ming-Hsien Tsai Xiaomu Shi Bow-Yaw Wang Bo-Yin Yang

The topic of verifying postquantum cryptographic software has never been more pressing than today between the new NIST postquantum cryptosystem standards being finalized and various countries issuing directives to switch to postquantum or at least hybrid cryptography in a decade. One critical issue in verifying lattice-based cryptographic software is range-checking in the finite-field arithmetic assembly code which occurs frequently in highly optimized cryptographic software. For the most part these have been handled by Satisfiability Modulo Theory (SMT) but so far they mostly are restricted to Montgomery arithmetic and 16-bit precision. We add semi-automatic range-check reasoning capability to the CryptoLine toolkit via the Integer Set Library (wrapped via the python package islpy) which makes it easier and faster to verify more arithmetic crypto code, including Barrett and Plantard finite-field arithmetic, and show experimentally that this is viable on production code.

2025

TCHES

All You Need is XOR-Convolution: A Generalized Higher-Order Side-Channel Attack with Application to XEX/XE-based Encryptions

Rei Ueno Akira Ito Yosuke Todo Akiko Inoue Kazuhiko Minematsu Hibiki Ishikawa Naofumi Homma

The XEX/XE scheme has been widely used to realize authenticated encryptions (AEs), message authentication codes (MACs), and storage encryptions, such as OCB, PMAC, and XTS. Although these schemes have been extensively deployed in the real world, limited studies have evaluated side-channel attacks (SCAs) on them. In this study, we propose an efficient SCA that can be applied to the XEX/XE scheme. Despite the fact that the offset generated in these modes is guaranteed to have no full offset collision with an overwhelming probability, we analyze their offset-generating routines to exploit the partial offset collisions. Then, we propose a new profiled SCA named XOR-convoluting collision analysis (XCCA), which estimates the sum of keys from two leakages by XOR-convoluting probability distributions that model the leakages. The proposed collision SCA effectively erases the effect of random offsets by using XOR-convolution, whereas conventional collision SCAs are ineffective in this scenario. We validated the proposed SCA through simulations and experimental attacks using real traces. The results confirmed that the proposed SCA reduces the number of traces by up to 90% to achieve a success rate identical to that of a state-of-the-art SCA on OCB in TCHES 2022. Furthermore, we show that the proposed SCA distinguisher (XCCA distinguisher) is a generalization of higher-order SCAs, including non-collision SCAs on masked implementations. The profiled higher-order SCAs on masked implementations can be written in the form of an XCCA distinguisher using XOR-convolution with the new concept of leaking and target selection functions. The generalized representation clarifies how and why a higher-order SCA has better or worse performance from the theoretical viewpoint of noise amplification, which is also demonstrated through experiments and a spectrum analysis based on Walsh–Hadamard transform (WHT). 
Our analysis reveals that the random offsets of XEX/XE would work as masking from an SCA perspective, and XEX/XE-based encryption would have an inherent first-order SCA resilience under certain conditions.
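The XOR-convolution and Walsh–Hadamard transform mentioned in this abstract are related by a standard identity: the distribution of X ⊕ Y for independent X and Y is the XOR-convolution of their distributions, and it can be computed through pointwise multiplication in the WHT domain. The sketch below illustrates only that identity; it is not the paper's XCCA distinguisher, and the function names are hypothetical.

```python
def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y  # butterfly step
        h *= 2
    return v

def xor_convolve(p, q):
    """r[k] = sum_i p[i] * q[i ^ k], computed via the WHT in O(n log n)."""
    n = len(p)
    P, Qt = fwht(p), fwht(q)
    r = fwht([a * b for a, b in zip(P, Qt)])
    return [x / n for x in r]  # the WHT is its own inverse up to a factor of n

def xor_convolve_direct(p, q):
    """Quadratic-time reference implementation of the same definition."""
    n = len(p)
    return [sum(p[i] * q[i ^ k] for i in range(n)) for k in range(n)]
```

If `p` and `q` are probability vectors over byte values, `xor_convolve(p, q)` gives the distribution of the XOR of the two underlying values, which is the kind of combination the abstract describes for two leakage models.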

2025

TCHES

All-You-Can-Compute: Packed Secret Sharing for Combined Resilience

Sebastian Faust Maximilian Orlt Kathrin Wirschem Liang Zhao

Unprotected cryptographic implementations are vulnerable to implementation attacks, such as passive side-channel attacks and active fault injection attacks. Recently, countermeasures like polynomial masking and duplicated masking have been introduced to protect implementations against combined attacks that exploit leakage and faults simultaneously. While duplicated masking requires O(t · e) shares to resist an adversary capable of probing t values and faulting e values, polynomial masking requires only O(t + e) shares, which is particularly beneficial for affine computation. At CHES’24, Arnold et al. showed how to further improve the efficiency of polynomial masking in the presence of combined attacks by embedding two secrets into one polynomial sharing. This essentially reduces the complexity of previous constructions by half. The authors also observed that using techniques from packed secret sharing (Grosso et al., CHES’13) cannot easily achieve combined resilience to encode an arbitrary number of secrets in one polynomial encoding. In this work, we resolve these challenges and show that it is possible to embed an arbitrary number of secrets in one encoding and propose gadgets that are secure against combined attacks. We present two constructions that are generic and significantly improve the computational and randomness complexity of existing compilers, such as the laOla compiler presented by Berndt et al. at CRYPTO’23 and its improvement by Arnold et al. For example, for an AES evaluation that protects against t probes and e faults, we improve the randomness complexity of the state-of-the-art construction when t + e > 3, leading to an improvement of up to a factor of 2.41.

2025

TCHES

Avengers assemble! Supervised learning meets lattice reduction: A single power trace attack against CRYSTALS-Kyber Key Generation

Pierre-Alain Fouque Damien Marion Quyen Nguyen Alexandre Wallet

In this paper, we attack Kyber’s key-generation algorithm using power analysis and lattice reduction. More specifically, we target the Centered Binomial Distribution (CBD) sampler which generates the secret data of the underlying Learning With Errors (LWE) instance. From a side-channel perspective, our attack uses a single trace, leveraging classifiers developed through supervised learning. We enhance the classification with the AdaBoost strategy, which provides more reliable results and exploitable statistics, enabling the identification of error-free classified samples. In optimal scenarios, our classifiers, combined with the outputted statistics, allow us to recover up to 68% of the secret key’s coefficients from the trace, ensuring that these recovered coefficients are error-free. In such cases, we show that the secret keys can be recovered by Gaussian elimination over a finite field in a few seconds. For less advantageous cases, we assess the block-size in lattice reduction that would complete the key recovery, providing fine-grained trade-offs between the correctly guessed proportion and the block-size, based on standard estimates. Finally, we conducted large-scale experiments, from power traces to secret key recovery (for most of the instances) under a threshold of 18 hours, targeting all three of Kyber’s security levels. Our average rate of success across all security levels is more than 96%.
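To make the attack target concrete, here is a minimal sketch of a centered binomial sampler in the style used by Kyber: each coefficient is the difference of two sums of η bits, so it lies in [−η, η]. The code is an illustrative model written for this listing, not the attacked implementation; the helper names and the little-endian bit order are assumptions.

```python
def bytes_to_bits(data):
    """Expand bytes into a list of bits, least significant bit first (assumed order)."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]

def cbd(data, eta, n=256):
    """Map 2*eta*n uniformly random bits to n coefficients in [-eta, eta]."""
    bits = bytes_to_bits(data)
    if len(bits) < 2 * eta * n:
        raise ValueError("not enough input bytes for the requested eta and n")
    coeffs = []
    for i in range(n):
        chunk = bits[2 * eta * i: 2 * eta * (i + 1)]
        # Difference of two Hamming weights of eta bits each.
        coeffs.append(sum(chunk[:eta]) - sum(chunk[eta:]))
    return coeffs
```

Because every output is drawn from only 2η + 1 possible values, a classifier has few classes to distinguish per coefficient, which helps explain why samplers of this kind are a recurring single-trace target.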

2025

TCHES

BASTION: A Framework for Secure Third-Party IP Integration in NoC-based SoC Platforms

Francesco Restuccia Zhenghua Ma Joseph Zuckerman Andres Meza Biruk Seyoum Luca Carloni Ryan Kastner

Modern System-on-Chip (SoC) architectures are a complex mix of processors, accelerators, memories, and I/O controllers interconnected by on-chip communication networks. Given the complexity of the computation and the requirements mandated in modern applications, several of these IPs are often outsourced as third-party modules. The integration of third-party modules, however, has been demonstrated to raise severe system-level security concerns – undiscovered vulnerabilities, incorrect firmware configurations, malicious code, and hardware trojans undetected in such IPs can produce leaks of confidential information and compromise the integrity of critical components. These challenges are further intensified when the communication infrastructure lacks robust mechanisms to supervise and monitor the interactions of third-party IPs with the rest of the system. Thus, runtime monitoring and supervising of third-party IPs is a crucial aspect for the system-level security of the entire SoC – the computing modules integrated in the SoC and their communication must behave securely. This paper presents Bastion, an open-source framework designed to support the secure integration of third-party IP modules into SoC architectures based on network-on-chip (NoC) communications, with a focus on providing robust security guarantees for NoC-based open-source hardware platforms. Unlike most previous works, which either focus on design or verification, we address the challenge of securely integrating third-party IPs in NoC-based platforms through a holistic design and verification framework based on three pillars: (i) a high-performance security socket that can be seamlessly integrated into NoC tiles; (ii) secure configuration and management of the security sockets via a Hardware Root of Trust; and (iii) an ad-hoc property-based security verification framework to ensure secure system operation. 
Bastion is integrated on the popular open-source ESP framework and validated through simulations and FPGA emulation of realistic SoCs. By explicitly targeting open-source platforms and releasing the entire project as open-source, we aim to democratize access to robustly secure application-specific SoC platforms for critical applications and foster further advancements in this domain.

2025

TCHES

Chameleon: A Dataset for Segmenting and Attacking Obfuscated Power Traces in Side-Channel Analysis

Davide Galli Giuseppe Chiari Davide Zoni

Side-channel attacks exploit unintended information leakage emitted by cryptographic devices to extract sensitive data. Hiding techniques are a cost-effective countermeasure designed to obfuscate the side-channel leakage and hinder these attacks. Available open datasets rely on artificial models to simulate hiding effects, preventing a realistic assessment of these countermeasures and, thus, leaving a pressing need for datasets offering real-world, obfuscated side-channel measurements. Chameleon introduces the first comprehensive dataset of real-world, obfuscated power traces collected from a RISC-V-based System-on-Chip. The traces are obfuscated using four state-of-the-art hiding techniques: dynamic frequency scaling, random delay, morphing, and chaffing. Chameleon captures real leakage deformations introduced by actual hardware implementations, making it a realistic and valuable tool for evaluating side-channel countermeasures. A key feature of Chameleon is its dual focus on the segmentation and attack stages of the side-channel analysis process. It is the first dataset designed to facilitate the challenging task of segmenting cryptographic operations from obfuscated traces, offering precise metadata that pinpoints the start and end of each operation. The high-quality metadata enables systematic research into segmentation techniques, a critical step often overlooked in previous datasets. Chameleon provides an essential platform for researchers to develop and test new side-channel attacks, highlighting the vulnerabilities of current hiding techniques. By offering a more realistic assessment of countermeasure effectiveness, Chameleon is an invaluable tool for advancing the state-of-the-art in side-channel evaluation.

2025

TCHES

CHERI-Crypt: Transparent Memory Encryption on Capability Architectures

Jennifer Jackson Minmin Jiang David Oswald

Capability architectures such as CHERI (Capability Hardware Enhanced RISC Instructions) are an emerging technology designed to provide memory safety protection at the hardware level and are equipped to eradicate approximately 70% of the current software vulnerability attack surface. CHERI is an instruction set architecture extension and has been applied to a small number of processors, including various versions of RISC-V. One of the benefits of CHERI is that it inherently provides segregation or compartmentalisation of software, making it suitable for supporting other types of applications such as Trusted Execution Environments, where sensitive data and computation are handled inside a secure enclave, away from the rest of the untrusted operating system and services. To prevent untrusted software from accessing these compartments or secure regions of memory, CHERI uses the mechanism of sealed capabilities. Trusted execution environments, however, have been proven vulnerable not just to software-based attacks but to hardware attacks as well. In this paper we present our CHERI-Crypt design, an encryption engine extension to a CHERI-RISC-V 32-bit processor, for transparent memory encryption of sealed CHERI capabilities to additionally protect sensitive data in memory against physical hardware attacks. We show that our CHERI-Crypt design can run an enclave test program within an encrypted CHERI seal and invoke process, requiring 626 additional clock cycles with a batch size of 32 bytes. Adding CHERI-Crypt reduces the maximum frequency of the base CPU by only 6 MHz, and requires approximately 3.5x more flip flops and LUTs.

2025

TCHES

Code-based Masking: From Fields to Bits Bitsliced Higher-Order Masked SKINNY

John Gaspoz Siemen Dhooghe

Masking is one of the most prevalent and investigated countermeasures against side-channel analysis. As an alternative to the simple (e.g., additive) encoding function of Boolean masking, a collection of more algebraically complex masking types has emerged. Recently, inner product masking and the more generic code-based masking have proven to enable higher theoretical security properties than Boolean masking. In CARDIS 2017, Poussier et al. connected this “security order amplification” effect to the bit-probing model, demonstrating that for the same shared size, sharings from more complex encoding functions exhibit greater resistance to higher-order attacks. Despite these advantages, masked gadgets designed for code-based implementations face significant overhead compared to Boolean masking. Furthermore, existing code-based masked gadgets are not designed for efficient bitslice representation, which is highly beneficial for software implementations. Thus, current code-based masked gadgets are constrained to operate over words (e.g., elements in F_{2^k}), limiting their applicability to ciphers where the S-box can be efficiently computed via power functions, such as AES. In this paper, we address the aforementioned limitations. We first introduce foundational masked linear and non-linear circuits that operate over bits of code-based sharings, ensuring composability and preserving bit-probing security, specifically achieving t-Probe Isolating Non-Interference (t-PINI). Utilizing these circuits, we construct masked ciphers that operate over bits, preserving the security order amplification effect during computation. Additionally, we present an optimized bitsliced masked assembly implementation of the SKINNY cipher, which outperforms Boolean masking in terms of randomness and gate count.
The third-order security of this implementation is formally proven and validated through practical side-channel leakage evaluations on a Cortex-M4 core, confirming its robustness against leakages up to one million traces.
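For readers unfamiliar with the baseline that this abstract compares against, the following is a minimal Python sketch of plain Boolean (XOR) masking only. It does not reproduce the paper's code-based sharings or its t-PINI gadgets. The helper names `share`, `unshare`, and `masked_xor` are hypothetical, and the choice of four shares is an assumption that matches the usual d + 1 shares for third-order security.

```python
import secrets
from functools import reduce
from operator import xor

def share(secret, num_shares, bits=8):
    """Split `secret` into `num_shares` Boolean shares whose XOR equals the secret."""
    rand = [secrets.randbits(bits) for _ in range(num_shares - 1)]
    last = reduce(xor, rand, secret)  # chosen so that all shares XOR back to the secret
    return rand + [last]

def unshare(shares):
    """Recombine Boolean shares by XORing them together."""
    return reduce(xor, shares, 0)

def masked_xor(a, b):
    """XOR is linear, so it can be applied share by share without recombining."""
    return [x ^ y for x, y in zip(a, b)]
```

Non-linear operations such as AND cannot be computed share by share in this way, which is where dedicated masked gadgets and fresh randomness come in.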

2025

TCHES

Constant time lattice reduction in dimension 4 with application to SQIsign Abstract

Otto Hanyecz Alexander Karenin Elena Kirshanova Péter Kutas Sina Schaeffler

In this paper we propose a constant-time lattice reduction algorithm for integral dimension-4 lattices. Motivated by its application in the SQIsign post-quantum signature scheme, we provide for the first time a constant-time LLL-like algorithm with guarantees on the length of the shortest output vector. We implemented our algorithm and ensured through various tools that it indeed operates in constant time. Our experiments suggest that in practice our implementation outputs a Minkowski-reduced basis and thus can replace a non-constant-time lattice reduction subroutine in SQIsign.

2025

TCHES

Constant-Cycle Hardware Private Circuits Abstract

Daniel Lammers Nicolai Müller Siemen Dhooghe Amir Moradi

The efficient implementation of Boolean masking with minimal overhead in terms of latency has become a critical topic due to the increasing demand for physically secure yet high-performance cryptographic primitives. However, achieving low latency in masked circuits while ensuring that glitches and transitions do not compromise their security remains a significant challenge. State-of-the-art multiplication gadgets, such as the recently introduced HPC4 (CHES 2024), offer composable security against glitches and transitions, as proven under the robust d-probing model. However, these gadgets require at least one clock cycle per computation, resulting in a latency overhead that increases with the algebraic degree. In contrast, LMDPL gadgets (CHES 2014 & CHES 2020) can achieve fixed latency independent of the algebraic degree, effectively addressing this issue. However, they are limited to two shares, and extending them to guarantee composable security at order d with d + 1 shares is considered an open challenge. In this work, we introduce Constant-Cycle Hardware Private Circuits (CCHPC), a novel hardware masking scheme built on the concept of LUT-based Masked Dual-Rail with Pre-charge Logic (LMDPL). Specifically, CCHPC achieves a fixed latency of d clock cycles by masking a Boolean function of arbitrary algebraic degree with d + 1 shares. CCHPC gadgets are secure and trivially composable, as formally proven under the Robust but Relaxed d-probing model (CHES 2024). Using CCHPC gadgets, we design a masked Advanced Encryption Standard (AES) encryption core which can be instantiated for an arbitrary number of d + 1 shares with a total latency of 11 + d clock cycles.

2025

TCHES

Cymric: Short-tailed but Mighty: Beyond-birthday-bound Secure Authenticated Encryption for Short Inputs Abstract

Alexandre Adomnicăi Wonseok Choi Yeongmin Lee Kazuhiko Minematsu Yusuke Naito

Authenticated encryption (AE) is a fundamental tool in today’s secure communication. Numerous designs have been proposed, including well-known standards such as GCM. While their performance for long inputs is excellent, that for short inputs is often problematic due to high overhead in computation, showing a gap between the real need for IoT-like protocols where packets are often very short. Existing dedicated short-input AEs are very scarce, the classical Encode-then-encipher (Bellare and Rogaway, Asiacrypt 2000) and Manx (Adomnicăi et al., CT-RSA 2023), using up to two block cipher calls. They have superior performance for (very) short inputs, however, security is up to n/2 bits, where n is the block size of the underlying block cipher. This paper proposes a new family of short-input AEs, dubbed Cymric, which ensure beyond-birthday-bound (BBB) security. It supports a wider range of input space than EtE and Manx with the help of one additional block cipher call (thus three calls). In terms of the number of block cipher calls, Cymric is the known minimum construction of BBB-secure AEs, and we also prove this is indeed minimal by presenting an impossibility result on BBB-secure AE with two calls. Finally, we show a comprehensive benchmark on microcontrollers to show performance advantage over existing schemes.

2025

TCHES

dCTIDH: Fast & Deterministic CTIDH Abstract

Fabio Campos Andreas Hellenbrand Michael Meyer Krijn Reijnders

This paper presents dCTIDH, a CSIDH implementation that combines two recent developments into a novel state-of-the-art deterministic implementation. We combine the approach of deterministic variants of CSIDH with the batching strategy of CTIDH, which shows that the full potential of this key space has not yet been explored. This high-level adjustment in itself leads to a significant speed-up. To achieve an effective deterministic evaluation in constant time, we introduce WOMBats, a new approach to performing isogenies in batches, specifically tailored to the behavior required for deterministic CSIDH using CTIDH batching. Furthermore, we explore the two-dimensional space of optimal primes for dCTIDH, with regard to both the performance of dCTIDH in terms of finite-field operations per prime and the efficiency of finite-field operations, determined by the prime shape, in terms of cycles. This allows us to optimize both for choice of prime and scheme parameters simultaneously. Lastly, we implement and benchmark constant-time, deterministic dCTIDH. Our results show that dCTIDH not only outperforms state-of-the-art deterministic CSIDH, but even non-deterministic CTIDH: dCTIDH-2048 is faster than CTIDH-2048 by 17%, and is almost five times faster than dCSIDH-2048.

2025

TCHES

Design and Implementation of a Physically Secure Open-Source FPGA and Toolchain Abstract

Sergej Meschkov Daniel Lammers Mehdi B. Tahoori Amir Moradi

The increasing prevalence of security breaches highlights the importance of robust hardware security measures. Among these breaches, physical attacks, such as Side-Channel Analysis (SCA) and Fault Injection (FI) attacks, pose a significant challenge for security-sensitive applications. To ensure robust system security throughout its lifecycle, hardware security updates are indispensable alongside software security patches. Programmable hardware plays a pivotal role in establishing a robust hardware root-of-trust, serving to effectively mitigate various hardware security threats. In this paper, we propose a methodology for the design of a reconfigurable fabric and the corresponding mapping toolchain, specifically tailored to hardware security. This approach offers resistance to various malicious physical attacks, including SCA and FI, addressing each threat individually. As a case study, we propose a resulting fabric that implements a combination of first-order Boolean Masking and hiding countermeasures to provide strong protection against SCA attacks and enables the detection of fault injection attempts. In particular, we present how reconfigurable secure gadgets can be realized employing a reformed variant of the LUT-based Masked Dual-Rail with Pre-charge Logic (LMDPL) hardware masking scheme and a modified version of Wave Dynamic Differential Logic (WDDL) to be composed into a fabric. We also show how any basic Hardware Description Language (HDL) design is automatically mapped to the primitives of our fabric, embedding provable hardware security, and bypassing the necessity for hardware security proficiency in this process. It is worth mentioning that our fabric requires approximately 85% less area to map a secure design compared to conventional Field Programmable Gate Arrays (FPGAs). A practical security evaluation of our secure fabric implementation on a real FPGA target board, using Test Vector Leakage Assessment (TVLA), demonstrated no SCA leakage over 100 million traces.

2025

TCHES

Designing a General-Purpose 8-bit (T)FHE Processor Abstraction Abstract

Daphné Trama Aymen Boudguiga Pierre-Emmanuel Clet Renaud Sirdey Nicolas Ye

Making the most of TFHE programmable bootstrapping to evaluate functions or operators otherwise challenging to perform with only the native addition and multiplication of the scheme is a very active line of research. In this paper, we systematize this approach and apply it to build an 8-bit FHE processor abstraction, i.e., a software entity that works over FHE-encrypted 8-bit data and presents itself to the programmer by means of a conventional-looking assembly instruction set. In doing so, we provide several homomorphic LookUp Table (LUT) dereferencing operators based on variants of the tree-based method and show that they are the most efficient option for manipulating encryptions of 8-bit data (optimally represented as two basis 16 digits). We then systematically apply this approach over a set of around 50 instructions, including, notably, conditional assignments, divisions, or fixed-point arithmetic operations. We further test the approach on several simple algorithms, including the execution of a neuron with a sigmoid activation function with 16-bit precision. We conclude the paper by comparing our work to the FHE compilers available in the state of the art. Finally, this work reveals that a very limited set of functional bootstrapping patterns is versatile and efficient enough to achieve general-purpose FHE computations beyond the boolean circuit approach. As such, these patterns may be an appropriate target for further works on advanced software optimizations or hardware implementations.

2025

TCHES

ECTester: Reverse-engineering side-channel countermeasures of ECC implementations Abstract

Vojtech Suchanek Jan Jancar Jan Kvapil Petr Svenda Łukasz Chmielewski

Developers implementing elliptic curve cryptography (ECC) face a wide range of implementation choices created by decades of research into elliptic curves. The literature on elliptic curves offers a plethora of curve models, scalar multipliers, and addition formulas, but this comes with the price of enabling attacks to also use the rich structure of these techniques. Navigating through this area is not an easy task and developers often obscure their choices, especially in black-box hardware implementations. Since side-channel attackers rely on the knowledge of the implementation details, reverse engineering becomes a crucial part of attacks. This work presents ECTester, a tool for testing black-box ECC implementations. Through various test suites, ECTester observes the behavior of the target implementation against known attacks but also non-standard inputs and elliptic curve parameters. We analyze popular ECC libraries and smartcards and show that some libraries and most smartcards do not check the order of the input points and improperly handle the infinity point. Based on these observations, we design new techniques for reverse engineering scalar randomization countermeasures that are able to distinguish between group scalar randomization, additive, multiplicative or Euclidean splitting. Our techniques do not require side-channel measurements; they only require the ability to set custom domain parameters, and are able to extract not only the size but also the exact value of the random mask used. Using the techniques, we successfully reverse-engineered the countermeasures on 13 cryptographic smartcards from 5 major manufacturers (all but one of those we tested). Finally, we discuss what mitigations can be applied to prevent such reverse engineering, and whether it is possible at all.

2025

TCHES

Efficient and Compact Full-Domain Functional Bootstrapping via Subring Folding Abstract

Han Xia Mingsheng Wang

Functional bootstrapping has emerged as a powerful tool in fully homomorphic encryption (FHE), integrating noise reduction and function evaluation. FHEW/TFHE-based functional bootstrapping has demonstrated high efficiency in evaluating arbitrary non-linear functions, such as typical activation functions in neural networks. Due to the algebraic properties of coefficient embedding over power-of-two cyclotomics, early constructions were limited to evaluating either a negacyclic function under full-domain encoding or an arbitrary function under half-domain encoding. To combine the advantages of arbitrary function evaluation and full-domain encoding, recent works have introduced various techniques to address the challenges posed by negacyclicity. Among these efforts, Xia et al. (TCC 2024) recently showed that this limitation can be entirely circumvented by evaluating the so-called equality test function. However, the substantial noise overhead in their theoretical framework presents a significant challenge toward concretely efficient implementations. In this work, we propose a new full-domain functional bootstrapping algorithm by refining the framework of Xia et al. Our approach introduces a new homomorphic equality test and incorporates several key insights into the Blind Rotation procedure (Chillotti et al., J. Cryptol. 2020), leveraging algebraic properties of power-of-two cyclotomics and their subrings. The resulting algorithm offers the following notable improvements over previous works: (1) one of the most compact bootstrapping key sizes due to significantly reduced noise growth, (2) the most efficient with parallelism enabled by the independence of all involved Blind Rotations, and (3) the ability to evaluate an unbounded number of functions over the same input ciphertext with minimal additional computational cost. Notably, our proof-of-concept implementation under IND-CPAD-secure parameters with multi-threaded CPU execution demonstrates that our new algorithm achieves a 1.4-1.6x speedup compared to the state-of-the-art, alongside a 1.5-2.1x reduction in key size for plaintext precision of 4-6 bits. Our method provides a promising pathway toward parallel-friendly full-domain functional bootstrapping, with considerable potential for further acceleration on hardware platforms such as GPUs, FPGAs, and ASICs.

2025

TCHES

Efficient Homomorphic Integer Computer from CKKS Abstract

Jaehyung Kim

As Fully Homomorphic Encryption (FHE) enables computation over encrypted data, it is a natural question of how efficiently it handles standard integer computations like 64-bit arithmetic. It has long been believed that the CGGI/DM family or the BGV/BFV family are the best options, depending on the size of the parallelism. The discrete variant of CKKS, suggested by Drucker et al. [J.Cryptol.’24], provides an interesting alternative for integer computations. Notably, the modular reduction framework proposed by Kim and Noh [CiC’25] built on top of the CKKS-style functional bootstrapping by Bae et al. [Asiacrypt’24] gives an efficient arithmetic modulo small integers. In this work, we propose a novel homomorphic computer for unsigned integer computations. We represent a large integer (e.g. 64-bit) as a vector of smaller chunks (e.g. 4-bit) and construct arithmetic operations relying on discrete CKKS. The proposed scheme supports many of the operations supported in TFHE-rs while outperforming it in terms of amortized running time. Notably, our homomorphic 64-bit multiplication takes 8.85ms per slot, which is more than three orders of magnitude faster than TFHE-rs.
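To make the chunked representation mentioned in this abstract concrete, here is a plaintext-only Python sketch of splitting a 64-bit unsigned integer into 4-bit chunks and adding two such vectors with carry propagation. No encryption is involved and none of the paper's discrete-CKKS machinery is reproduced. The names `to_chunks`, `from_chunks`, and `add_chunked` are hypothetical, and the ripple-carry strategy is an assumption chosen for clarity, not necessarily what the paper does.

```python
CHUNK_BITS = 4                      # example chunk width from the abstract
NUM_CHUNKS = 16                     # 16 chunks x 4 bits = 64 bits
MASK = (1 << CHUNK_BITS) - 1

def to_chunks(x):
    """Split a 64-bit unsigned integer into little-endian 4-bit chunks."""
    return [(x >> (CHUNK_BITS * i)) & MASK for i in range(NUM_CHUNKS)]

def from_chunks(chunks):
    """Reassemble an integer from little-endian chunks."""
    return sum(c << (CHUNK_BITS * i) for i, c in enumerate(chunks))

def add_chunked(a, b):
    """Add two chunk vectors modulo 2**64 using a ripple carry."""
    out, carry = [], 0
    for ca, cb in zip(a, b):
        s = ca + cb + carry
        out.append(s & MASK)        # low 4 bits stay in this chunk
        carry = s >> CHUNK_BITS     # overflow moves to the next chunk
    return out                      # the final carry is dropped (wraps mod 2**64)
```

Each chunk only ever holds a small value, which is the property that makes arithmetic modulo small integers a natural building block in this setting.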

2025

TCHES

Entropy extractor based high-throughput post-processings for True Random Number Generators Abstract

Yifan Dang Miloš Grujić Bohan Yang Wenping Zhu Hanning Wang Min Zhu Ingrid Verbauwhede Leibo Liu

In cryptographic systems, true random number generation is essential, as a compromised TRNG could lead to a security catastrophe. The raw random numbers are discrete values that are derived at discrete points in time from a noise source of a TRNG. These values often exhibit statistical defects that require post-processing, also called a conditioner, to improve uniformity. The two main types of post-processing are algorithmic post-processing and cryptographic post-processing, both of which have pros and cons in theory and in application. However, another type of post-processing existing between these two types, named entropy extractor, has often been overlooked by the applied cryptographic community. Therefore, we implement two information-theoretically provable entropy extractors, the Toeplitz extractor and the Trevisan extractor, catering to various performance requirements and applications of high-throughput TRNG post-processing. This paper proposes a combination of matrix chunking and FFT acceleration to boost the performance of the Toeplitz extractor, along with a modified Toeplitz matrix design to decrease the hardware consumption. In addition, we introduce a lightweight single-bit extractor to implement an efficient Trevisan extractor. Both algorithms are devised and verified through FPGA hardware simulations. The enhanced Toeplitz extractor achieves a throughput of 42 Gbps, while the Trevisan extractor attains 1.82 Gbps, representing an 84% and 73% improvement in throughput-to-area ratio over the previous best-performing design for each extractor. The standard statistical test suites, such as NIST SP800-22, NIST SP800-90B, and AIS-31, are adopted to evaluate the effectiveness of the proposed post-processing techniques. Naturally, this approach can only serve as a supplementary measure, as modern standards, such as AIS-31, necessitate formal analysis and stochastic models to account for randomness.
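As a rough illustration of what a Toeplitz extractor computes, the sketch below multiplies the raw bit vector by a Toeplitz matrix over GF(2), where the matrix is fully determined by a seed of n + m - 1 bits. This is the textbook bit-by-bit form only: the matrix chunking, FFT acceleration, and modified matrix design described in the abstract are not shown. The function name `toeplitz_extract` and the seed indexing convention are assumptions made for this example.

```python
def toeplitz_extract(raw_bits, seed_bits, out_len):
    """Multiply raw_bits by an out_len x n Toeplitz matrix over GF(2).

    Entry T[i][j] is seed_bits[i - j + n - 1], so it depends only on i - j,
    which is the defining property of a Toeplitz matrix.
    """
    n = len(raw_bits)
    if len(seed_bits) != n + out_len - 1:
        raise ValueError("seed must contain n + out_len - 1 bits")
    out = []
    for i in range(out_len):
        acc = 0
        for j in range(n):
            acc ^= seed_bits[i - j + n - 1] & raw_bits[j]  # AND, then XOR-accumulate
        out.append(acc)
    return out
```

For example, `toeplitz_extract([1, 0, 1, 1], [1, 0, 0, 1, 1], 2)` returns `[0, 1]`. The output length must be chosen below the min-entropy of the input for the extractor's guarantees to apply; that parameter choice is outside the scope of this sketch.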

2025

TCHES

Fast AVX-512 Implementation of the Optimal Ate Pairing on BLS12-381 Abstract

Hao Cheng Georgios Fotiadis Johann Großschädl Daniel Page

Non-degenerate bilinear maps on elliptic curves, commonly referred to as pairings, have many applications including short signature schemes, zero-knowledge proofs and remote attestation protocols. Computing a state-of-the-art pairing at the 128-bit security level, such as the optimal ate pairing over the curve BLS12-381, is very costly due to the high complexity of some of its sub-operations: most notable are the Miller loop and final exponentiation. In the past ten years, a few optimized pairing implementations have been introduced in the literature, but none of those took advantage of the vector (SIMD) extensions of state-of-the-art Intel and AMD CPUs, especially AVX-512; this is surprising, because doing so offers the potential to reach significant speed-ups. Consequently, the questions of 1) how computation of the optimal ate pairing can be effectively vectorized, and 2) what execution time such a vectorized implementation can achieve are still open. This paper addresses said questions by introducing a carefully-optimized AVX-512 implementation of the optimal ate pairing on BLS12-381. A central feature of the implementation is the use of 8-way Integer Fused Multiply-Add (IFMA) instructions, which are capable of executing eight 52 x 52-bit multiplications in a SIMD-parallel fashion. We introduce new vectorization strategies and describe optimizations of existing ones to speed up arithmetic operations in the extension fields F_{p^4}, F_{p^6}, and F_{p^12} as well as certain higher-level functions. Furthermore, we discuss some parallelization bottlenecks and how they impact execution time. We benchmarked our pairing software, which we call avxbls, on an Intel Core i3-1005G1 (“Ice Lake”) CPU and found that it needs 1,265,314 clock cycles (resp. 1,195,236 clock cycles) for the full pairing, with the Granger-Scott cyclotomic squaring (resp. compressed cyclotomic squaring) being used in the final exponentiation. For comparison, the non-vectorized (i.e., scalar) x64 assembly implementation from the widely-used blst library has an execution time of 2,351,615 cycles, which is 1.86 times (resp. 1.97 times) slower. avxbls also outperforms Longa’s implementation (CHES 2023) by almost the same factor. The practical importance of these results is amplified by Intel’s recent announcement to support AVX10, which includes IFMA instructions, in all future CPUs.

2025

TCHES

Faster amortized bootstrapping using the incomplete NTT for free Abstract

Thales B. Paiva Gabrielle De Micheli Syed Mahbub Hafiz Marcos A. Simplicio Jr. Bahattin Yildiz

Amortized bootstrapping techniques have been proposed for FHEW/TFHE to efficiently refresh multiple ciphertexts simultaneously within a polynomial modulus. Although recent proposals have very efficient asymptotic complexity, reducing the amortized cost essentially to Õ(1) FHE multiplications, the practicality of such algorithms still suffers from substantial overhead and high decryption failure rates (DFR). In this study, we improve upon one of the state-of-the-art amortized bootstrapping algorithms (Guimarães et al., ASIACRYPT 2023) for FHEW/TFHE-like schemes by introducing an alternative algorithmic strategy. Specifically, we combine Guimarães et al.’s strategy based on a two-part NTT with an incomplete Number Theoretic Transform (NTT) algorithm. The resulting construction is such that the multiplication of higher-degree polynomials that would usually create a bottleneck in an incomplete NTT setting actually comes for free. As a result, we demonstrate a 2.12x speedup compared to the algorithm of Guimarães et al. and a 1.12x improvement over the state-of-the-art (sequential) TFHE-rs while achieving a DFR close to 2^{-32} for 7-bit messages, although the DFR is higher for 8-bit messages. We also explore trade-offs between execution time and DFR, identifying parameter sets that improve the execution time of Guimarães et al. by 1.41x, while simultaneously reducing the DFR by a factor of 2^{-22} for 8-bit messages.

2025

TCHES

Fault Attacks on ECC Signature Verification Abstract

Kevin Schneider Lukas Auer Alexander Wagner

Signature verification operations used in secure boot or firmware updates are the foundation of trusted devices. ECC-based signature schemes are preferred for these applications due to their smaller key and signature sizes. Despite their widespread use, we would like to highlight that there is no research available that analyzes the resilience of ECC-based signature verification operations against fault attacks. Therefore, we thoroughly investigate the feasibility of fault attacks on ECC-based signature verification. We cover both theoretical and implementation-specific attacks. We demonstrate that faults in elliptic curve points and parameters allow an adversary to forge signatures in ECGDSA and ECSDSA, while ECDSA and EdDSA remain resilient. The weakness lies in the Weierstraß curves used in the affected schemes. This allows an adversary to perform cryptographic operations on much weaker curves by corrupting at least a single bit. To assess the severity in practice, we evaluate two open-source secure boot implementations—MCUboot and wolfBoot—that use fault injection hardening. Interestingly, these examples do not employ any hardening within the underlying cryptographic libraries. We discovered several attacks on the implementation of the ECDSA and EdDSA verification algorithms. Here, a single instruction skip is sufficient to accept trivially forged signatures. To improve these and future implementations, we propose effective and efficient countermeasures. Our work fills a critical gap to motivate further research for more resilient cryptographic implementations.

2025

TCHES

Fault Injection Evaluation with Statistical Analysis: How to Deal with Nearly Fabricated Large Circuits Abstract

Felix Uhle Nicolai Müller Amir Moradi

A critical aspect of securing cryptographic hardware is its resistance to Fault Injection (FI) attacks, which involve the successful injection of faults into the system in operation. Specifically, a hardware design must be resilient to well-established fault injection techniques, including voltage or clock glitching, laser fault injections, and the more recently introduced Electromagnetic Fault Injection (EMFI). Ideally, the protection level must be verified before the chip is fabricated. Although initial efforts to verify the resistance of hardware designs against fault injection have been made, analyzing the security of practical designs with realistic gate counts under fault injections that affect multiple gates or the entire circuit state remains a significant challenge. This scenario, however, is considered more realistic than assessing resistance to a fixed, relatively small number of faults. In this work, we introduce FIESTA, a versatile automated framework for analyzing the resistance of hardware circuits under the general random fault model. By leveraging a non-exhaustive approach, FIESTA is capable of evaluating larger designs compared to state-of-the-art tools, while maintaining a reasonable level of confidence. FIESTA supports various adversary models, allowing customized resistance analysis against specific adversaries. In particular, we present a concrete procedure for evaluating more realistic precise adversaries, based on practical observations. Using FIESTA, we assessed the resistance of several (protected) Advanced Encryption Standard (AES) cores.

2025

TCHES

FINAL bootstrap acceleration on FPGA using DSP-free constant-multiplier NTTs Abstract

Jonas Bertels Hilder V. L. Pereira Ingrid Verbauwhede

This work showcases Quatorze-bis, a state-of-the-art Number Theoretic Transform circuit for TFHE-like cryptosystems on FPGAs. It contains a novel design for modular multiplication by a constant under a constant modulus. This modular multiplication design does not require any DSP units or any dedicated multiplier unit, nor does it require extra logic when compared to the state-of-the-art modular multipliers. Furthermore, we present an implementation of a constant multiplier Number Theoretic Transform design for TFHE-like schemes. Lastly, we use this Number Theoretic Transform design to implement a FINAL hardware accelerator for the AMD Alveo U55c which improves the throughput metric of TFHE-like cryptosystems on FPGAs by a factor of 9.28x over Li et al.’s NFP CHES 2024 accelerator and by 10-25% over the absolute state-of-the-art design FPT [vBDTV23] while using one third of FPT’s DSPs.

2025

TCHES

FusionMSM: A Collision-Free and Arithmetic-Optimized FPGA-based Accelerator for Multi-Scalar Multiplication Abstract

Cheng Chen Gangqiang Yang Hongchao Zhou Hailiang Xiong Xianye Ben Zhiguo Wan

Zero-knowledge Proof (ZKP) is an effective cryptographic primitive that allows one party to verify the correctness of a given statement without disclosing any additional information. It plays a central role in applications such as blockchain transactions and cryptocurrencies. However, implementations of ZKP suffer from their most time-consuming task, Multi-Scalar Multiplication (MSM). Existing works and evaluation criteria primarily emphasize speed enhancement, but overlook optimizations of area overhead. In this paper, an FPGA-based accelerator, FusionMSM, is designed to not only reduce the overall latency but also improve area overhead. We attribute the bottleneck of MSM to a three-layer pyramid, including the finite field arithmetic, point operations on elliptic curves and scheduling. For modular arithmetic, we propose an efficient non-Montgomery modular multiplier by utilizing a hybrid multiplication strategy and optimizing multi-bit LUT-based modular reduction. It obtains 1.11x less area cost and a 2.00x speed-up versus the modular multipliers used in ZKP acceleration works. For point operations, we design a unified and fully pipelined point addition unit, which can run at 500 MHz, the highest frequency in the reported works. On top of that, we present a greedy mechanism to resolve potential collisions, which can reduce the idle cycles of the point addition unit and improve its utilization. As far as we know, FusionMSM achieves the best performance compared to other FPGA-based and ASIC-based works for input sizes from 2^{18} to 2^{26}. For the degree of 2^{20}, FusionMSM only needs 12.4% of the time of Hardcaml, 24.54% of the time of PipeMSM on FPGA, and 36.41% of the time of the ASIC-based work PipeZK. It also utilizes fewer resources, resulting in a 90.93% reduction in URAMs, a 35.24% reduction in FFs and a 47.59% reduction in CARRY8s. Compared to GPU-based implementations, FusionMSM delivers comparable performance but with a lower power of 24.5 W.

2025

TCHES

Generation of Fast Finite Field Arithmetic for Cortex-M4 with ECDH and SQIsign Applications Abstract

Felix Carvalho Rodrigues Décio Gazzoni Filho Gora Adj Isaac A. Canales-Martínez Jorge Chávez-Saab Julio López Michael Scott Francisco Rodríguez-Henríquez

Finite field arithmetic is central to several cryptographic algorithms on embedded devices like the ARM Cortex-M4, particularly for elliptic curve and isogeny-based cryptography. However, rapid algorithm evolution, driven by initiatives such as NIST’s post-quantum standardization, might frequently render hand-optimized implementations obsolete. We address this challenge with m4-modarith, a library generating C code with inline assembly for the Cortex-M4 that rivals custom-tuned assembly, enabling agile development in this ever-changing landscape. Our generated modular multiplications achieve fast performance, competitive with hand-optimized assembly implementations published in the literature, even outperforming some of them for Curve25519. Two contributions are pivotal to this success. First, we introduce a novel multiplication strategy that matches the memory access complexity of the operand caching method while being applicable to a larger cache size for Cortex-M4 implementations. Second, we generalize an efficient pseudo-Mersenne reduction strategy, and formally prove its correctness and applicability for most primes of cryptographic interest. Our generator allowed agile optimization of SQIsign’s NIST PQC Round 2 submission, improving level 1 verification from 123 Mcycles to only 54 Mcycles, a 2.3x speedup. As an additional case study, we use our generator to improve performance of portable implementations of RFC 7748 by up to 2.2x.
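The idea behind pseudo-Mersenne reduction can be sketched in a few lines of Python. For a prime p = 2^k - c with small c, the congruence 2^k ≡ c (mod p) lets the high part of a value be folded back into the low part using only a shift, a mask, and a small multiplication. This is the generic textbook form, not the generalized strategy or the generated Cortex-M4 code from the paper. The function name `pseudo_mersenne_reduce` is hypothetical, and the defaults k = 255 and c = 19 (the Curve25519 prime) are chosen only as a familiar example.

```python
def pseudo_mersenne_reduce(x, k=255, c=19):
    """Reduce a non-negative integer x modulo p = 2**k - c by folding."""
    p = (1 << k) - c
    mask = (1 << k) - 1
    while x >> k:                        # while any bits remain above position k
        x = (x & mask) + c * (x >> k)    # replace hi * 2**k with hi * c
    if x >= p:                           # x < 2**k here, so one subtraction suffices
        x -= p
    return x
```

Each fold strictly decreases x because c is smaller than 2^k, so the loop terminates. A production implementation would use a fixed number of folds on fixed-width limbs and constant-time code rather than a data-dependent loop and branch.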

2025

TCHES

HADES: Automated Hardware Design Exploration for Cryptographic Primitives Abstract

Fabian Buschkowski Georg Land Niklas Höher Jan Richter-Brockmann Pascal Sasdrich Tim Güneysu

While formal constructions for cryptographic schemes have steadily evolved and emerged over the past decades, the design and implementation of efficient and secure hardware instances are still mostly manual, tedious, and intuition-driven processes. With the increasing complexity of modern cryptography, e.g., Post-Quantum Cryptography (PQC) schemes, and consideration of physical implementation attacks, e.g., Side-Channel Analysis (SCA), the design space often grows exorbitantly without developers being able to weigh all design options. This emphasizes the evident necessity for tool-assisted Design Space Exploration (DSE) for efficient and secure cryptographic hardware. To address this demand, we present the HADES framework. This tool systematically traverses the design space driven by security requirements, rapidly predicts user-defined performance metrics, e.g., area footprint or cycle-accurate latency, and instantiates the most suitable candidate in a synthesizable Hardware Description Language (HDL). We demonstrate the capabilities of our framework by applying our proof-of-concept implementation to a wide-ranging selection of symmetric and PQC schemes, including the ChaCha20 stream cipher and the PQC standard Kyber. Notably, for these schemes, we present the first hardware implementations featuring arbitrary-order masking.

2025

TCHES

Higher-Order Time Sharing Masking Abstract

Dilip Kumar S. V. Siemen Dhooghe Josep Balasch Benedikt Gierlichs Ingrid Verbauwhede

At CHES 2024, Time Sharing Masking (TSM) was introduced as a novel low-latency masking technique for hardware circuits. TSM offers area and randomness efficiency, as well as glitch-extended PINI security, but it is limited to first-order security. We address this limitation and generalize TSM to higher-order security while maintaining all of TSM’s advantages. Additionally, we propose an area-latency tradeoff. We prove HO-TSM glitch-extended PINI security and successfully evaluate our circuits using formal verification tools. Furthermore, we demonstrate area- and latency-efficient implementations of the AES S-box, which do not exhibit leakage in TVLA on FPGA. Our proposed tradeoff enables a first-order secure implementation of a complete AES-128 encryption core with 92 kGE, 920 random bits per round, and 20 cycles of latency, which does not exhibit leakage in TVLA on FPGA.

2025

TCHES

HIPR: Hardware IP Protection through Low-Overhead Fine-Grain Redaction Abstract

Aritra Dasgupta Sudipta Paria Swarup Bhunia

Hardware intellectual property (IP) blocks have been subjected to various forms of confidentiality and integrity attacks in recent years due to the globalization of the semiconductor industry. System-on-chip (SoC) designers are now considering a zero-trust model for security, where an IP can be attacked at any stage of the manufacturing process for piracy, cloning, overproduction, or malicious alterations. Hardware redaction has emerged as a promising countermeasure to thwart confidentiality and integrity attacks by untrusted entities in the globally distributed supply chain. However, existing redaction techniques provide this security at high overhead costs, making them unsuitable for real-world implementation. In this paper, we propose HIPR, a fine-grain redaction methodology that is robust, scalable, and incurs significantly lower overhead compared to existing redaction techniques. HIPR redacts security-critical Boolean and sequential logic from the hardware design, performs interconnect randomization, and employs multiple overhead optimization steps to reduce overhead costs. We evaluate HIPR on open-source benchmarks and reduce area overheads by 1 to 2 orders of magnitude compared to state-of-the-art redaction techniques without compromising security. We also demonstrate that the redaction performed by HIPR is resilient against conventional functional and structural attacks on hardware IPs. The redacted test IPs used to evaluate HIPR are available at: https://github.com/UF-Nelms-IoT-Git-Projects/HIPR.

2025

TCHES

HRaccoon: A High-performance Configurable SCA Resilient Raccoon Hardware Accelerator Abstract

Ziying Ni Ayesha Khalid Zhaoyu Zhang Yijun Cui Weiqiang Liu Máire O’Neill

The lattice-based Raccoon scheme is one of the candidates in Round 1 of the National Institute of Standards and Technology (NIST) post-quantum cryptography (PQC) additional digital signatures standardization process. As a scheme with built-in masking features, Raccoon is also a viable candidate for NIST’s Masking Circuit and Threshold Cryptography project. Current Raccoon implementations are limited to software or software-hardware co-designs and consequently lack the high throughput that hardware implementations can generally promise. To close this gap, we are the first to propose a configurable and high-performance pure hardware architecture for Raccoon. The proposed FPGA architecture features extensive optimizations in key modules for Raccoon such as the modular reduction, polynomial operations, and sampling. The segmentation and loop-based scheduling scheme interacts with the defined BRAM-based memory access pattern to ensure efficient and coherent data flow under the three security levels and two masking modes (non- and first-order masking). Implementation results of Raccoon on an AMD Artix-7 FPGA device show that our proposed architecture achieves a 1.4–2.1x speedup compared to software implementations and a 20–42x speedup compared to software-hardware co-designs for the three security levels, despite its hardware area being comparable to that of the lightweight CRYSTALS-Dilithium architecture. Finally, a TVLA test is demonstrated on Raccoon-128 with non-masking and first-order masking to evaluate its resilience to side-channel attacks.

2025

TCHES

Improved Attacks Against Lattice-Based KEMs Using Hints From Hertzbleed Abstract

Zhiwei Li Jun Xu Jun Song Haomeng Xu Yan Jia Yanli Zou Lei Hu

The Number Theoretic Transform (NTT) is widely employed to accelerate computations in lattice-based cryptography. At CHES 2024, Yu et al. introduced a class of side-channel attacks targeting NTT operations in the simplified Kyber and NTTRU schemes. Their work demonstrated that side-channel leakages, modeled as modular hints, can reveal partial information about the private key. These modular hints were subsequently integrated into the Learning With Errors (LWE) or NTRU lattices to reduce the overall computational complexity of key recovery. However, their approach fails to fully exploit the potential of these modular hints. Our key observation is that these modular hints are sufficient to directly construct low-dimensional lattices, rather than integrating them into the original high-dimensional one. In this paper, for the simplified CPA-secure Kyber scheme, we directly utilize the extracted modular hints to construct low-dimensional lattices. Subsequently, the adversary leverages lattice reduction algorithms to search for non-zero shortest vectors within these lattices. Our experimental results confirm that the full private key can be recovered within 400 seconds on a personal computer. Therefore, our attack practically recovers the private key. However, the method proposed by Yu et al. at CHES 2024 cannot achieve this. Furthermore, for the CCA-secure NTTRU scheme, we extract additional modular hints based on the side-channel methodology proposed by Yu et al. We combine the special structure of the NTTRU private key with Gaussian elimination to generate low-dimensional lattices, and subsequently estimate the hardness of solving the non-zero Shortest Vector Problem using the estimation methodology adopted by Yu et al. The results indicate that we reduce the computational complexity of key recovery to 2^34, a significant improvement over the 2^114 computational complexity reported by Yu et al. at CHES 2024.

2025

TCHES

Improving MPCitH with Preprocessing: Mask Is All You Need Abstract

Guowei Liu Guoxiao Liu Kaijie Jiang Qingyuan Yu Keting Jia Puwen Wei Meiqin Wang

The MPC-in-the-head with preprocessing (MPCitH-PP) paradigm presents a novel approach for constructing post-quantum digital signatures like Picnic3. This paper revisits the MPCitH-PP construction, analyzing both its offline and online phases and proposing a reformulation of the protocol. By identifying redundant computations in these phases, we optimize them into a single phase, thereby enhancing the efficiency of MPCitH-PP. Furthermore, we explore the independence of the mask, demonstrating that it can be calculated in parallel, which also enables the optimization of the masked witness calculation. Our optimized implementation of Picnic3 shows significant improvements. At the L1 security level, the optimal software implementation reduces MPCitH-PP calculation time to about 30% of the previous implementation. The optimal signature implementation costs about 78% of the previous implementation time. At the L5 security level, MPCitH-PP with optimal parallelism is reduced to about 26% of the previous solution’s time, and the optimal signature implementation runs at about 53% of the previous solution’s time. For the hardware implementation, our optimizations reduce the clock cycles of MPCitH-PP from r sequential rounds to a single parallel round, where r denotes the number of rounds in the LowMC algorithm, with little change in hardware usage, and perform better in AT product, especially for parallel computing.

2025

TCHES

Information Theoretic Analysis of PUF-Based Tamper Protection Abstract

Georg Maringer Matthias Hiller

PUFs enable physical tamper protection for high-assurance devices without needing a continuous power supply that is active over the entire lifetime of the device. Several methods for PUF-based tamper protection have been proposed together with practical quantization and error correction schemes. In this work we take a step back from the implementation to analyze theoretical properties and limits. We apply zero leakage output quantization to existing quantization schemes and minimize the reconstruction error probability under zero leakage. We apply wiretap coding within a helper data algorithm to enable a reliable key reconstruction for the legitimate user while guaranteeing a selectable reconstruction complexity for an attacker, analogously to the security level for a cryptographic algorithm for the attacker models considered in this work. We present lower bounds on the achievable key rates depending on the attacker’s capabilities in the asymptotic and finite blocklength regime to give fundamental security guarantees even if the attacker gets partial information about the PUF response and the helper data. Furthermore, we present converse bounds on the number of PUF cells. Our results show for example that for a practical scenario one needs at least 459 PUF cells using 3 bit quantization to achieve a security level of 128 bit.

2025

TCHES

KeyVisor – A Lightweight ISA Extension for Protected Key Handles with CPU-enforced Usage Policies Abstract

Fabian Schwarz Jan Philipp Thoma Christian Rossow Tim Güneysu

The confidentiality of cryptographic keys is essential for the security of protection schemes used for communication, file encryption, and outsourced computation. Beyond cryptanalytic attacks, adversaries can steal keys from memory via software exploits or side channels, enabling them to, e.g., manipulate confidential information or impersonate key owners. Therefore, existing defenses protect keys in dedicated devices or isolated memory, or store them only in encrypted form. However, these designs often provide unfavorable tradeoffs, sacrificing performance, fine-grained access control, or deployability. In this paper, we present KeyVisor, a lightweight Instruction Set Architecture (ISA) extension that securely offloads the handling of symmetric crypto keys to the CPU. KeyVisor provides CPU instructions that enable applications to request protected key handles and perform AEAD cipher operations on them. The underlying keys are accessible only by KeyVisor, and thus never leak to memory. KeyVisor’s direct CPU integration enables fast crypto operations and hardware-enforced key usage restrictions, e.g., keys usable only for de-/encryption, with a limited lifetime, or with a process binding. Furthermore, privileged software, e.g., the monitor firmware of TEEs, can revoke keys or bind them to a specific process/TEE. We implement KeyVisor for RISC-V based on RocketChip, evaluate its performance, and demonstrate real-world use cases, including key-value databases, automotive feature licensing, and a read-only network middlebox.

2025

TCHES

KyberSlash: Exploiting secret-dependent division timings in Kyber implementations Abstract

Daniel J. Bernstein Karthikeyan Bhargavan Shivam Bhasin Anupam Chattopadhyay Tee Kiah Chia Matthias J. Kannwischer Franziskus Kiefer Thales B. Paiva Prasanna Ravi Goutam Tamvada

This paper presents KyberSlash1 and KyberSlash2 – two timing vulnerabilities in several implementations (including the official reference code) of the Kyber Post-Quantum Key Encapsulation Mechanism, recently standardized as ML-KEM. We demonstrate the exploitability of both KyberSlash1 and KyberSlash2 on two popular platforms: the Raspberry Pi 2 (Arm Cortex-A7) and the Arm Cortex-M4 microprocessor. Kyber secret keys are reliably recovered within minutes for KyberSlash2 and a few hours for KyberSlash1. We responsibly disclosed these vulnerabilities to maintainers of various libraries and they have swiftly been patched. We present two approaches for detecting and avoiding similar vulnerabilities. First, we patch the dynamic analysis tool Valgrind to allow detection of variable-time instructions operating on secret data, and apply it to more than 1000 implementations of cryptographic primitives in SUPERCOP. We report multiple findings. Second, we propose a more rigid approach to guarantee the absence of variable-time instructions in cryptographic software using formal methods.
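As a rough illustration of the kind of fix this class of vulnerability calls for, the sketch below replaces a division by the Kyber modulus with a multiplication by a precomputed reciprocal followed by a shift, and checks exhaustively that both versions agree. The function names and the constants `SHIFT` and `RECIP` are assumptions made for this example; they are not taken from the paper or from any patched library. Python itself gives no timing guarantees, so this only demonstrates the arithmetic equivalence.

```python
Q = 3329  # Kyber modulus

# Assumed parameters for this sketch (not taken from any library patch).
SHIFT = 32
RECIP = -(-(1 << SHIFT) // Q)  # ceil(2**SHIFT / Q)


def to_msg_bit_div(coeff):
    """Map a coefficient in [0, Q) to one bit using an integer division by Q."""
    return (((coeff << 1) + Q // 2) // Q) & 1


def to_msg_bit_mul(coeff):
    """Same mapping with the division replaced by multiply-and-shift."""
    return ((((coeff << 1) + Q // 2) * RECIP) >> SHIFT) & 1


def agree_everywhere():
    """Exhaustively compare both variants over every coefficient value."""
    return all(to_msg_bit_div(c) == to_msg_bit_mul(c) for c in range(Q))
```

Because the input range is small, the exhaustive comparison in `agree_everywhere()` is a cheap way to gain confidence in a chosen reciprocal before porting the idea to C or assembly.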

2025

TCHES

Leading Degree: A Metric for Model Performance Evaluation and Hyperparameter Tuning in Deep Learning-Based Side-Channel Analysis Abstract

Junfan Zhu Jiqiang Lu

Side-channel analysis benefits a lot from deep learning techniques, which assist attackers in recovering the secret key with fewer attack traces than before, but it remains a problem to precisely measure deep learning model performance, so as to obtain a high-performance model. Commonly used evaluation metrics for deep learning like accuracy and precision cannot well meet the demand due to their deviation in side-channel analysis, and classical evaluation metrics for side-channel analysis like guessing entropy, success rate and TGE1 are not generic because they effectively evaluate model performance in only one of the two situations: whether models manage to recover the secret key with given attack traces or not, and not efficient because they need to be performed multiple times to counteract randomness. To attain an effective generic side-channel evaluation metric, we investigate the deterministic component of power consumption, find that the elements of score vector under a key follow a linearly transformed chi-square distribution approximately, and some wrong key hypotheses usually with top scores provide great assistance in model performance evaluation, and finally we propose a new metric called Leading Degree (LD) as well as its simplified version LD-simplified for measuring model performance, which offers similar accuracy but much better generality and efficiency compared with the classical side-channel benchmark metric TGE1, and offers similar generality and efficiency but significantly better accuracy compared with recently proposed side-channel metrics like Label Correlation and Cross Entropy Ratio. LD/LD-simplified can be easily deployed in early stopping to avoid overfitting phenomena, and we build a bridge between LD/LD-simplified and TGE1, by observing an exponential relationship, which significantly shortens the estimating time for TGE1.
At last, we apply LD as a reward function to better solve the reward function design problem in reinforcement learning-based model hyperparameter tuning of side-channel analysis, and obtain better CNN model architectures compared with the state-of-the-art models obtained by previous hyperparameter tuning methods.

2025

TCHES

Leaky McEliece: Secret Key Recovery From Highly Erroneous Side-Channel Information Abstract

Marcus Brinkmann Chitchanok Chuengsatiansup Alexander May Julian Nowakowski Yuval Yarom

The McEliece cryptosystem is a strong contender for post-quantum schemes, including key encapsulation for confidentiality of key exchanges in network protocols. A McEliece secret key is a structured parity check matrix that is transformed via Gaussian elimination into an unstructured public key. We show that this transformation is highly critical with respect to side-channel leakage. We assume leakage of the elementary row operations during Gaussian elimination, motivated by McEliece implementations in the cryptographic libraries Classic McEliece and Botan. We propose a novel decoding algorithm to reconstruct a secret key from its public key with information from a Gaussian transformation leak. Even if the obtained side-channel leakage is extremely noisy, i.e., each bit is flipped with probability as high as r ≈ 0.4, we succeed in recovering the secret key in a matter of minutes for all proposed (Classic) McEliece instantiations. Remarkably, for high-security McEliece parameters, our attack is more powerful in the sense that it can tolerate even larger r. We demonstrate our attack on the constant-time reference implementation of Classic McEliece in a single-trace setting, using an STM32L592 ARM processor. Our result stresses the necessity of properly protecting highly structured code-based schemes such as McEliece against side-channel leakage.

2025

TCHES

LESS is Even More: Optimizing Digital Signatures from Code Equivalence Abstract

Luke Beckwith Andre Esser Edoardo Persichetti Paolo Santini Floyd Zweydinger

LESS is a signature scheme based on the code equivalence problem that has advanced to the second round of the NIST PQC standardization process. While promising, the scheme suffers from relatively large signatures and moderate to slow signing and verification times. Chou, Santini, and Persichetti recently introduced a variant of LESS relying on canonical forms to significantly reduce signature sizes. However, the overall performance impact of this approach remained largely unclear. In this work, we provide the first implementation of the new LESS variant and show that, in its original form, it performs poorly due to the overhead of computing canonical forms in a naïve way. We then introduce a series of algorithmic and implementation-level optimizations that reduce this overhead to about 10%, showing that the signature size reduction comes at minor cost. In addition, we present further improvements to the signature scheme as a whole, as well as a re-parameterization. The resulting scheme achieves speedups of 2.5x to 10x over the Round 1 NIST submission, while maintaining the reduced signature sizes.

2025

TCHES

Let us walk on the 3-isogeny graph: efficient, fast, and simple Abstract

Jesús-Javier Chi-Domínguez Eduardo Ochoa-Jiménez Ricardo-Neftalí Pontaza-Rodas

Constructing and implementing isogeny-based cryptographic primitives is an active research area. In particular, performing length-n isogeny walks over quadratic field extensions of F_p plays an exciting role in some constructions, including Hash functions, Verifiable Delay Functions, Key-Encapsulation Mechanisms, and generic proof systems for isogeny knowledge. Remarkably, many isogeny-based constructions, for efficiency, perform 2-isogenies through square root calculations. This work analyzes the idea of using 3-isogenies instead of 2-isogenies, which replaces the requirement of calculating square roots with cube roots. Performing length-m 3-isogenies allows shorter isogeny walks than when employing length-n 2-isogenies since a cube root calculation costs essentially the same as computing a square root, and we require 3^m ≈ 2^n to provide the same security level. We propose an efficient mapping from arbitrary supersingular Montgomery curves defined over F_{p^2} to the 3-isogeny curve model from Castryck, Decru, and Vercauteren (Asiacrypt 2020); a deterministic algorithm to compute all order-3 points on arbitrary supersingular Montgomery curves, and an efficient algorithm to compute length-m 3-isogeny chains. We improve the length-m 3-isogeny walks required by the KEM from Nakagawa and Onuki (CRYPTO 2024) by using our results and introducing more suitable parameter sets that are friendly with C-code implementations. In particular, our experiments illustrate an improvement between 26.41% and 35.60% in savings when calculating length-m 3-isogeny chains and using our proposed parameters instead of those proposed by Nakagawa and Onuki (CRYPTO 2024). Finally, we enhance the key generation of CTIDH-2048 by including radical 3-isogeny chains over the base field F_p, reducing the overhead of finding a 3-torsion basis as required in some instantiations of the CSIDH protocol.
Our experiments illustrate the advantage of radical 3-isogenies in the key generation of CTIDH-2048, with a speedup of up to 4 times over the original CTIDH.
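The security-matching relation between the two walk lengths can be made concrete with a short calculation. The value n = 256 below is an illustrative choice, not a parameter taken from the paper.

```latex
3^m \approx 2^n
\;\Longrightarrow\;
m \approx n\,\frac{\log 2}{\log 3} \approx 0.631\,n,
\qquad \text{e.g. } n = 256 \;\Rightarrow\; m \approx 162 .
```

A 3-isogeny walk therefore needs roughly 37% fewer steps than a 2-isogeny walk of matching security, which is where the saving comes from when a cube root costs about as much as a square root.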

2025

TCHES

Let’s DOIT: Using Intel’s Extended HW/SW Contract for Secure Compilation of Crypto Code Abstract

Santiago Arranz-Olmos Gilles Barthe Benjamin Grégoire Jan Jancar Vincent Laporte Tiago Oliveira Peter Schwabe

It is a widely accepted standard practice to implement cryptographic software so that secret inputs do not influence the cycle count. Software following this paradigm is often referred to as “constant-time” software and typically involves following three rules: 1) never branch on a secret-dependent condition, 2) never access memory at a secret-dependent location, and 3) avoid variable-time arithmetic operations on secret data. The third rule requires knowledge about such variable-time arithmetic instructions, or vice versa, which operations are safe to use on secret inputs. For a long time, this knowledge was based on either documentation or microbenchmarks, but critically, there were never any guarantees for future microarchitectures. This changed with the introduction of the data-operand-independent-timing (DOIT) mode on Intel CPUs and, to some extent, the data-independent-timing (DIT) mode on Arm CPUs. Both Intel and Arm document a subset of their respective instruction sets that are intended to leak no information about their inputs through timing, even on future microarchitectures if the CPU is set to run in a dedicated DOIT (or DIT) mode. In this paper, we present a principled solution that leverages DOIT to enable cryptographic software that is future-proof constant-time, in the sense that it ensures that only instructions from the DOIT subset are used to operate on secret data, even during speculative execution after a mispredicted branch or function return location. For this solution, we build on top of existing security type systems in the Jasmin framework for high-assurance cryptography. We then use our solution to evaluate the extent to which existing cryptographic software built to be “constant-time” is already secure in this stricter paradigm implied by DOIT and what the performance impact is to move from constant-time to future-proof constant-time.

2025

TCHES

Masking-Friendly Post-Quantum Signatures in the Threshold-Computation-in-the-Head Framework Abstract

Thibauld Feneuil Matthieu Rivain Auguste Warmé-Janville

Side-channel attacks pose significant threats to cryptographic implementations, which require the inclusion of countermeasures to mitigate these attacks. In this work, we study the masking of state-of-the-art post-quantum signatures based on the MPC-in-the-head paradigm. More precisely, we focus on the recent threshold-computation-in-the-head (TCitH) framework that applies to some NIST candidates of the post-quantum standardization process. We first provide an analysis of side-channel attack paths in the signature algorithms based on the TCitH framework. We then explain how to apply standard masking to achieve a d-probing secure implementation of such schemes, with performance scaling in O(d^2), for d the masking order. Our main contribution is to introduce different ways to tweak those signature schemes towards their masking friendliness. While the TCitH framework comes in two variants, the GGM variant and the Merkle tree variant, we introduce a specific tweak for each of these variants. These tweaks allow us to achieve complexities of O(d) and O(d log d) at the cost of non-constant signature size, caused by the inclusion of additional seeds in the signature. We also propose a third tweak that takes advantage of the threshold secret sharing used in TCitH. With the right choice of parameters, we show how, by design, some parts of the TCitH algorithms satisfy probing security without additional countermeasures. While this approach can substantially reduce the cost of masking in some part of the signature algorithm, it degrades the soundness of the core zero-knowledge proof, hence slightly increasing the size of the signature. We analyze the complexity of the masked implementations of our tweaked TCitH signatures and provide benchmarks on a RISC-V platform with built-in hash accelerator. We use a modular benchmarking approach, allowing us to estimate the performance of diverse signature instances with different tweaks and parameters.
Our results illustrate how the different variants scale for an increasing masking order. For instance, for a masking order d = 3, we obtain signatures of around 14 kB that run in 0.67 seconds on the target RISC-V CPU with a 250 MHz frequency. This is to be compared with the 4.7 seconds required by the original signature scheme masked at the same order on the same platform. For a masking order d = 7, we obtain a signature of 17.5 kB running in 1.75 seconds, to be compared with 16 seconds for the standard masked signature. Finally, we discuss the extension of our techniques to signature schemes based on the VOLE-in-the-Head framework, which shares similarities with the GGM variant of TCitH. One key takeaway of our work is that the Merkle tree variant of TCitH is inherently more amenable to efficient masking than frameworks based on GGM trees, such as TCitH-GGM or VOLE-in-the-Head.
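To give a feel for why standard masking tends to scale quadratically in the masking order d, the sketch below implements Boolean sharing with d + 1 shares and an ISW-style masked AND, which consumes d(d+1)/2 fresh random words per gate. This is a generic textbook gadget chosen for illustration; the abstract does not state which gadgets the paper uses, and the function names `share`, `unshare`, and `isw_and` are assumptions of this example.

```python
import secrets


def share(x, d, bits=8):
    """Split x into d + 1 Boolean shares whose XOR equals x."""
    mask = (1 << bits) - 1
    shares = [secrets.randbits(bits) for _ in range(d)]
    last = x & mask
    for s in shares:
        last ^= s
    return shares + [last]


def unshare(shares):
    """Recombine Boolean shares by XORing them together."""
    acc = 0
    for s in shares:
        acc ^= s
    return acc


def isw_and(a, b, bits=8):
    """ISW-style masked AND of two sharings.

    Returns the output sharing and the number of fresh random words used.
    """
    n = len(a)
    r = [[0] * n for _ in range(n)]
    fresh = 0
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = secrets.randbits(bits)  # fresh randomness per share pair
            fresh += 1
            r[j][i] = (r[i][j] ^ (a[i] & b[j])) ^ (a[j] & b[i])
    c = []
    for i in range(n):
        ci = a[i] & b[i]
        for j in range(n):
            if j != i:
                ci ^= r[i][j]
        c.append(ci)
    return c, fresh
```

For d = 3 the gadget draws 6 random words, and for d = 7 it draws 28, which shows the quadratic growth that the O(d) and O(d log d) tweaks aim to avoid.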

2025

TCHES

mid-pSquare: Leveraging the Strong Side-Channel Security of Prime-Field Masking in Software Abstract

Brieuc Balon Lorenzo Grassi Pierrick Méaux Thorben Moos François-Xavier Standaert Matthias Johann Steiner

Efficiently protecting embedded software implementations of standard symmetric cryptographic primitives against side-channel attacks has been shown to be a considerable challenge in practice. This is, in part, due to the most natural countermeasure for such ciphers, namely Boolean masking, not amplifying security well in the absence of sufficient physical noise in the measurements. So-called prime-field masking has been demonstrated to provide improved theoretical guarantees in this context, and the Feistel for Prime Masking (FPM) family of Tweakable Block Ciphers (TBCs) has been recently introduced by Grassi et al. (Eurocrypt’24) to efficiently leverage these advantages. However, it was so far only instantiated for and empirically evaluated in a hardware implementation context, by using a small (7-bit) prime modulus. In this paper, we build on the theoretical incentive to increase the prime field size to obtain improved side-channel (Faust et al., Eurocrypt’24) and fault (Moos et al., CHES’24) resistance, as well as on the practical incentive to instantiate an FPM instance with optimized performance on 32-bit software platforms. We introduce mid-pSquare for this purpose, a lightweight TBC operating over a 31-bit Mersenne prime field. We first provide an in-depth black-box security analysis with a particular focus on algebraic attacks – which, contrary to the cryptanalysis of instances over smaller primes, are more powerful than statistical ones in our setting. We also design a strong tweak schedule to account for potential related-tweak algebraic attacks which, so far, are almost unknown in the literature. We then demonstrate that mid-pSquare implementations deliver very competitive performance results on the target platform compared to analogous binary TBCs regardless of masked or unmasked implementation (we use fix-sliced SKINNY for our comparisons).
Finally, we experimentally establish the side-channel security improvements that masked mid-pSquare can lead to, reaching unmatched resistance to profiled horizontal attacks on lightweight 32-bit processors (ARM Cortex-M4).

2025

TCHES

MulLeak: Exploiting Multiply Instruction Leakage to Attack the Stack-optimized Kyber Implementation on Cortex-M4 Abstract

Fan Huang Xiaolin Duan Chengcong Hu Mengce Zheng Honggang Hu

CRYSTALS-Kyber, one of the NIST PQC standardization schemes, has garnered considerable attention from researchers in recent years for its side-channel security. Various targets have been explored in previous studies; however, research on extracting secret information from stack-optimized implementations targeting the Cortex-M4 remains scarce, primarily due to the lack of memory access operations, which increases the difficulty of attacks. This paper shifts the focus to the leakage of multiply instructions and presents a novel cycle-level regression-based leakage model for the following attacks. We target the polynomial multiplications in the decryption process of the stack-optimized implementation targeting the Cortex-M4, and propose two regression-based profiled attacks leveraging known ciphertext and chosen ciphertext methodologies to recover the secret coefficients individually. The latter can also be extended to the protected implementation. Our practical evaluation, conducted on the stack-optimized Kyber-768 implementation from the pqm4 repository, demonstrates the effectiveness of the proposed attacks. Focusing on the leakage from the pair-pointwise multiplication, specifically the macro doublebasemul_frombytes_asm, we successfully recover all secret coefficients with a success rate exceeding 95% using a modest number of traces for each attack. This research underscores the potential vulnerabilities in PQC implementations against side-channel attacks and contributes to the ongoing discourse on the physical security of cryptographic algorithms.

2025

TCHES

Multi-Value Plaintext-Checking and Full-Decryption Oracle-Based Attacks on HQC from Offline Templates Abstract

Haiyue Dong Qian Guo

The Hamming Quasi-Cyclic (HQC) key encapsulation mechanism (KEM), recently selected by NIST for standardization in the Post-Quantum Cryptography (PQC) process, distinguishes itself through its efficiency, robust design based on hard decoding problems in coding theory, and well-characterized decryption failure rates. Despite its selection, practical security concerns arise from implementation threats, particularly those exploiting plaintext-checking (PC) oracles. While multi-value PC (MV-PC) and full decryption (FD) oracle attacks have been extensively studied in the context of lattice-based cryptography, their applicability to code-based schemes like HQC has remained relatively unexplored. In this work, we present the first MV-PC and FD oracle attacks targeting code-based KEMs, specifically on HQC. Our MV-PC attack significantly reduces the required oracle queries compared to previous PC oracle-based methods and holds implications for side-channel, key-mismatch, and fault-injection attacks. Our FD attack exhibits remarkable efficiency in trace complexity, achieving secret key recovery for hqc-128 with just two queries to a perfect oracle, and four queries for hqc-192 and hqc-256. Simulations further demonstrate the robustness of our MV-PC and FD oracle attacks against imperfect oracle responses. We experimentally validate the new attacks on an ARM Cortex-M4 microcontroller, highlighting the critical need for robust countermeasures. In particular, on such a platform, substantial leakage during operations like syndrome computation poses significant challenges to the efficiency of masking techniques in mitigating FD oracle attacks.

2025

TCHES

New Quantum Cryptanalysis of Binary Elliptic Curves Abstract

Kyungbae Jang Vikas Srivastava Anubhab Baksi Santanu Sarkar Hwajeong Seo

This paper improves upon the quantum circuits required for Shor’s attack on binary elliptic curves. We present two types of quantum point addition, taking both qubit count and circuit depth into consideration. In summary, we propose an in-place point addition that improves upon the work of Banegas et al. from CHES’21, reducing the qubit count – depth product by more than 73% to 81%, depending on the variant. Furthermore, we develop an out-of-place point addition by using additional qubits. This method achieves the lowest circuit depth and offers an improvement of over 92% in the qubit count – quantum depth product (for a single step). To the best of our knowledge, our work improves on all previous works (including the CHES’21 paper by Banegas et al., the IEEE Access’22 paper by Putranto et al., and the CT-RSA’23 paper by Taguchi and Takayasu) in terms of circuit depth and qubit count – depth product. Equipped with the implementations, we discuss the post-quantum security of binary elliptic curve cryptography. Under the MAXDEPTH metric (proposed by the US government’s NIST), the quantum circuit with the highest depth in our work is 2^24, which is significantly lower than the MAXDEPTH limit of 2^40. For the gate count – full depth product, a metric for estimating quantum attack cost (proposed by NIST), the highest complexity in our work is 2^60 for the curve having degree 571 (which is comparable to AES-256 in terms of classical security), considerably below the post-quantum security level 1 threshold (of the order of 2^156).

2025

TCHES

On the Average Random Probing Model Abstract

Julien Béguinot Loïc Masure

Masking is one of the main countermeasures against side-channel analysis since it relies on provable security. In this context, “provable” means that a security bound can be exhibited for the masked implementation through a theoretical analysis in a given threat model. The main goal in this line of research is therefore to provide the tightest security bound, in the most realistic model, in the most generic way. Yet, all of these objectives cannot be reached together. That is why the masking literature has introduced a large spectrum of threat models and reductions between them, depending on the desired trade-off with respect to these three goals. In this paper, we focus on three threat models, namely the noisy-leakage model (realistic yet hard to work with), the random probing model (unrealistic yet easy to work with), and more particularly a third intermediate model called average random probing. Average random probing has been introduced by Dziembowski et al. at Eurocrypt 2015, in order to exhibit a tight reduction between noisy-leakage and random probing models, recently proven by Brian et al. at Eurocrypt 2024. This milestone has strong practical consequences, since otherwise the reduction from the noisy leakage model to the random probing model introduces a prohibitively high constant factor in the security bound, preventing security evaluators from using it in practice. However, we exhibit a gap between the average random probing definitions of Dziembowski et al. (denoted hereafter by DFS-ARP) and Brian et al. (simply denoted by ARP). Whereas any noisy leakage can be tightly reduced to DFS-ARP, we show in this paper that it cannot be tightly reduced to ARP, unless requiring extra assumptions, e.g., if the noisy leakage is deterministic. Our proof techniques do not involve more tools than the ones used so far in such reductions, namely basic probability facts, and known properties of the total variation distance.
As a consequence, the reduction from the noisy leakage to the random probing — without high constant factor — remains unproven. This stresses the need to clarify the practical relevance of analyzing the security of masking in the random probing model since most of the current efforts towards improving the constructions and their security proofs in the random probing model might be hindered by potentially unavoidable loss in the reduction from more realistic but currently less investigated leakage models.

2025

TCHES

On the Characterization of Phase Noise for the Robust and Resilient PLL-TRNG Design Abstract

Ziheng Ma Bohan Yang Wenping Zhu Hanning Wang Yi Ouyang Min Zhu Leibo Liu

A true random number generator (TRNG) is a critical component in ensuring the security of cryptographic systems. Among TRNG implementations, the phase-locked loop-based TRNG (PLL-TRNG) is a widely adopted solution for FPGA platforms due to the availability of a stochastic model. In the previous study, this stochastic model was based on analog noise signals, which potentially led to an oversimplification of the PLL physical process and resulted in an overestimation of entropy. To address this limitation, we extract key platform-specific parameters of the PLL and develop a new stochastic model tailored for multi-output PLL-TRNGs. For the first time, we reveal the effect of the PLL’s bandwidth on the correlation of sampling points and introduce a method for quantitatively controlling sampling point correlations. Finally, we validate the model through on-chip jitter measurements. Experimental results show that the proposed stochastic model accurately describes the behavior of the PLL-TRNG and provides the most conservative entropy lower bound, with a 1.8-fold improvement in jitter resolution.

2025

TCHES

Optimal Dimensionality Reduction using Conditional Variational AutoEncoder Abstract

Sana Boussam Mathieu Carbone Benoît Gérard Guénaël Renault Gabriel Zaid

The benefits of using Deep Learning techniques to enhance the performance of side-channel attacks have been demonstrated over recent years. Most of the work carried out since then focuses on discriminative models. However, one of their major limitations is the lack of theoretical results. Indeed, this lack of theoretical results, especially concerning the choice of neural network architecture to consider or the loss to prioritize to build an optimal model, can be problematic for both attackers and evaluators. Recently, Zaid et al. addressed this problem by proposing a generative model that bridges conventional profiled attacks and deep learning techniques, thus providing a model that is both explicable and interpretable. Nevertheless, the proposed model has several limitations: the architecture is too complex, higher-order attacks cannot be mounted, and desynchronization is not handled. In this paper, we address the first limitation, namely the architecture complexity, as without a simpler model the other limitations cannot be treated properly. To do so, we propose a new generative model that relies on solid theoretical results. This model is based on a conditional variational autoencoder and converges towards the optimal statistical model, i.e., it performs an optimal attack. By building on and extending the state-of-the-art theoretical works on dimensionality reduction, we integrate into this neural network an optimal dimensionality reduction, i.e., a dimensionality reduction that is achieved without any loss of information. This results in a gain of O(D), with D the dimension of traces, compared to the neural network of Zaid et al. in terms of architecture complexity, while at the same time enhancing the explainability and interpretability. In addition, we propose a new attack strategy based on our neural network, which reduces the attack complexity of generative models from O(N) to O(1), with N the number of generated traces.
We validate all our theoretical results experimentally using extensive simulations and various publicly available datasets covering symmetric, asymmetric pre and post-quantum cryptography implementations.

2025

TCHES

OPTIMSM: FPGA hardware accelerator for Zero-Knowledge MSM Abstract

Xander Pottier Thomas de Ruijter Jonas Bertels Wouter Legiest Michiel Van Beirendonck Ingrid Verbauwhede

The Multi-Scalar Multiplication (MSM) is the main barrier to accelerating Zero-Knowledge applications. In recent years, hardware acceleration of this algorithm on both FPGA and GPU has become a popular research topic and the subject of a multi-million dollar prize competition (ZPrize). This work presents OPTIMSM: Optimized Processing Through Iterative Multi-Scalar Multiplication. This novel accelerator focuses on the acceleration of the MSM algorithm for any Elliptic Curve (EC) by improving upon the Pippenger algorithm. A new iteration technique is introduced to decouple the required buckets from the window size, resulting in fewer EC computations for the same on-chip memory resources. Furthermore, we combine known optimizations from the literature for the first time to achieve additional latency improvements. Our enhanced MSM implementation significantly reduces computation time, achieving a speedup of up to 12.77x compared to recent FPGA implementations. Specifically, for the BLS12-381 curve, we reduce the computation time for an MSM of size 2^24 to 914 ms using a single compute unit on the U55C FPGA or to 231 ms using four U55C devices. These results indicate a substantial improvement in efficiency, paving the way for more scalable and efficient Zero-Knowledge proof systems.
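
For readers unfamiliar with the Pippenger (bucket) method that this abstract builds on, here is a minimal Python sketch of the textbook algorithm. It is not the OPTIMSM design. To keep the example runnable without a curve library, it assumes the elliptic-curve group is replaced by integers modulo a prime under addition. The names `MODULUS`, `group_add`, `msm_naive`, and `msm_pippenger` are hypothetical.

```python
# Toy Pippenger (bucket) multi-scalar multiplication.
# Assumption: the group is (Z_p, +), standing in for elliptic-curve points.

MODULUS = 2**61 - 1  # hypothetical toy group order


def group_add(a, b):
    """Group operation; a real MSM would use elliptic-curve point addition."""
    return (a + b) % MODULUS


def msm_naive(scalars, points):
    """Reference result: sum of k_i * P_i computed directly."""
    return sum(k * p for k, p in zip(scalars, points)) % MODULUS


def msm_pippenger(scalars, points, scalar_bits=32, window_bits=4):
    """Bucket method; scalars must be smaller than 2**scalar_bits."""
    num_windows = (scalar_bits + window_bits - 1) // window_bits
    mask = (1 << window_bits) - 1
    result = 0  # identity element of the toy group
    for w in reversed(range(num_windows)):
        # Shift the accumulator by one window: window_bits doublings.
        for _ in range(window_bits):
            result = group_add(result, result)
        # Sort points into buckets by their digit in this window.
        buckets = [0] * (mask + 1)
        for k, p in zip(scalars, points):
            digit = (k >> (w * window_bits)) & mask
            if digit:
                buckets[digit] = group_add(buckets[digit], p)
        # Running-sum trick: computes sum_j j * buckets[j] with additions only.
        running = 0
        window_sum = 0
        for j in range(mask, 0, -1):
            running = group_add(running, buckets[j])
            window_sum = group_add(window_sum, running)
        result = group_add(result, window_sum)
    return result
```

In this textbook form a c-bit window needs 2^c - 1 buckets, so the bucket count is tied to the window size; this is the coupling that the abstract says the new iteration technique removes.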

2025

TCHES

POTA: A Pipelined Oblivious Transfer Acceleration Architecture for Secure Multi-Party Computation Abstract

Li Xiaolin Yan Wei Liu Hongwei Zhang Yong Hao Qinfen Liu Yong Sun Ninghui

With the rapid development and deployment of machine learning (ML) and big data technologies, which rely heavily on sensitive user data for training and inference, ensuring privacy and data security has become a pressing challenge. Addressing this issue requires methods that safeguard sensitive information while maintaining the correctness of computational results. Secure multi-party computation (MPC), as a representative application of cryptographic techniques, offers a technical solution to this challenge by enabling privacy-preserving computations. It has been widely applied in scenarios such as cloud-based inference and other privacy-sensitive tasks. However, MPC also introduces significant performance overhead, thus limiting its further application. Our analysis reveals that the foundational element of MPC, the oblivious transfer (OT) protocol, accounts for up to 96.64% of the execution time. This is because OT protocols are constrained by low network bandwidth and weak compute engines. To address these challenges, we propose POTA, a high-performance pipelined OT hardware acceleration architecture supporting the silent OT protocol. In the POTA design, we develop efficient subsystems targeting the two most compute-intensive parts: the construction of the puncturable pseudorandom function (PPRF), and large matrix-vector multiplications under the learning parity with noise (LPN) assumption within the silent OT protocol. In addition, to address the performance overhead caused by data transfer between POTA and the host CPU, we design a host-accelerator execution pipeline to hide the considerable transmission latency. Furthermore, we design a modular multiplication module over a finite field to generate the more complex correlations required by MPC protocols. Finally, we implement a POTA prototype on Xilinx VCU129 FPGAs.
Experimental results demonstrate that under various network settings, POTA achieves significant speedups, with maximum improvements of 192.57x for basic operations and 597.57x for convolutional neural networks (CNN).

2025

TCHES

Practical Opcode-based Fault Attack on AES-NI Abstract

Xue Gong Xin Zhang Qianmei Wu Fan Zhang Junge Xu Qingni Shen Zhi Zhang

AES New Instructions (AES-NI) is a set of hardware instructions introduced by Intel to accelerate AES encryption and decryption, significantly improving efficiency across various cryptographic applications. While AES-NI effectively mitigates certain side-channel attacks, its resilience against faults induced by active or malicious fault injection remains unclear. In this paper, we conduct a comprehensive security analysis of AES-NI. By analyzing the opcodes of AES-NI, we identify six pairs of instructions with only a single-bit difference, making them susceptible to bit-flip-type attacks. This vulnerability allows attackers to recover AES keys in both Electronic Codebook (ECB) and Cipher Block Chaining (CBC) modes. We introduce a novel Opcode-based Fault Analysis (OFA) method, employing Gaussian elimination to reduce the search space of the last round key. In particular, with one pair of correct and faulty ciphertexts, OFA can reduce the key search space to 2^32 for a 128-bit key length. To further reduce the key space for AES-192 and AES-256, we propose the Enhanced Opcode-based Fault Analysis (EOFA), which, compared to exhaustive search, reduces the key space by factors of 2^160 and 2^192, respectively. Finally, we demonstrate the feasibility of our findings by conducting physical end-to-end attacks. Specifically, Rowhammer is leveraged to flip vulnerable opcodes, and the OFA and EOFA techniques are applied to recover secret keys from AES implementations. Our experimental results for AES-ECB-128, AES-ECB-192, and AES-CBC-128 demonstrate that key recovery can be efficiently achieved within 1.03 to 1.36 hours, varying with the cipher. This work highlights a critical vulnerability in AES-NI and outlines a novel pathway for fault-based attacks against modern cryptographic implementations.
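
To illustrate what a "single-bit difference" between opcodes means, the following Python sketch compares the final opcode byte of the four AES-NI round instructions, which share the `66 0F 38 xx /r` encoding in the Intel instruction set reference. It covers only this subset and is not the paper's analysis: the abstract does not list its six pairs, so none are reproduced here. The helper names `hamming_distance` and `single_bit_pairs` are hypothetical.

```python
from itertools import combinations

# Final opcode byte of the four AES-NI round instructions
# (legacy encoding 66 0F 38 xx /r). Only this subset is considered here.
ROUND_OPCODES = {
    "AESENC": 0xDC,
    "AESENCLAST": 0xDD,
    "AESDEC": 0xDE,
    "AESDECLAST": 0xDF,
}


def hamming_distance(a, b):
    """Number of bit positions in which two opcode bytes differ."""
    return bin(a ^ b).count("1")


def single_bit_pairs(opcodes):
    """Return the instruction pairs whose opcode bytes differ in exactly one bit."""
    return [
        (name_a, name_b)
        for (name_a, byte_a), (name_b, byte_b) in combinations(opcodes.items(), 2)
        if hamming_distance(byte_a, byte_b) == 1
    ]


if __name__ == "__main__":
    for name_a, name_b in single_bit_pairs(ROUND_OPCODES):
        print(f"{name_a} <-> {name_b}")
```

A single flipped bit in the stored instruction bytes can therefore turn, for example, a regular round instruction into its last-round variant, which is the kind of fault the abstract describes inducing with Rowhammer.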

2025

TCHES

pracy: A Practical Compiler for Attribute-Based Encryption in Python Abstract

Sven Argo Marloes Venema Adrian Ackermann Tim Güneysu

Attribute-based encryption (ABE) is a versatile primitive that has been considered in many applications to enforce access control cryptographically. To actually benefit from ABE in practice, we require implementations of schemes that satisfy all the properties that are needed. Many theoretical advancements have been made to attain such properties, ultimately resulting in powerful abstractions such as pair encodings. To build an ABE scheme, we use a compiler (in the theoretical sense), which transforms a provably secure pair encoding scheme into a provably secure ABE scheme. Although several such compilers have been introduced, they all abstract away many details that are relevant for engineers, which can hinder the implementation of schemes in practice. To address this problem, we propose pracy, which is a tool that automatically implements an ABE scheme from an input pair encoding scheme. To achieve this, we first note that we need to overcome a general issue in any automation efforts, including automated optimization and security analysis, in the field of pairing-based cryptography. In particular, there exist no parsers that properly model the interaction between the predicate and the pair encodings. Therefore, we devise a new formal model and type system, which capture this interaction in a way that is compatible with automated implementation efforts. To illustrate the feasibility of our model and system, we construct pracy, which is a (practical) compiler in Python that can implement ABE schemes in multiple target programming languages such as Python and C/C++. With pracy, we not only make the implementation of ABE schemes from pair encodings more accessible to practitioners, but also realize the potential that pair encodings have to simplify implementation efforts.

2025

TCHES

Primitive-Level vs. Implementation-Level DPA Security: a Certified Case Study: (Pleading for Standardized Leakage-Resilient Cryptography) Abstract

Charles Momin François-Xavier Standaert Corentin Verhamme

Implementation-level countermeasures like masking can be applied to any cryptographic algorithm in order to mitigate Differential Power Analysis (DPA). Leveraging re-keying with a Leakage-Resilient PRF (LR-PRF) is an alternative countermeasure that requires a change of primitive. Both options rely on different security mechanisms: signal-to-noise ratio amplification for masking, signal reduction for LR-PRFs. This makes their general comparison difficult and suggests the investigation of relevant case studies to identify when to use one or the other as an interesting research direction. In this paper, we provide such a case study and compare the security that can be obtained by using an unprotected hardware coprocessor, to be integrated into a leakage-resilient PRF, and a certified one, protected with implementation-level countermeasures. Both are available on “commercial off-the-shelf” devices and could be used for lightweight IoT applications. We first perform an in-depth analysis of these targets. It allows us to put forward the different evaluation challenges that they raise, and the similar to slightly better cost vs. security tradeoff that the leakage-resilient PRF offers in our experiments. We then discuss the advantages and limitations of both types of countermeasures. While there are contexts where the higher flexibility of masking is needed, we conclude that there are also applications that would strongly benefit from the simplicity of the LR-PRF’s design and evaluation. Positing that the lack of standards is the main impediment to their more widespread deployment, we therefore hope that our results can motivate such standardization efforts.

2025

TCHES

Probing Secure Composability Without Fresh Randomness: Theory and Application to Ascon Abstract

Vahid Jahandideh Bart Mennink Lejla Batina

Side-channel attacks (SCAs) pose a significant threat to the implementations of lightweight ciphers, particularly in resource-constrained environments where masking, the primary countermeasure, is constrained by tight resource limitations. This makes it crucial to reduce the resource and randomness requirements of masking schemes. In this work, we investigate an approach to minimize the randomness complexity of masking algorithms. Specifically, we explore the theoretical foundations of higher-order masking schemes that eliminate the need for online (fresh) randomness by relying solely on offline randomness present in the initial input shares. We demonstrate that round-based ciphers with linear diffusion layers can support such deterministic composition, where the diffusion layer acts as a refresh subcircuit. This ensures that, up to a threshold number, probes placed across rounds remain independent. Based on this observation, we propose composition theorems for probing-secure masking. On the practical side, we instantiate our framework using known deterministic first- and second-order masked S-boxes and provide software implementations of Ascon’s protected permutation.

2025

TCHES

Protection of Oscillator-Based PUFs against Side Channel Analyses by Random Interruption Abstract

Holger Mandry Jonas Ruchti Julius Nowakowski Michael Pehl Maurits Ortmanns

Oscillation-based physical unclonable functions (PUFs) are known to be sensitive to power trace side channel analyses (SCAs). Although previous work investigated countermeasures, these required a significant amount of additional hardware or were only able to obscure the sign information of a frequency comparison, while the magnitude information remained available to the attacker. As recent innovations in oscillation-based PUFs also require the magnitude information besides the sign, e.g., to increase the reliability, the need arises to protect both. We present a new protection approach to hide both sign and magnitude information of oscillation-based PUFs from an attacker. By introducing random interruptions in the oscillation, the power spectrum is blurred while the quality of the PUF is maintained. In addition to concept simulations and the discussion of different implementations, we use the example of a loop-PUF to show that the presented countermeasure can withstand several attack scenarios.

2025

TCHES

Pushing The Area Limit of Composable Gadgets: Low-Area Hardware Masked Circuits with Fewer Sources of Randomness Abstract

Chaoran Wang Meiqin Wang Guowei Liu Lixuan Wu Yanhong Fan

With the dramatic increase of easily accessible IoT devices, there is a growing demand to protect these cryptographic hardware implementations against Side-Channel Analysis (SCA) attacks. Among various proposed countermeasures against SCA, masking is widely adopted. Constructing a correct and secure masking hardware scheme is a challenging task, even for experienced engineers. Composable gadgets have recently been proposed to facilitate the process of masking large circuits by using the free composition property. For the composable gadget design, besides composability, minimizing hardware overhead in the overall composable masking scheme is also an important factor. To reduce the area overhead, we propose first- and second-order composable gadgets based on a ring circuit design, named OBS. The design of the ring circuit reduces the number of registers and sources of randomness, thereby reducing the area of the gadgets. From the perspective of composing large masked circuits, we propose several optimization methods based on the characteristics of ring circuits, such as register optimization, the frozen technique, and the bubble strategy. These optimization methods can further reduce the overall area of the masked circuit. Furthermore, we also provide the proof of the first- and second-order security of the OBS gadgets under the glitch- and transition-extended probe model. To show the area advantage of the OBS schemes, we give area comparison results with other schemes at the gadget level and the masked circuit level. The best optimization rate compared to the state-of-the-art can reach 40% for the AES S-box. The comparison results of different implementations show that our scheme outperforms various other composable masking schemes in terms of area overhead. We also use the formal verification tool SILVER and practical FPGA-based experiments to confirm the claimed first- and second-order security.

2025

TCHES

Quantum security analysis of Module-LWE PQC based on practical cost estimates Abstract

Seong-Min Cho Changyeol Lee Seung-Hyun Seo

The security of lattice-based cryptography relies on the computational complexity of solving the Shortest Vector Problem (SVP) on a high-dimensional lattice. Due to its efficacy in addressing SVP, lattice-based cryptographic systems have so far used the sieve algorithm to analyze their security. Previous works have analyzed the theoretical complexity improvement of the sieve algorithm in quantum computing environments, noting that Grover’s algorithm provides a quadratic speed-up for search problems. However, these works have solely focused on the theoretical analysis of query complexity, neglecting to present quantum circuit designs for sieves. Quantum circuit design and quantum resource estimation are necessary for practical analysis of the complexity of quantum sieves. Additionally, the cost of quantum error correction must also be considered, as quantum computation has a large number of errors. In this paper, we present quantum circuit designs for the sieve algorithm and provide estimates of the quantum resources required, including the number of gates and their depth. Furthermore, we evaluate the quantum sieve’s impact on the security level of ML-KEM and ML-DSA, comparing it to the classical sieve algorithm. We do this by evaluating the classical processing cost for quantum error correction using these estimates. Our results show that the quantum sieve algorithm does not break ML-KEM and ML-DSA, but it reduces their security level by 15 to 27 bits compared to the classical sieve.

2025

TCHES

REED: Chiplet-based Accelerator for Fully Homomorphic Encryption Abstract

Aikata Aikata Ahmet Can Mert Sunmin Kwon Maxim Deryabin Sujoy Sinha Roy

Fully Homomorphic Encryption (FHE) enables privacy-preserving computation and has many applications. However, its practical implementation faces massive computation and memory overheads. To address this bottleneck, several Application-Specific Integrated Circuit (ASIC) FHE accelerators have been proposed. All these prior works put every component needed for FHE onto one chip (monolithic), hence offering high performance. However, they encounter common challenges associated with large-scale chip design, such as inflexibility, low yield, and high manufacturing costs. In this paper, we present the first-of-its-kind multi-chiplet-based FHE accelerator ‘REED’ for overcoming the limitations of prior monolithic designs. To utilize the advantages of multi-chiplet structures while matching the performance of larger monolithic systems, we propose and implement several novel strategies in the context of FHE. These include a scalable chiplet design approach, an effective framework for workload distribution, a custom inter-chiplet communication strategy, and advanced pipelined Number Theoretic Transform and automorphism design to enhance performance. Our instruction-set and power simulation experiments with a pre-layout netlist indicate that the REED 2.5D microprocessor consumes 96.7 mm^2 of chip area and 49.4 W average power in 7 nm technology. It could achieve a remarkable speedup of up to 2,991x compared to a CPU (24-core 2xIntel X5690) and offer 1.9x better performance, along with a 50% reduction in development costs when compared to state-of-the-art ASIC FHE accelerators. Furthermore, our work presents the first instance of benchmarking an encrypted deep neural network (DNN) training. Overall, the REED architecture design offers a highly effective solution for accelerating FHE, thereby significantly advancing the practicality and deployability of FHE in real-world applications.

2025

TCHES

Rejected Signatures’ Challenges Pose New Challenges: Key Recovery of CRYSTALS-Dilithium via Side-Channel Attacks Abstract

Yuanyuan Zhou Weijia Wang Yiteng Sun Yu Yu

Rejection sampling is a crucial security mechanism in lattice-based signature schemes that follow the Fiat-Shamir with aborts paradigm, such as ML-DSA/CRYSTALS-Dilithium. This technique transforms secret-dependent signature samples into ones that are statistically close to a secret-independent distribution (in the random oracle model). While many side-channel attacks have directly targeted sensitive data such as nonces, secret keys, and decomposed commitments, fewer studies have explored the potential leakage associated with rejection sampling. Notably, at HOST 2021, Karabulut et al. showed that leakage from rejected signatures’ challenges can undermine, but not entirely break, the security of the Dilithium scheme. Motivated by the above, we convert the problem of key recovery (from the leakage of rejection sampling) to an integer linear programming problem (ILP), where rejected responses of unique Hamming weights set upper/lower constraints of the product between the challenge and the private key. We formally study the worst-case complexity of the problem as well as empirically confirm the practicality of the rejected signature’s challenge attack. For all three security levels of Dilithium-2/3/5, our attack recovers the private key in seconds or minutes with a 100% Success Rate (SR). Our attack leverages knowledge of the rejected signature’s challenge and response, and thus we propose methods to extract this information by exploiting single-trace side-channel leakage from Number Theoretic Transform (NTT) operations and functions associated with the response generation procedure. We demonstrate the practicality of this rejected signature’s challenge attack by using real power consumption on an ARM Cortex-M4 microcontroller. To the best of our knowledge, it is the first practical and efficient side-channel key recovery attack on ML-DSA/Dilithium that targets the rejection sampling procedure. Furthermore, we discuss some countermeasures to mitigate this security issue.

2025

TCHES

Rudraksh: A compact and lightweight post-quantum key-encapsulation mechanism Abstract

Suparna Kundu Archisman Ghosh Angshuman Karmakar Shreyas Sen Ingrid Verbauwhede

Resource-constrained devices such as wireless sensors and Internet of Things (IoT) devices have become ubiquitous in our digital ecosystem. These devices generate and handle a major part of our digital data. However, due to the impending threat of quantum computers to our existing public-key cryptographic schemes and the limited resources available on IoT devices, it is important to design lightweight post-quantum cryptographic (PQC) schemes suitable for these devices. In this work, we explored the design space of learning-with-errors-based PQC schemes to design a lightweight key-encapsulation mechanism (KEM) suitable for resource-constrained devices. We have done a scrupulous and extensive analysis and evaluation of different design elements, such as polynomial size, field modulus structure, reduction algorithm, and secret and error distribution of an LWE-based KEM. Our explorations led to the proposal of a lightweight PQC-KEM, Rudraksh, without compromising security. Our scheme provides security against chosen ciphertext attacks (CCA) with more than 100 bits of Core-SVP post-quantum security and belongs to the NIST-level-I security category (providing at least as much security as AES-128). We have also shown how ASCON can be used for lightweight pseudo-random number generation and hash function in the lattice-based KEMs instead of the widely used Keccak for lightweight design. Our FPGA results show that Rudraksh currently requires the least area among the PQC KEMs of similar security. Our implementation of Rudraksh provides a ~3x improvement in terms of the area requirement compared to the state-of-the-art area-optimized implementation of Kyber, can operate at 63%-76% higher frequency with respect to high-throughput Kyber, and improves time-area-product ~2x compared to the state-of-the-art compact implementation of Kyber published in HPEC 2022.

2025

TCHES

Scoop: An Optimization Algorithm for Profiling Attacks against Higher-Order Masking Abstract

Nathan Rousselot Karine Heydemann Loïc Masure Vincent Migairou

In this paper we provide new theoretical and empirical evidence that gradient-based deep learning profiling attacks (DL-SCA) suffer from masking schemes. This occurs through an initial stall of the learning process: the so-called plateau effect. To understand why, we derive an analytical expression of a DL-SCA model targeting simulated traces, which enables us to study an analytical expression of the loss. By studying the loss landscape of this model, we show that not only do the magnitudes of the gradients decrease as the order of masking increases, but the loss landscape also exhibits a prominent saddle point interfering with the optimization process. From these observations, we (1) propose the usage of a second-order optimization algorithm mitigating the impact of low-gradient areas. In addition, we show how to leverage the intrinsic sparsity of valuable information in SCA traces to better pose the DL-SCA problem. To do so, we (2) propose to use the implicit regularization properties of the sparse mirror descent. These propositions are gathered in a new publicly available optimization algorithm, Scoop. Scoop combines the second-order derivative of the loss function in the optimization process with a sparse stochastic mirror descent. We experimentally show that Scoop pushes further the current limitations of DL-SCA against simulated traces, and outperforms the state-of-the-art on the ASCADv1 dataset in terms of number of traces required to retrieve the key, perceived information and plateau length. Scoop also performs the first non-worst-case attack on the ASCADv2 dataset. On simulated traces, we show that using Scoop reduces the DL-SCA time complexity by the equivalent of one masking order.

2025

TCHES

SeaFlame: Communication-Efficient Secure Aggregation for Federated Learning against Malicious Entities Abstract

Jinling Tang Haixia Xu Huimei Liao Yinchang Zhou

Secure aggregation is a popular solution to ensuring privacy for federated learning. However, when considering malicious participants in secure aggregation, it is difficult to achieve both privacy and high efficiency. Therefore, we propose SeaFlame, a communication-efficient secure aggregation protocol against malicious participants. Inspired by the state-of-the-art work, ELSA, SeaFlame also utilizes two non-colluding servers to ensure privacy against malicious entities and provide defenses against boosted gradients. Crucially, to improve communication efficiency, SeaFlame uses arithmetic sharing together with arithmetic-to-arithmetic share conversion to reduce client communication, and uses the random linear combination to reduce server communication. Security analysis proves that our SeaFlame guarantees privacy against malicious clients colluding with one malicious server. Experimental evaluation demonstrates that, compared to ELSA, SeaFlame optimizes communication by up to 10.5, 6.00, and 8.17 times for a client, a server, and all entities, at the expense of 1.25-1.86 times additional end-to-end runtime.

2025

TCHES

Secure and efficient transciphering for FHE-based MPC Abstract

Diego F. Aranha Antonio Guimarães Clément Hoffmann Pierrick Méaux

Transciphering (or Hybrid-Homomorphic Encryption, HHE) is an established technique for avoiding ciphertext expansion in HE applications, saving communication and storage resources. Recently, it has also been shown to be a fundamental component in the practical construction of HE-based multi-party computation (MPC) protocols, being used both for input data and intermediary results (Smart, IMACC 2023). In these protocols, however, ciphers are used with keys that are jointly generated by multiple (possibly malicious) parties, which may require additional security assumptions that have been so far overlooked in the HHE literature. In this paper, we formalize this issue as a security against related-key attacks (RKA) problem and provide efficient solutions for it. We start by presenting an efficient method for homomorphically evaluating Mixed-Filter-Permutator (MFP) ciphers in leveled mode, enabling speedups of up to thousands of times compared to previous literature. For the multi-party scenario, we focus specifically on the Margrethe cipher (Hoffmann et al., INDOCRYPT 2023). We show that, contrary to other commonly used HHE ciphers (e.g. FLIP), Margrethe is out-of-the-box secure for any protocols that allow malicious parties to learn up to two related key streams, enabling security for the vast majority of static MPC protocols. For other cases, we quantify the loss of security based on the number of related key streams (which often depends on the number of malicious parties and specific protocol). Performance-wise, our implementation of Margrethe takes just 3.9 ms to transcipher 4-bit messages, being significantly faster than the state of the art in terms of latency.

2025

TCHES

Sharing the Mask: TFHE Bootstrapping on Packed Messages Abstract

Loris Bergerat Charlotte Bonte Benjamin R. Curtis Jean-Baptiste Orfila Pascal Paillier Samuel Tap

Fully Homomorphic Encryption (FHE) schemes typically experience significant data expansion during encryption, leading to increased computational costs and memory demands during homomorphic evaluations compared to their plaintext counterparts. This work builds upon prior methods aimed at reducing ciphertext expansion by leveraging matrix secrets under the Matrix-LWE assumption. In particular, we consider a ciphertext format referred to in this work as common mask (CM) ciphertexts, which comprises a shared mask and multiple message bodies. Each body encrypts a distinct message while reusing the common random mask. We demonstrate that all known FHEW/TFHE-style ciphertext variants and operations can be naturally extended to this CM format. Our benchmarks highlight the potential for amortizing operations using the CM structure, significantly reducing overhead. For instance, in the boolean setting, we have up to a 51% improvement when packing 8 messages. Beyond ciphertext compression and amortized evaluations, the CM format also enables the generalization of several core-TFHE operations. Specifically, we support applying distinct lookup tables on different encrypted messages within a single CM ciphertext and private linear operations on messages encrypted within the same CM ciphertext.
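
The common mask idea can be sketched with a toy LWE encryption in Python: one random mask is shared, and each message gets its own body computed under its own secret vector. This is an illustrative sketch only, not the paper's construction or parameter set. The parameters (`Q`, `P`, `N`, the noise bound) are insecure toy values chosen as assumptions, and the function names `keygen`, `encrypt_cm`, and `decrypt_cm` are hypothetical.

```python
import random

Q = 1 << 16        # ciphertext modulus (toy value, assumption)
P = 1 << 4         # plaintext modulus: messages are in [0, P)
DELTA = Q // P     # scaling factor placing the message in the top bits
N = 32             # mask length (toy value, far too small for security)


def keygen(num_messages, rng):
    """One binary secret vector per message slot (columns of a matrix secret)."""
    return [[rng.randrange(2) for _ in range(N)] for _ in range(num_messages)]


def encrypt_cm(messages, keys, rng, noise_bound=8):
    """Encrypt all messages under a single shared mask; return (mask, bodies)."""
    mask = [rng.randrange(Q) for _ in range(N)]
    bodies = []
    for m, secret in zip(messages, keys):
        inner = sum(a * s for a, s in zip(mask, secret))
        noise = rng.randint(-noise_bound, noise_bound)
        bodies.append((inner + noise + DELTA * m) % Q)
    return mask, bodies


def decrypt_cm(ciphertext, keys):
    """Recover each message by removing <mask, secret_i> and rounding."""
    mask, bodies = ciphertext
    messages = []
    for body, secret in zip(bodies, keys):
        inner = sum(a * s for a, s in zip(mask, secret))
        phase = (body - inner) % Q
        messages.append(((phase + DELTA // 2) // DELTA) % P)
    return messages


if __name__ == "__main__":
    rng = random.Random(1)
    msgs = [3, 0, 15, 7, 1, 9, 12, 5]
    keys = keygen(len(msgs), rng)
    ct = encrypt_cm(msgs, keys, rng)
    print(decrypt_cm(ct, keys) == msgs)
    # Size in ring elements: shared mask + bodies vs. one full ciphertext each.
    print(len(ct[0]) + len(ct[1]), "vs", len(msgs) * (N + 1))
```

With these toy values, packing 8 messages costs N + 8 = 40 elements instead of 8 * (N + 1) = 264 for separate ciphertexts, which illustrates the reduction in ciphertext expansion that the abstract describes.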

2025

TCHES

Shortcut2Secrets: A Table-based Differential Fault Attack Framework Abstract

Weizhe Wang Pierrick Méaux Deng Tang

Recently, Differential Fault Attacks (DFAs) have proven highly effective against stream ciphers designed for Hybrid Homomorphic Encryption (HHE). In this work, we present a table-based DFA framework called the shortcut attack, which generalizes the attack proposed by Wang and Tang on the cipher Elisabeth. The framework applies to a broad sub-family of ciphers following the Group Filter Permutator (GFP) paradigm and enhances previous DFAs by improving both the fault identification and path generation steps. Notably, the shortcut attack circumvents the issue of function representation, allowing successful attacks even when the cipher’s filter function cannot be represented over the ring it is defined on. Additionally, we provide complexity estimates for the framework and apply the shortcut attack to Elisabeth-4 and its patches. As a result, we optimize the DFA on Elisabeth-4, requiring fewer keystreams and running faster than previous methods. Specifically, we achieve a DFA that requires only 3000 keystreams, which is one-fifth of the previous best result. We also successfully mount a practical DFA on Gabriel-4 and provide a theoretical DFA for Elisabeth-b4. For the latest patch, Margrethe-18-4, which follows the more general Mixed Filter Permutator (MFP) paradigm, we present a DFA in a stronger model. To the best of our knowledge, these are the first DFA results on the patches of Elisabeth-4. Finally, we derive security margins to prevent shortcut attacks on a broad sub-family of MFP ciphers, which can serve as parameter recommendations for designers.

2025

TCHES

Sieving with Streaming Memory Access Abstract

Ziyu Zhao Jintai Ding Bo-Yin Yang

We implement an optimized BGJ (Becker–Gama–Joux 2015) sieve and analyze its behavior in a study of RAM access overheads (and their minimization) in sieving algorithms for large lattice problems. Both experiment and theory point to BGJ’s inherent structure being much more memory-efficient than the BDGL (Becker–Ducas–Gama–Laarhoven 2016) sieve, which uses asymptotically the fewest logical operations. In particular, a dimension-n BGJ sieve uses only 2^(0.2075n+o(n)) streaming (non-random) main memory accesses. A key insight: Bucket sizes decrease by orders of magnitude after each BGJ filtering layer, so that sub-buckets fit into successively much smaller (hence faster) storage areas. Our refined BGJ is competitive at cryptographic sizes and should outperform BDGL for all practically achievable dimensions. The above is corroborated by the results from our efficient CPU-based BGJ implementation in an optimized framework, which saves about 40% RAM footprint and is ≥ 24.5x more efficient gate-count-wise compared to the Ducas–Stevens–van Woerden 2021 4-GPU implementation, which like most prior sieving-based SVP computations is a HK3 (Herold–Kirshanova 2017) sieve. Notably, we solved the 183-dimensional SVP Darmstadt Challenge in 30 days on a 112-core server and 0.87 TB of RAM; similarly we also found a short vector in the 796-dimensional Ideal-SVP Challenge. Our implementation may offer further insights into the behavior of asymptotically “fast” sieving algorithms when applied to large-scale problems. Moreover, our refined cost estimation of SVP based on this implementation suggests that some NIST PQC candidates (e.g. Falcon-512) are not certain to meet NIST’s security requirements.

2025

TCHES

SimdMSM: SIMD-accelerated Multi-Scalar Multiplication Framework for zkSNARKs Abstract

Rui Jiang Cong Peng Min Luo Rongmao Chen Debiao He

Multi-scalar multiplication (MSM) is the primary building block in many pairing-based zero-knowledge proof (ZKP) systems. MSM at large scales has become the main bottleneck in ZKP implementations. Inspired by existing SIMD-accelerated work, we focus on accelerating MSM using SIMD instructions in a single-CPU environment. First, we propose a SIMD-accelerated MSM computing architecture with no write conflicts and constant memory overheads. This architecture utilizes multithreading to achieve task-level and loop-level parallelism and employs a three-tier buffer mechanism to maximize the utilization of the SIMD engine. Instantiated with AVX512-IFMA instructions, we implement six SIMD elliptic curve arithmetic engines for different point additions in three coordinate systems and two groups. Moreover, we integrate our AVX-MSM implementation into the libsnark library, naming it AVX-ZK. In more detail, point deduplication and “Three-Stage” memory optimization are proposed to address problems existing in practical applications. Based on the RELIC library, our performance results on the BLS12-381 curve show that our AVX-MSM achieves up to 27.86x speedup over the most popular Pippenger algorithm. Compared with libsnark, our AVX-ZK implementation achieves over 11.53x (up to 20.26x) speedup under standard benchmarks.

2025

TCHES

Skyscraper: Fast Hashing on Big Primes Abstract

Clémence Bouvier Lorenzo Grassi Dmitry Khovratovich Katharina Koschatko Christian Rechberger Fabian Schmid Markus Schofnegger

Arithmetic hash functions defined over prime fields have been actively developed and used in verifiable computation (VC) protocols. Among those, elliptic-curve-based SNARKs require large (256-bit and higher) primes. Such hash functions are notably slow, losing a factor of up to 1000 compared to regular constructions like SHA-2/3. In this paper, we present the hash function Skyscraper, which is aimed at large prime fields and provides major improvements compared to Reinforced Concrete and Monolith. First, the design is exactly the same for all large primes, which simplifies analysis and deployment. Secondly, it achieves a performance comparable to cryptographic hash standards by using low-degree non-invertible transformations and minimizing modulo reductions. Concretely, it hashes two 256-bit prime field (BLS12-381 curve scalar field) elements in 135 nanoseconds, whereas SHA-256 needs 42 nanoseconds on the same machine. The low circuit complexity of Skyscraper, together with its high native speed, should allow a substantial reduction in many VC scenarios, particularly in recursive proofs.

2025

TCHES

SoK: FHE-Friendly Symmetric Ciphers and Transciphering Abstract

Chao Niu Benqiang Wei Zhicong Huang Zhaomin Yang Cheng Hong Meiqin Wang Tao Wei

Fully Homomorphic Encryption (FHE) enables computation on encrypted data without decryption, demonstrating significant potential for privacy-preserving applications. However, FHE faces several challenges, one of which is the significant plaintext-to-ciphertext expansion ratio, resulting in high communication overhead between client and server. The transciphering technique can effectively address this problem by first encrypting data with a space-efficient symmetric cipher, then converting symmetric ciphertext to FHE ciphertext without decryption. Numerous FHE-friendly symmetric ciphers and transciphering methods have been developed by researchers, each with unique advantages and limitations. These often require extensive knowledge of both symmetric cryptography and FHE to fully grasp, making comparison and selection among these schemes challenging. To address this, we conduct a comprehensive survey of over 20 FHE-friendly symmetric ciphers and transciphering methods, evaluating them based on criteria such as security level, efficiency, and compatibility. We have designed and executed experiments to benchmark the performance of the feasible combinations of symmetric ciphers and transciphering methods across various application scenarios. Our findings offer insights into achieving efficient transciphering tailored to different task contexts. Additionally, we make our example code available open-source, leveraging state-of-the-art FHE implementations.

2025

TCHES

Tailorable codes for lattice-based KEMs with applications to compact ML-KEM instantiations Abstract

Thales B. Paiva Marcos A. Simplicio Jr Syed Mahbub Hafiz Bahattin Yildiz Eduardo L. Cominetti Henrique S. Ogawa

Compared to elliptic curve cryptography, a primary drawback of lattice-based schemes is the larger size of their public keys and ciphertexts. A common procedure for compressing these objects consists essentially of dropping some of their least significant bits. Albeit effective for compression, there is a limit to the number of bits to be dropped before we get a noticeable decryption failure rate (DFR), which is a security concern. To address this issue, this paper presents a family of error-correction codes that, by allowing an increased number of dropped bits while preserving a negligible DFR, can be used for both ciphertext and public-key compression in modern lattice-based schemes. To showcase the impact and practicality of our proposal, we use the highly optimized ML-KEM, a post-quantum lattice-based scheme recently standardized by NIST. We provide detailed procedures for tailoring our codes to ML-KEM’s specific noise distributions, and show how to analyze the DFR without independence assumptions on the noise coefficients. Among our results, we achieve between 4% and 8% ciphertext compression for ML-KEM. Alternatively, we obtain 8% shorter public keys compared to the current standard. We also present isochronous implementations of the decoding procedure, achieving negligible performance impact in the full ML-KEM decapsulation even when considering optimized implementations for AVX2, Cortex-M4, and Cortex-A53.
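
To make the bit-dropping step concrete, the sketch below reproduces the coefficient compression and decompression maps of standard ML-KEM (modulus 3329, as specified in FIPS 203) and measures the worst-case rounding error they introduce. It only illustrates the baseline compression that the abstract refers to; it does not implement the paper's error-correction codes, and the helper names are illustrative.

```python
Q = 3329  # ML-KEM modulus

def compress(x, d):
    """Map x in [0, Q) to d bits: round((2^d / Q) * x) mod 2^d."""
    return ((2 * (x << d) + Q) // (2 * Q)) % (1 << d)

def decompress(y, d):
    """Map a d-bit value back to [0, Q): round((Q / 2^d) * y)."""
    return (Q * y + (1 << (d - 1))) >> d

def centered_distance(a, b):
    """Distance between a and b on the circle Z_Q."""
    diff = (a - b) % Q
    return min(diff, Q - diff)

def max_error(d):
    """Worst-case error over all coefficients after a compress/decompress round trip."""
    return max(centered_distance(x, decompress(compress(x, d), d)) for x in range(Q))

def error_bound(d):
    """round(Q / 2^(d+1)), the bound on the round-trip error."""
    return (Q + (1 << d)) >> (d + 1)
```

For example, keeping d = 10 bits per coefficient bounds the error by 2, while keeping only d = 4 bits allows an error of up to 104. Dropping more bits shrinks the ciphertext but enlarges this error, which is what eventually raises the DFR.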

2025

TCHES

TESLA: Trusted Execution Support for Legacy Embedded Applications Abstract

Saltanat Firdous Allaqband Asutosh Brahma Sai Venkata Krishnan V Arjun Menon Chester Rebeiro

Legacy applications continue to be widely used in embedded systems, despite high maintenance costs, primarily due to the challenges involved in modifying them. Traditional Trusted Execution Environments (TEEs), though valuable for securing sensitive computations, fall short in supporting these legacy workloads. Most existing TEEs require significant application modifications, or incur high system call overheads. Additionally, TEEs often enforce fixed enclave sizes, failing to accommodate the dynamic memory needs of applications. Many do not consider the security of I/O operations, and those that do expand the Trusted Computing Base (TCB) significantly, weakening the TEE. We present TESLA, a novel TEE architecture designed to natively support the execution of unmodified legacy applications on embedded systems. TESLA introduces Fluid Enclaves, which dynamically adjust enclave sizes based on the application’s runtime memory requirements. To minimize system call overheads, TESLA introduces Enclave Windows that permit an untrusted Operating System temporary access to system call parameters within the enclave. TESLA also ensures confidentiality and integrity of I/O data exchanged between enclaves and peripherals. We have implemented a prototype of TESLA on a RISC-V processor running the Linux kernel, synthesizing it on an FPGA to demonstrate its feasibility. The evaluation quantifies the hardware and runtime performance overheads, demonstrating TESLA’s practicality and effectiveness in overcoming key limitations of existing TEEs.

2025

TCHES

TFHE Gets Real: an Efficient and Flexible Homomorphic Floating-Point Arithmetic Abstract

Loris Bergerat Ilaria Chillotti Damien Ligier Jean-Baptiste Orfila Samuel Tap

Floating-point arithmetic plays a central role in computer science and is used in various domains where precision and computational scale are essential. One notable application is in machine learning, where Fully Homomorphic Encryption (FHE) can play a crucial role in safeguarding user privacy. In this paper, we focus on TFHE and develop novel homomorphic operators designed to enable the construction of precise and adaptable homomorphic floating-point operations. Integrating floating-point arithmetic within the context of FHE is particularly challenging due to constraints such as small message space and the lack of information during computation. Despite these challenges, we were able to determine parameters for common precisions (e.g., 32-bit, 64-bit) and achieve remarkable computational speeds, with 32-bit floating-point additions completing in 2.5 seconds and multiplications in approximately 1 second in a multi-threaded environment. These metrics provide empirical evidence of the efficiency and practicality of our proposed methods, which significantly outperform previous efforts. Our results demonstrate a significant advancement in the practical application of FHE, making it more viable for real-world scenarios and bridging the gap between theoretical encryption techniques and practical usability.

2025

TCHES

ToFA: Towards Fault Analysis of GIFT and GIFT-like Ciphers Leveraging Truncated Impossible Differentials Abstract

Anup Kumar Kundu Shibam Ghosh Aikata Aikata Dhiman Saha

In this work, we introduce ToFA, the first fault attack (FA) strategy that attempts to leverage the classically well-known idea of impossible differential cryptanalysis to mount practically verifiable attacks on bit-oriented ciphers like GIFT and BAKSHEESH. The idea stems from the fact that truncated differential paths induced due to fault injection in certain intermediate rounds of the ciphers lead to active SBox-es in subsequent rounds whose inputs admit specific truncated differences. This leads to a (multi-round) impossible differential distinguisher, which can be incrementally leveraged for key-guess elimination via partial decryption. The key-space reduction further exploits the multi-round impossibility, capitalizing on the relations due to the quotient-remainder (QR) groups of the GIFT and BAKSHEESH linear layer, which increases the filtering capability of the distinguisher. Moreover, the primary observations made in this work are independent of the actual SBox. Clock glitch based fault attacks were mounted on 8-bit implementations of GIFT-64/GIFT-128 using a ChipWhisperer Lite board on an 8-bit ATXmega128D4-AU micro-controller. Unique key recovery was achieved for GIFT-128 with 3 random byte faults, while for GIFT-64, key space was reduced to 2^32, the highest achievable for GIFT-64, with a single level fault due to its key-schedule. To the best of our knowledge, this work also reports the highest fault injection penetration for any variant of GIFT and BAKSHEESH. Finally, this work reiterates the role of classical cryptanalysis strategies in fault vulnerability assessment by showcasing the most efficient fault attacks on GIFT.

2025

TCHES

TREE: Bridging the gap between reconfigurable computing and secure execution Abstract

Sérgio Pereira Tiago Gomes Jorge Cabral Sandro Pinto

Trusted Execution Environments (TEEs) have become a pivotal technology for securing a wide spectrum of security-sensitive applications. With modern computing systems shifting to heterogeneous architectures, integrating TEE support into these systems is paramount. One promising line of research proposes leveraging FPGA technology to provide TEE solutions. Despite their potential, current implementations of FPGA-based TEEs have a set of drawbacks. Some solutions (i.e., MeetGo and ShEF) prioritize the secure loading of reconfigurable modules but lack compatibility with established legacy TEE specifications and services. On the other hand, those that aim to establish legacy compatibility (i.e., TEEOD and BYOTee) fail to fully utilize the dynamic reconfigurability and parallel processing capabilities inherent in FPGAs. In this context, we introduce Trusted Reconfigurable Execution Environments (TREE), a novel framework that fills the gaps in current FPGA-based TEE approaches. TREE enables system designers to fully leverage the reconfigurability capabilities of FPGAs without compromising compatibility with existing TEE specifications. Our reference TREE implementation ensures secure execution of user-customized hardware, legacy software trusted applications (TAs), and TAs that combine both custom hardware and software components, by fully exploiting the FPGA’s dynamic partial reconfiguration capabilities. TREE’s root of trust relies on conventional SoC-FPGA mechanisms, including secure initial reconfiguration and memory protection, to ensure that the integrity of the initial bitstream is preserved after loading and that reconfiguration access is restricted to the FPGA fabric after boot. Additionally, TREE provides essential TEE services within the FPGA fabric, including secure storage and cryptographic functions, enabling TAs to securely store sensitive data and perform critical operations in an isolated environment.
Our evaluation on an entry-level FPGA assessed TREE using microbenchmarks and real-world applications to compare its hardware costs and performance speedups against OP-TEE. The results showed that TREE’s hardware costs are minimal, while it achieves significant performance speedups, especially when compared to hardware TAs. For empirical demonstrations, we assess two real-world TA examples on TREE: an access control authenticator and a Bitcoin wallet.

2025

TCHES

VeloFHE: GPU Acceleration for FHEW and TFHE Bootstrapping Abstract

Shiyu Shen Hao Yang Zhe Liu Ying Liu Xianhui Lu Wangchen Dai Lu Zhou Yunlei Zhao Ray C. C. Cheung

Bit-wise Fully Homomorphic Encryption schemes like FHEW and TFHE offer efficient functional bootstrapping, enabling concurrent function evaluation and noise reduction. While advantageous for secure computations, these schemes suffer from high data expansion, posing significant performance challenges in practical applications due to massive ciphertexts. To address these issues, we propose VeloFHE, a CUDA-accelerated design to enhance the efficiency of FHEW and TFHE schemes on GPUs. We develop a novel hybrid four-step Number Theoretic Transform (NTT) approach for fast polynomial multiplication. By decomposing large-scale NTTs into highly parallelizable submodules, incorporating cyclic and negacyclic convolutions, and introducing several memory-oriented optimizations, we significantly reduce both the computational complexity and memory requirements. For blind rotation, besides the gadget decomposition approach, we also apply a recently proposed modulus raising technique to both schemes to alleviate memory pressure. We further optimize it by refining computational flow to reduce noise from scaling and maintain accumulator compatibility. For key switching, we address input-output parallelism mismatches and offload suitable computations to the CPU, effectively hiding latency through asynchronous execution. Additionally, we explore batching in bootstrapping, developing a general framework that accommodates both schemes with either gadget decomposition or modulus raising method. Our experimental results demonstrate significant performance improvements. The proposed NTT implementation shows over 35% improvement compared to recent GPU implementations. On an RTX 4090 GPU, we achieve speedups of 371.86x and 390.44x for FHEW and TFHE gate bootstrapping, respectively, compared to OpenFHE running on a 48-thread CPU at a 128-bit security level. The corresponding throughputs are 7,007 and 11,378 operations per second.
Furthermore, relative to the state-of-the-art GPU implementation [XLK+25], our approach provides speedups of 2.56x, 2.24x, and 2.33x for TFHE gate bootstrapping, homomorphic evaluation of arbitrary functions, and homomorphic flooring operation, respectively. Our VeloFHE surpasses some current hardware designs, offering an effective solution for more practical and efficient privacy-preserving computations.

2025

TCHES

VIMA: A Privacy-Preserving Integrity Measurement Architecture for Containerized Environments Abstract

Omar Jarkas Ryan K L Ko Naipeng Dong Redowan Mahmud

Integrity verification and attestation are critical in containerized environments, where traditional Linux Integrity Measurement Architecture (IMA) falls short due to its lack of container-specific contextualization. These gaps undermine container autonomy, escalate privacy risks, and impede granular integrity checks. Addressing these challenges, this paper introduces the Virtual IMA (VIMA), a novel framework that refines Linux IMA’s principles to support containerized settings. Using nested Merkle trees, VIMA’s Two-Tree Architecture (2TA) enables detailed integrity assessments across system-wide monolithic trees and individual container trees. Integrating Merkle and zero-knowledge (ZK) proofs establishes VIMA as a secure, privacy-preserving verification and attestation solution. Our comparative analysis and initial prototype testing reveal that VIMA significantly improves upon traditional IMA with minimal performance overhead, offering substantial scope for optimization.
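
The nested-tree idea can be sketched in a few lines of Python: each container gets its own Merkle tree over its measurements, and the container roots become the leaves of a system-wide tree. The hashing scheme, the domain-separation bytes, the container names, and the function names below are all illustrative assumptions rather than VIMA's actual design, and the zero-knowledge proofs are not covered.

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256 digest helper."""
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root of a binary Merkle tree; an odd node is paired with itself."""
    if not leaves:
        return h(b"")
    # Prefix bytes separate leaf hashes from inner-node hashes (an assumption).
    level = [h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

def container_roots(containers):
    """One Merkle root per container, computed over that container's measurements."""
    return {name: merkle_root(files) for name, files in containers.items()}

def system_root(containers):
    """System-wide root whose leaves are the per-container roots."""
    roots = container_roots(containers)
    return merkle_root([roots[name] for name in sorted(roots)])
```

In this sketch, tampering with a measurement in one container changes that container's root and the system root, while the roots of the other containers stay the same, so a verifier can check a single container without inspecting the others.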

2025

TCHES

XBOOT: Free-XOR Gates for CKKS with Applications to Transciphering Abstract

Chao Niu Zhicong Huang Zhaomin Yang Yi Chen Liang Kong Cheng Hong Tao Wei

The CKKS scheme is traditionally recognized for approximate homomorphic encryption of real numbers, but BLEACH (Drucker et al., JoC 2024) extends its capabilities to handle exact computations on binary or small integer numbers. Despite this advancement, BLEACH’s approach of simulating XOR gates via (a−b)^2 incurs one multiplication per gate, which is computationally expensive in homomorphic encryption. To this end, we introduce XBOOT, a new framework that builds upon BLEACH’s blueprint but allows for almost free evaluation of XOR gates. The core concept of XBOOT involves lazy reduction, where XOR operations are simulated with the less costly addition operation, a+b, leaving the management of potential overflows to later stages. We carefully handle the modulus chain and scale factors to ensure that the overflows are managed during the CKKS bootstrapping phase, preserving the correct XOR result without extra cost. We use AES-CKKS transciphering as a benchmark to test the capability of XBOOT, and achieve a throughput exceeding one kilobyte per second, which represents a 2.5x improvement over the state-of-the-art (Aharoni et al., HES 2023). Moreover, XBOOT enables the practical execution of tasks with extensive XOR operations that were previously challenging for CKKS. For example, we can do Rasta-CKKS transciphering at over two kilobytes per second, more than 10x faster than the baseline without XBOOT.
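
The arithmetic behind the two XOR strategies can be checked on plain integers. The sketch below is a plaintext analogue only: it contrasts the squaring formula, which costs one multiplication per gate, with lazy addition followed by a single deferred reduction. The explicit `% 2` merely stands in for the reduction that, per the abstract, XBOOT folds into CKKS bootstrapping; no encryption is involved and the function names are illustrative.

```python
from functools import reduce
from itertools import product
from operator import xor

def xor_via_square(a, b):
    """XOR of two bits as (a - b)^2: exact, but needs one multiplication per gate."""
    return (a - b) ** 2

def xor_chain_lazy(bits):
    """XOR of many bits by plain addition; the sum may overflow past 1."""
    acc = 0
    for b in bits:
        acc += b  # no reduction here: the overflow is left for later
    return acc

def deferred_reduce(value):
    """Single reduction mod 2, applied once at the end of the chain."""
    return value % 2

def check_all(width=4):
    """Compare both strategies against Python's own XOR on all bit tuples."""
    for bits in product((0, 1), repeat=width):
        expected = reduce(xor, bits)
        squared = reduce(xor_via_square, bits)
        lazy = deferred_reduce(xor_chain_lazy(bits))
        if not (expected == squared == lazy):
            return False
    return True
```

A chain of k XOR gates thus needs k multiplications with the squaring formula but only additions plus one final reduction in the lazy variant.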

International Association for Cryptologic Research
