A Closer Look at the Chaotic Ring Oscillators based TRNG Design Abstract
A TRNG is an essential component of security applications. A vulnerable TRNG could be exploited to facilitate attacks or could lead to a reduced key space, eventually resulting in a compromised cryptographic system. A digital FIRO-/GARO-based TRNG with high throughput and a high entropy rate was introduced by Jovan Dj. Golic (TC’06). However, Markus Dichtl (ePrint’15) observed that periodic oscillation is a main failure mode of FIRO-/GARO-based TRNGs. We verify this problem and estimate the consequential entropy loss using Lyapunov exponents and the test suite of the NIST SP 800-90B standard. To address the problem of periodic oscillations, we propose several implementation guidelines based on a gate-level model, a design methodology for building a reliable GARO-based TRNG, and an online test to improve the robustness of FIRO-/GARO-based TRNGs. The gate-level implementation guidelines illustrate the causes of periodic oscillations, which are verified by actual implementation and bifurcation diagrams. Based on the design methodology, a suitable feedback polynomial can be selected by evaluating candidate feedback polynomials. Delay adjustment deepens the analysis and understanding of periodic oscillation and of FIRO-/GARO-based TRNGs. A TRNG with the selected feedback polynomial may still occasionally enter periodic oscillation because of active attacks and the delay inconstancy of implementations. This inconstancy might be caused by self-heating, temperature and voltage fluctuation, and process variation among different silicon chips. Thus, an online test module, an indispensable component of TRNGs, is proposed to detect periodic oscillations. A detected periodic oscillation can be eliminated by adjusting the feedback polynomial or the delays, which improves robustness. The online test module is built around a lightweight and responsive detector with a high detection rate, outperforming the existing detector design and statistical tests. 
Area, power consumption, and frequency are evaluated on ASIC implementations of a GARO, the sampling circuit, and the online test module. By providing a deeper understanding, the gate-level implementation guidelines also support the future establishment of a stochastic model of FIRO-/GARO-based TRNGs.
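To make the idea of an online test for periodic oscillation concrete, the following is a minimal software sketch. It is not the detector proposed in the paper, whose design is not described in this abstract. It assumes the sampled TRNG output is available as a list of bits and flags a period when the sequence agrees with a shifted copy of itself far more often than a random sequence would. The names `detect_period`, `max_lag`, and `threshold` are hypothetical.

```python
import random


def detect_period(bits, max_lag=64, threshold=0.95):
    """Return the smallest lag at which the bit sequence matches a shifted
    copy of itself in at least `threshold` of the positions, else None.

    Illustrative only: for unbiased random bits the match rate stays near
    0.5 for every lag, while a periodic oscillation gives a rate near 1.0
    at its period.
    """
    n = len(bits)
    for lag in range(1, min(max_lag, n - 1) + 1):
        matches = sum(1 for i in range(n - lag) if bits[i] == bits[i + lag])
        if matches / (n - lag) >= threshold:
            return lag
    return None


# A stuck oscillator repeating a 5-bit pattern versus pseudo-random bits.
periodic = [1, 0, 1, 1, 0] * 400
rng = random.Random(1)
noisy = [rng.getrandbits(1) for _ in range(2000)]
print(detect_period(periodic), detect_period(noisy))
```

A hardware detector would need to reach a decision with far less storage and latency than this exhaustive comparison, which is the kind of constraint the abstract's "lightweight and responsive" wording points to.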
CFNTT: Scalable Radix-2/4 NTT Multiplication Architecture with an Efficient Conflict-free Memory Mapping Scheme Abstract
The number theoretic transform (NTT) is widely used to speed up polynomial multiplication, which is the critical computational bottleneck in many cryptographic algorithms such as lattice-based post-quantum cryptography (PQC) and homomorphic encryption (HE). One trend in NTT hardware architecture is to support diverse security parameters and meet resource constraints on different computing platforms. Thus flexibility and area-time product (ATP) become two crucial metrics in NTT hardware design. Flexibility with respect to vector size and modulus can be obtained directly. However, the varying strides of memory accesses in an in-place NTT make it difficult to design for different radices and numbers of parallel butterfly units. This paper proposes an efficient conflict-free memory mapping scheme that supports configurations with multiple parallel butterfly units and an arbitrary NTT radix. Compared to other approaches, this scheme has broader applicability and facilitates the parallelization of non-radix-2 NTT hardware designs. Based on this scheme, we propose a scalable radix-2 and radix-4 NTT multiplication architecture through algorithm-hardware co-design. A dedicated scheduling method reduces the number of modular additions/subtractions and modular multiplications in the radix-4 butterfly unit by 20% and 33%, respectively. To avoid the bit-reversal cost and reduce the memory footprint of arbitrary-radix NTT/INTT, we put forward a general method that rearranges the loop structure and reuses the twiddle factors. Hardware-level optimization is achieved by exploiting the symmetric operators in the radix-4 butterfly unit, which saves almost 50% of the hardware resources compared to a straightforward implementation. Through experimental results and theoretical analysis, we show that the radix-4 NTT outperforms the radix-2 NTT with the same number of parallel butterfly units in terms of area-time performance in an interleaved memory system. 
This advantage grows as the number of parallel butterfly units increases. For example, when processing a 1024-point, 14-bit NTT with 8 parallel butterfly units, the ATP of LUT/FF/DSP/BRAM in the radix-4 NTT core is approximately 2.2×/1.2×/1.1×/1.9× less than that of the radix-2 NTT core on a similar FPGA platform.
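The memory-conflict problem can be illustrated with the simplest case: one radix-2 butterfly unit and two memory banks. The sketch below is not the mapping scheme proposed in the paper, which covers multiple butterfly units and arbitrary radix. It uses the classic address-parity mapping as a stand-in and counts how often the two operands of a butterfly land in the same bank. The function names are hypothetical.

```python
def parity_bank(addr):
    """Bank index = XOR of all address bits (classic two-bank parity mapping)."""
    return bin(addr).count("1") & 1


def low_bit_bank(addr):
    """Naive interleaving: bank index = least significant address bit."""
    return addr & 1


def butterfly_pairs(n):
    """Yield the (i, j) operand addresses of every butterfly in an in-place
    radix-2 transform of size n (n must be a power of two). The stride
    between the two operands doubles at each stage."""
    stride = 1
    while stride < n:
        for start in range(0, n, 2 * stride):
            for k in range(stride):
                yield start + k, start + k + stride
        stride *= 2


def count_conflicts(n, mapping):
    """Count butterflies whose two operands map to the same bank."""
    return sum(1 for i, j in butterfly_pairs(n) if mapping(i) == mapping(j))


print(count_conflicts(1024, low_bit_bank), count_conflicts(1024, parity_bank))
```

The two operands of a radix-2 butterfly differ in exactly one address bit, so their parities always differ and the parity mapping never conflicts. With naive interleaving, every stage after the first reads both operands from the same bank. Extending this guarantee to several butterfly units and to radix-4 is where a more general scheme is needed.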
A Compact and High-Performance Hardware Architecture for CRYSTALS-Dilithium Abstract
The lattice-based CRYSTALS-Dilithium scheme is one of the three third-round digital signature finalists in the National Institute of Standards and Technology Post-Quantum Cryptography Standardization Process. Due to the complex calculations and highly individualized functions in Dilithium, its hardware implementations face the problems of large area requirements and low efficiency. This paper proposes several optimization methods to achieve a compact and high-performance hardware architecture for round 3 Dilithium. Specifically, a segmented pipelined processing method is proposed to reduce both the storage requirements and the processing time. Moreover, several optimized modules are designed to improve the efficiency of the proposed architecture, including a pipelined number theoretic transform module, a SampleInBall module, a Decompose module, and three modular reduction modules. Compared with state-of-the-art designs for Dilithium on similar platforms, our implementation requires 1.4×/1.4×/3.0×/4.5× fewer LUTs/FFs/BRAMs/DSPs, respectively, and 4.4×/1.7×/1.4× less time for key generation, signature generation, and signature verification, respectively, for NIST security level 5.
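As a software reference for what a Decompose module computes, the sketch below follows the Decompose routine as it is commonly described for round-3 Dilithium. It says nothing about the paper's hardware design. It assumes the modulus q = 8380417 and, for security level 5, γ2 = (q − 1)/32. The helper `centered_mod` is a hypothetical name.

```python
Q = 8380417              # Dilithium modulus (assumed from the public specification)
GAMMA2 = (Q - 1) // 32   # assumed gamma2 for security level 5
ALPHA = 2 * GAMMA2


def centered_mod(r, m):
    """Remainder of r modulo an even m, taken in the range (-m/2, m/2]."""
    r0 = r % m
    if r0 > m // 2:
        r0 -= m
    return r0


def decompose(r, alpha=ALPHA, q=Q):
    """Split r into (r1, r0) with r = r1 * alpha + r0 (mod q) and r0 small.

    When r - r0 equals q - 1, the high part wraps around to 0 and the low
    part is decremented by one so that the identity still holds mod q.
    """
    r = r % q
    r0 = centered_mod(r, alpha)
    if r - r0 == q - 1:
        return 0, r0 - 1
    return (r - r0) // alpha, r0


r1, r0 = decompose(1234567)
print(r1, r0, (r1 * ALPHA + r0) % Q)
```

The division by a fixed `alpha` and the wrap-around corner case are the parts a hardware module would typically specialize, since both depend only on constants.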
Highly Efficient Architecture of NewHope-NIST on FPGA using Low-Complexity NTT/INTT Abstract
NewHope-NIST is a promising ring learning with errors (RLWE)-based post-quantum cryptography (PQC) scheme for key encapsulation. Its performance on field-programmable gate arrays (FPGAs) affects the applicability of NewHope-NIST. In RLWE-based PQC algorithms, the number theoretic transform (NTT) is one of the most time-consuming operations. In this paper, low-complexity NTT and inverse NTT (INTT) are used to implement highly efficient NewHope-NIST on FPGA. First, both the pre-processing of NTT and the post-processing of INTT are merged into the fast Fourier transform (FFT) algorithm, which eliminates N and 2N modular multiplications for N-point NTT and INTT, respectively. Second, a compact butterfly unit and an efficient modular reduction for the modulus 12289 are proposed for the low-complexity NTT/INTT architecture, which achieves an improvement of approximately 3× in the area-time product (ATP) compared with state-of-the-art designs. Finally, a highly efficient architecture with doubled bandwidth and timing hiding for NewHope-NIST is presented. The implementation results on an FPGA show that our design is at least 2.5× faster and has a 4.9× smaller ATP than state-of-the-art designs of NewHope-NIST on similar platforms.
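The pre-processing and post-processing steps mentioned above can be seen in a textbook negacyclic multiplication modulo x^N + 1 with q = 12289. The sketch below is a reference model under stated assumptions, not the paper's merged algorithm. It keeps the pre- and post-processing as separate loops so their multiplication counts are visible, and it uses a quadratic-time transform for clarity instead of a fast butterfly network. All function names are hypothetical, and `pow(x, -1, q)` requires Python 3.8 or later.

```python
Q = 12289  # NewHope modulus


def find_psi(n, q=Q):
    """Return a primitive 2n-th root of unity mod q (n a power of two)."""
    for x in range(2, q):
        cand = pow(x, (q - 1) // (2 * n), q)
        if pow(cand, n, q) == q - 1:   # cand^n = -1, so the order is exactly 2n
            return cand
    raise ValueError("no suitable root found")


def dft(a, omega, q=Q):
    """Plain O(n^2) transform over Z_q, used here only for readability."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]


def negacyclic_mul(a, b, q=Q):
    n = len(a)
    psi = find_psi(n, q)
    omega = psi * psi % q
    # Pre-processing: n extra multiplications by psi^i for each input.
    a_s = [a[i] * pow(psi, i, q) % q for i in range(n)]
    b_s = [b[i] * pow(psi, i, q) % q for i in range(n)]
    prod = [x * y % q for x, y in zip(dft(a_s, omega, q), dft(b_s, omega, q))]
    c_s = dft(prod, pow(omega, -1, q), q)
    # Post-processing: 2n extra multiplications (by n^-1 and by psi^-i).
    n_inv, psi_inv = pow(n, -1, q), pow(psi, -1, q)
    return [c_s[i] * n_inv % q * pow(psi_inv, i, q) % q for i in range(n)]


def schoolbook_negacyclic(a, b, q=Q):
    """Direct multiplication modulo x^n + 1, used as a cross-check."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c


a, b = [1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]
print(negacyclic_mul(a, b) == schoolbook_negacyclic(a, b))
```

Folding the powers of psi and the factor n^-1 into the twiddle factors of the transform itself is the standard way to remove these separate loops.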
ES-TRNG: A High-throughput, Low-area True Random Number Generator based on Edge Sampling Abstract
In this paper we present a novel true random number generator based on high-precision edge sampling. We use two novel techniques to increase the throughput and reduce the area of the proposed randomness source: variable-precision phase encoding and repetitive sampling. The first technique consists of encoding the oscillator phase with high precision in the regions around the signal edges and with low precision everywhere else. This technique results in a compact implementation at the expense of reduced entropy in some samples. The second technique consists of repeating the sampling at high frequency until the phase region encoded with high precision is captured. This technique ensures that only the high-entropy bits are sent to the output. The combination of the two proposed techniques results in a secure TRNG, which suits both ASIC and FPGA implementations. The core part of the proposed generator is implemented with 10 look-up tables (LUTs) and 5 flip-flops (FFs) of a Xilinx Spartan-6 FPGA, and achieves a throughput of 1.15 Mbps with 0.997 bits of Shannon entropy. On Intel Cyclone V FPGAs, this implementation uses 10 LUTs and 6 FFs, and achieves a throughput of 1.07 Mbps. This TRNG design is supported by a stochastic model and a formal security evaluation.
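To put the figure of 0.997 bits of Shannon entropy in perspective, the sketch below computes the entropy of a single biased bit and inverts it numerically. It assumes independent, identically distributed output bits, which is a simplification and not the stochastic model used in the paper. The function names are hypothetical.

```python
import math


def shannon_entropy_bit(p_one):
    """Shannon entropy, in bits, of one binary symbol with P(1) = p_one."""
    if p_one <= 0.0 or p_one >= 1.0:
        return 0.0
    p_zero = 1.0 - p_one
    return -(p_one * math.log2(p_one) + p_zero * math.log2(p_zero))


def p_for_entropy(target, iterations=60):
    """Find p in [0.5, 1.0] with shannon_entropy_bit(p) == target by
    bisection, using the fact that entropy decreases on that interval."""
    lo, hi = 0.5, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if shannon_entropy_bit(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(shannon_entropy_bit(0.5), p_for_entropy(0.997))
```

Under this simplified model, 0.997 bits per output bit corresponds to a bias of roughly three percentage points away from one half.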
- CHES 2022
- CHES 2021
- Chen Chen (1)
- Xiangren Chen (1)
- Milos Grujic (1)
- Zhengdong Li (1)
- Leibo Liu (4)
- Nele Mentens (1)
- Vladimir Rozic (2)
- Shuqin Su (1)
- Ingrid Verbauwhede (1)
- Hanning Wang (1)
- Shaojun Wei (4)
- Mingyuan Yang (1)
- Shouyi Yin (3)
- Neng Zhang (2)
- Cankun Zhao (1)
- Wenping Zhu (1)
- Min Zhu (2)