Julio López

CryptoDB

Julio López

Publications and invited talks

Year

Venue

Title

2025

TCHES

Generation of Fast Finite Field Arithmetic forCortex-M4 with ECDH and SQIsign Applications Abstract

Felix Carvalho Rodrigues Décio Gazzoni Filho Gora Adj Isaac A. Canales-Martínez Jorge Chávez-Saab Julio López Michael Scott Francisco Rodríguez-Henríquez

Finite field arithmetic is central to several cryptographic algorithms on embedded devices like the ARM Cortex-M4, particularly for elliptic curve and isogenybased cryptography. However, rapid algorithm evolution, driven by initiatives such as NIST’s post-quantum standardization, might frequently render hand-optimized implementations obsolete. We address this challenge with m4-modarith, a library generating C code with inline assembly for the Cortex-M4 that rivals custom-tuned assembly, enabling agile development in this ever-changing landscape. Our generated modular multiplications obtains fast performances, competitive with hand-optimized assembly implementations published in the literature, even outperforming some of them for Curve25519. Two contributions are pivotal to this success. First, we introduce a novel multiplication strategy that matches the memory access complexity of the operand caching method while being applicable to a larger cache size for Cortex-M4 implementations. Second, we generalize an efficient pseudo-Mersenne reduction strategy, and formally prove its correctness and applicability for most primes of cryptographic interest. Our generator allowed agile optimization of SQIsign’s NIST PQC Round 2 submission, improving level 1 verification from 123 Mcycles to only 54 Mcycles, a 2.3x speedup. As an additional case study, we use our generator to improve performance of portable implementations of RFC 7748 by up to 2.2x.

2024

CIC

Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs Abstract

Décio Luiz Gazzoni Filho Guilherme Brandão Julio López

Efficient polynomial multiplication routines are critical to the performance of lattice-based post-quantum cryptography (PQC). As PQC standards only recently started to emerge, CPUs still lack specialized instructions to accelerate such routines. Meanwhile, deep learning has grown immeasurably in importance. Its workloads call for teraflops-level of processing power for linear algebra operations, mainly matrix multiplication. Computer architects have responded by introducing ISA extensions, coprocessors and special-purpose cores to accelerate such operations. In particular, Apple ships an undocumented matrix-multiplication coprocessor, AMX, in hundreds of millions of mobile phones, tablets and personal computers. Our work repurposes AMX to implement polynomial multiplication and applies it to the NTRU cryptosystem, setting new speed records on the Apple M1 and M3 systems-on-chip (SoCs): polynomial multiplication, key generation, encapsulation and decapsulation are sped up by $1.54$–$3.07\times$, $1.08$–$1.33\times$, $1.11$–$1.50\times$ and $1.20$–$1.98\times$, respectively, over the previous state-of-the-art.

2024

CIC

Efficient isochronous fixed-weight sampling with applications to NTRU Abstract

Décio Luiz Gazzoni Filho Tomás S. R. Silva Julio López

We present a solution to the open problem of designing a linear-time, unbiased and timing attack-resistant shuffling algorithm for fixed-weight sampling. Although it can be implemented without timing leakages of secret data in any architecture, we illustrate with ARMv7-M and ARMv8-A implementations; for the latter, we take advantage of architectural features such as NEON and conditional instructions, which are representative of features available on architectures targeting similar systems, such as Intel. Our proposed algorithm improves asymptotically upon the current approach based on constant-time sorting networks ($O(n)$ versus $O(n \log^2 n)$), and an implementation of the new algorithm applied to NTRU is also faster in practice, by a factor of up to $6.91\ (591\%)$ on ARMv8-A cores and $12.89\ (1189\%)$ on the Cortex-M4; it also requires fewer uniform random bits. This translates into performance improvements for NTRU encapsulation, compared to state-of-the-art implementations, of up to 50% on ARMv8-A cores and 72% on the Cortex-M4, and small improvements to key generation (up to 2.7% on ARMv8-A cores and 6.1% on the Cortex-M4), with negligible impact on code size and a slight improvement in RAM usage for the Cortex-M4.

2019

JOFC

Koblitz Curves over Quadratic Fields Abstract

Thomaz Oliveira Julio López Daniel Cervantes-Vázquez Francisco Rodríguez-Henríquez

In this work, we retake an old idea that Koblitz presented in his landmark paper (Koblitz, in: Proceedings of CRYPTO 1991. LNCS, vol 576, Springer, Berlin, pp 279–287, 1991 ), where he suggested the possibility of defining anomalous elliptic curves over the base field $${\mathbb {F}}_4$$ F 4 . We present a careful implementation of the base and quadratic field arithmetic required for computing the scalar multiplication operation in such curves. We also introduce two ordinary Koblitz-like elliptic curves defined over $${\mathbb {F}}_4$$ F 4 that are equipped with efficient endomorphisms. To the best of our knowledge, these endomorphisms have not been reported before. In order to achieve a fast reduction procedure, we adopted a redundant trinomial strategy that embeds elements of the field $${\mathbb {F}}_{4^{m}},$$ F 4 m , with m a prime number, into a ring of higher order defined by an almost irreducible trinomial. We also suggest a number of techniques that allow us to take full advantage of the native vector instructions of high-end microprocessors. Our software library achieves the fastest timings reported for the computation of the timing-protected scalar multiplication on Koblitz curves, and competitive timings with respect to the speed records established recently in the computation of the scalar multiplication over binary and prime fields.

2017

CHES

PRESENT Runs Fast Abstract

Tiago B. S. Reis Diego F. Aranha Julio López

The PRESENT block cipher was one of the first hardware-oriented proposals for implementation in extremely resource-constrained environments. Its design is based on 4-bit S-boxes and a 64-bit permutation, a far from optimal choice to achieve good performance in software. As a result, most software implementations require large lookup tables in order to meet efficiency goals. In this paper, we describe a new portable and efficient software implementation of PRESENT, fully protected against timing attacks. Our implementation uses a novel decomposition of the permutation layer, and bitsliced computation of the S-boxes using optimized Boolean formulas, not requiring lookup tables. The implementations are evaluated in embedded ARM CPUs ranging from microcontrollers to full-featured processors equipped with vector instructions. Timings for our software implementation show a significant performance improvement compared to the numbers from the FELICS benchmarking framework. In particular, encrypting 128 bits using CTR mode takes about 2100 cycles on a Cortex-M3, improving on the best Assembly implementation in FELICS by a factor of 8. Additionally, we present the fastest masked implementation of PRESENT for protection against timing and other side-channel attacks in the scenario we consider, improving on related work by 15%. Hence, we conclude that PRESENT can be remarkably efficient in software if implemented with our techniques, and even compete with a software implementation of AES in terms of latency while offering a much smaller code footprint.

2016

CHES

Software Implementation of Koblitz Curves over Quadratic Fields 📺

Thomaz Oliveira Julio López Francisco Rodríguez-Henríquez

2013

CHES

Lambda Coordinates for Binary Elliptic Curves ★

Thomaz Oliveira Julio López Diego F. Aranha Francisco Rodríguez-Henríquez

2011

EUROCRYPT

Faster Explicit Formulas for Computing Pairings over Ordinary Curves

Diego F. Aranha Koray Karabina Patrick Longa Catherine H. Gebotys Julio López