IACR News item: 09 June 2025
Simon Langowski, Srini Devadas
Modular arithmetic is the computational backbone of many cryptographic and scientific algorithms.
In particular, modular multiplication in a large prime field is computationally expensive and dictates the runtime of many algorithms. While it is relatively easy to utilize vectorization to accelerate batches of independent modular multiplications, our goal is to reduce the latency of a $\textit{single}$ modular multiplication under a generic prime using vectorization, while maintaining constant-time execution.
We investigate the technique of Residue Number System (RNS) Montgomery modular multiplication. We first contribute a unified view of algorithmic optimizations in prior art. This view enables us to further reduce the number of elementwise multiplications in an algorithm with a simplified structure that we prove correct.
We explore AVX512 acceleration on CPUs, and show how to map our algorithm to vector instructions. We implement our algorithm in C++ and achieve $\approx 4 \times$ speedup, which is nearly maximal, for $1024+$-bit primes on a CPU with AVX512 over optimized library implementations of standard Montgomery modular multiplication algorithms. GPUs contain vector cores that each support tens of physical threads. We show how to intelligently map our algorithm to threads in a vector core, ``overparallelizing'' to minimize latency. We show substantial speedups over a commercial library implementation of standard modular multiplication algorithms across a wide range of prime sizes.
We investigate the technique of Residue Number System (RNS) Montgomery modular multiplication. We first contribute a unified view of algorithmic optimizations in prior art. This view enables us to further reduce the number of elementwise multiplications in an algorithm with a simplified structure that we prove correct.
We explore AVX512 acceleration on CPUs, and show how to map our algorithm to vector instructions. We implement our algorithm in C++ and achieve $\approx 4 \times$ speedup, which is nearly maximal, for $1024+$-bit primes on a CPU with AVX512 over optimized library implementations of standard Montgomery modular multiplication algorithms. GPUs contain vector cores that each support tens of physical threads. We show how to intelligently map our algorithm to threads in a vector core, ``overparallelizing'' to minimize latency. We show substantial speedups over a commercial library implementation of standard modular multiplication algorithms across a wide range of prime sizes.
Additional news items may be found on the IACR news page.