International Association for Cryptologic Research

CryptoDB

Fast and Clean: Auditable high-performance assembly via constraint solving

Authors:
Amin Abdulrahman, Ruhr University Bochum, Bochum, Germany; Max Planck Institute for Security and Privacy, Bochum, Germany
Hanno Becker, Automated Reasoning Group, Amazon Web Services, Cambridge, United Kingdom
Matthias J. Kannwischer, Quantum Safe Migration Center, Chelpis Quantum Tech, Taipei, Taiwan
Fabien Klein, Arm Limited
Download:
DOI: 10.46586/tches.v2024.i1.87-132
URL: https://tches.iacr.org/index.php/TCHES/article/view/11241
Abstract: Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice. In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture. We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72 microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.
BibTeX
@article{tches-2023-33663,
  title={Fast and Clean: Auditable high-performance assembly via constraint solving},
  journal={IACR Transactions on Cryptographic Hardware and Embedded Systems},
  publisher={Ruhr-Universität Bochum},
  volume={2024},
  number={1},
  pages={87-132},
  url={https://tches.iacr.org/index.php/TCHES/article/view/11241},
  doi={10.46586/tches.v2024.i1.87-132},
  author={Amin Abdulrahman and Hanno Becker and Matthias J. Kannwischer and Fabien Klein},
  year=2023
}