Modern digital communication relies heavily on cryptographic protection to ensure
data integrity and privacy. In order to deploy state-of-the art cryptographic primitives and protocols in real-world scenarios, one needs to highly optimize software for both speed and security. This requires careful choices of high-level cryptographic
parameters, low-level optimization of software on the assembly level for a given microarchitecture and considerations of the subtle interactions between high-level and low-level optimizations. This thesis considers three examples of cryptographic primitives and describes software implementations of these primitives that set new speed records.

The Advanced Encryption Standard (AES) is one of the most widely used symmetric cryptographic primitives. The traditional implementation approach for AES
is based on table lookups. While software based on this approach still achieves best performance for a variety of 32-bit and 64-bit architectures, it is usually vulnerable to cache-timing attacks. Another implementation approach for AES is the bitslic-
ing technique. Not only is software based on this approach inherently protected against cache-timing attacks, on some microarchitectures it even achieves better performance.

Elliptic-curve cryptography is the current state of the art of asymmetric cryptography. For elliptic-curve Diffie-Hellman key exchange, Bernstein proposed the Curve25519 function. Several speed-record-setting implementations of this function
have been developed for a variety of architectures. Optimizing Curve25519 software for the Synergistic Processor Units of the Cell Broadband Engine is a particularly interesting challenge because the small integer multipliers of this architecture do not
seem to make it the best-suited platform for public-key cryptography.

Another use of elliptic curves in cryptography is in the construction of cryptographic pairings. In order to make pairings fast and secure, very special elliptic curves—so-called pairing-friendly curves—are required. For cryptographic pairings on the 128-bit security level Barreto-Naehrig curves of size about 256 bits are the best choice. Optimizing pairing software is more complex than optimizing, e.g.,
elliptic-curve Diffie-Hellman key-exchange software. The reason is that pairings involve multiple computation steps and multiple mathematical structures. A choice of parameters considering only some of these steps or structures is likely to incur high performance penalties in the other steps or structures.

Evaluating the security of cryptographic primitives requires cryptanalytic effort. In many ways optimizing cryptanalytic algorithms is similar to optimizing cryptographic primitives: Careful choices of high-level algorithmic parameters such as a
Pollard rho iteration function need to be combined with low-level software optimization. A major difference when optimizing cryptanalytic algorithms is the high degree of parallelism that requires additional understanding in optimizing parallel
algorithms and in network protocols. This thesis considers two cryptanalytical applications with very different performance bottlenecks and optimization requirements.

Pollard’s rho algorithm is the best known algorithm to solve the elliptic-curve discrete-logarithm problem for most prime-order elliptic-curve groups. Large instances of this problem, such as Certicom’s challenge ECC2K-130 considered in this thesis, are usually solved using a parallel version of Pollard’s rho algorithm, which uses a client-server approach. The efficiency of this approach is mainly determined
by the speed of the iteration function, which runs on all clients independently in parallel.

A cryptanalytical algorithm with significantly more complex parallelization requirements is Wagner’s tree algorithm. This algorithm involves a huge amount of
data for cryptographically relevant inputs; each byte of data needs to be loaded and stored several times. In a parallel environment with distributed storage data cannot be kept local: each byte also needs to be sent various times over the network. The implementation of Wagner’s tree algorithm to find a collision in the toy version FSB48 of the SHA-3 round-1 candidate FSB on a cluster of 8 computers with a total of 5.5 TB of distributed storage demonstrates techniques to apply this algorithm in
storage-restricted environments.