International Association for Cryptologic Research

International Association
for Cryptologic Research


Julio López


The PRESENT block cipher was one of the first hardware-oriented proposals for implementation in extremely resource-constrained environments. Its design is based on 4-bit S-boxes and a 64-bit permutation, a far from optimal choice to achieve good performance in software. As a result, most software implementations require large lookup tables in order to meet efficiency goals. In this paper, we describe a new portable and efficient software implementation of PRESENT, fully protected against timing attacks. Our implementation uses a novel decomposition of the permutation layer, and bitsliced computation of the S-boxes using optimized Boolean formulas, not requiring lookup tables. The implementations are evaluated in embedded ARM CPUs ranging from microcontrollers to full-featured processors equipped with vector instructions. Timings for our software implementation show a significant performance improvement compared to the numbers from the FELICS benchmarking framework. In particular, encrypting 128 bits using CTR mode takes about 2100 cycles on a Cortex-M3, improving on the best Assembly implementation in FELICS by a factor of 8. Additionally, we present the fastest masked implementation of PRESENT for protection against timing and other side-channel attacks in the scenario we consider, improving on related work by 15%. Hence, we conclude that PRESENT can be remarkably efficient in software if implemented with our techniques, and even compete with a software implementation of AES in terms of latency while offering a much smaller code footprint.
Improving the performance of Luffa Hash Algorithm
Thomaz Oliveira Julio López
Luffa is a new hash algorithm that has been accepted for round two of the NIST hash function competition SHA-3. Computational efficiency is the second most important evaluation criteria used to compare candidate algorithms. In this paper, we describe a fast software implementation of the Luffa hash algorithm for the Intel Core 2 Duo platform. We explore the use of the perfect shuffle operation to improve the performance of 64-bit implementation and 128-bit implementation with the Intel Supplemental SSSE3 instructions. In addition, we introduce a new way of implementing Luffa based on a Parallel Table Lookup instruction. The timings of our 64-bit implementation (C code) resulted in a 16 to 32% speed improvement over the previous fastest implementation.
TinyTate: Identity-Based Encryption for Sensor Networks
In spite of several years of intense research, the area of security and cryptography in Wireless Sensor Networks (WSNs) still has a number of open problems. On the other hand, the advent of Identity-Based Encryption (IBE) has enabled a wide range of new cryptographic solutions. In this work, we argue that IBE is ideal for WSNs and vice versa. We discuss the synergy between the systems, describe how WSNs can take advantage of IBE, and present results for computation of the Tate pairing over resource constrained nodes.
Low Complexity Bit-Parallel Square Root Computation over GF($2^m$) for all Trinomials
In this contribution we introduce a low-complexity bit-parallel algorithm for computing square roots over binary extension fields. Our proposed method can be applied for any type of irreducible polynomials. We derive explicit formulae for the space and time complexities associated to the square root operator when working with binary extension fields generated using irreducible trinomials. We show that for those finite fields, it is possible to compute the square root of an arbitrary field element with equal or better hardware efficiency than the one associated to the field squaring operation. Furthermore, a practical application of the square root operator in the domain of field exponentiation computation is presented. It is shown that by using as building blocks squarers, multipliers and square root blocks, a parallel version of the classical square-and-multiply exponentiation algorithm can be obtained. A hardware implementation of that parallel version may provide a speedup of up to 50\% percent when compared with the traditional version.