IACR News item: 12 June 2025
Shutong Jin, Shiyu Shen, Hao Yang, Donglong Chen, Wangchen Dai, Ray C. C. Cheung
Privacy-preserving neural network inference using Fully Homomorphic Encryption (FHE) faces significant challenges in efficiently evaluating non-polynomial functions, such as activation functions, which are critical for introducing non-linearity in neural networks. Full-Domain Functional Bootstrap (FDFB) algorithms provide a promising solution by enabling the evaluation of arbitrary functions while simultaneously refreshing ciphertexts to manage noise accumulation. Despite their theoretical advantages, the practicality of FDFB algorithms has been limited by excessive computational overhead, often exceeding 1000 ms per ciphertext, which restricts their scalability for large neural networks.
To overcome the computational bottlenecks of FDFB, we have re-engineered the algorithms for massively parallel execution on GPUs. Our primary contribution is a hierarchical parallelization strategy that exploits concurrency at the thread, stream, and device levels. A key optimization involves the use of CUDA streams to create a data pipeline that effectively mitigates the overhead of memory transfers between the host and device. This optimized architecture achieves a significant speedup of up to 524$\times$ compared to CPU-based implementations. Our implementation maintains full precision for evaluating various activation functions, confirming its viability for large-scale, privacy-preserving machine learning tasks and paving the way for practical FHE-based deep learning.
To overcome the computational bottlenecks of FDFB, we have re-engineered the algorithms for massively parallel execution on GPUs. Our primary contribution is a hierarchical parallelization strategy that exploits concurrency at the thread, stream, and device levels. A key optimization involves the use of CUDA streams to create a data pipeline that effectively mitigates the overhead of memory transfers between the host and device. This optimized architecture achieves a significant speedup of up to 524$\times$ compared to CPU-based implementations. Our implementation maintains full precision for evaluating various activation functions, confirming its viability for large-scale, privacy-preserving machine learning tasks and paving the way for practical FHE-based deep learning.
Additional news items may be found on the IACR news page.