International Association for Cryptologic Research

International Association
for Cryptologic Research

CryptoDB

Simon Phillipp Merz

Publications and invited talks

Year
Venue
Title
2025
RWC
Breaking and Fixing Length Leakage in Content-Defined Chunking
Most applications that deduplicate data first split said data in smaller blocks, called chunks, using content-defined chunking (CDC). CDC cuts the chunks based on a local context window in the data: this means that chunks boundaries are preserved when the data is changed, and enables significant deduplication efficiency gains across applications dealing with large redundant dataset such as backup solutions, software patching systems, and file hosting platforms like IPFS and HuggingFace. However, CDC also introduces a subtle leakage: the length of each chunk leaks information about the data being chunked. This enables fingerprinting attacks, where adversaries exploit chunk length patterns to infer the presence or structure of specific data. Such attacks threaten confidentiality in scenarios ranging from encrypted backups on untrusted cloud servers to data transmitted over encrypted channels. To address these risks, many systems - mainly in the cloud backup setting - have developed bespoke mitigations by mixing a cryptographic key inside the chunking process. We demonstrate the ineffectiveness of these mitigations by presenting efficient key recovery attacks that rely solely on a known plaintext assumption. These attacks entirely circumvent all folklore mitigations except one, re-enabling fingerprinting attacks. To address this, we introduce a formal treatment for Keyed Content-Defined Chunking (KCDC) schemes and propose a provably secure construction that fulfills a strong notion of security. In doing so, we take a step towards making these real-world systems more resilient against leakage.