CryptoDB
Simon Phillipp Merz
Publications and invited talks
Year
Venue
Title
2025
RWC
Breaking and Fixing Length Leakage in Content-Defined Chunking
Abstract
Most applications that deduplicate data first split the data into smaller blocks, called chunks, using content-defined chunking (CDC). CDC places chunk boundaries based on a local context window in the data. As a result, chunk boundaries are preserved when the data is changed, which enables significant deduplication efficiency gains across applications dealing with large redundant datasets, such as backup solutions, software patching systems, and file hosting platforms like IPFS and HuggingFace.
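To make the boundary-preservation property concrete, the following is a minimal sketch of a toy chunker, not the algorithm of any system named above. The window size, the cut mask, and the use of BLAKE2b as the window hash are illustrative assumptions, and the hash is recomputed at every position for clarity, where practical chunkers would use a rolling hash.

```python
import hashlib
import random

WINDOW = 16   # assumed: bytes of local context examined at each position
MASK = 0x3F   # assumed: cut when the low 6 bits are zero (about 64-byte chunks)


def window_hash(window: bytes) -> int:
    # Recomputed from scratch for clarity; not a true rolling hash.
    return int.from_bytes(hashlib.blake2b(window, digest_size=4).digest(), "big")


def cdc_chunks(data: bytes) -> list:
    """Split data wherever the hash of the last WINDOW bytes matches the mask."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if window_hash(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


# Deterministic pseudo-random input so the demo is reproducible.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"X" + original  # insert one byte at the very front

a = cdc_chunks(original)
b = cdc_chunks(edited)

# Every chunk after the first cut of `original` reappears unchanged in `edited`.
shared = len(set(a) & set(b))
print(f"{len(a)} chunks before, {len(b)} after, {shared} shared")
```

Because each cut decision depends only on the preceding `WINDOW` bytes, the one-byte insertion can only affect chunks near the front. A fixed-size splitter would instead shift every block and share none of them.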
However, CDC also introduces a subtle leakage: the length of each chunk leaks information about the data being chunked. This enables fingerprinting attacks, where adversaries exploit chunk length patterns to infer the presence or structure of specific data. Such attacks threaten confidentiality in scenarios ranging from encrypted backups on untrusted cloud servers to data transmitted over encrypted channels. To address these risks, many systems, mainly in the cloud backup setting, have developed bespoke mitigations that mix a cryptographic key into the chunking process.
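The fingerprinting idea can be illustrated with a small sketch under stated assumptions: an unkeyed toy chunker (BLAKE2b over a 16-byte window with a 6-bit cut mask, all hypothetical choices) and an adversary who observes only the sequence of chunk lengths, for example because each chunk is encrypted and stored separately. It is not one of the attacks from the talk, which target keyed mitigations.

```python
import hashlib
import random

WINDOW, MASK = 16, 0x3F  # assumed toy parameters, not taken from any real system


def chunk_lengths(data: bytes) -> list:
    """Return only the chunk lengths, i.e. what a length-observing adversary sees."""
    lengths, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        digest = hashlib.blake2b(data[end - WINDOW:end], digest_size=4).digest()
        if int.from_bytes(digest, "big") & MASK == 0:
            lengths.append(end - start)
            start = end
    if start < len(data):
        lengths.append(len(data) - start)
    return lengths


def contains_run(haystack: list, needle: list) -> bool:
    """True if needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


rng = random.Random(1)


def rand_bytes(n: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(n))


known_file = rand_bytes(2048)                               # data the adversary knows
container = rand_bytes(700) + known_file + rand_bytes(900)  # e.g. a backup stream
unrelated = rand_bytes(len(container))                      # stream without the file

# Drop the first and last chunk: their lengths depend on the surrounding data.
fingerprint = chunk_lengths(known_file)[1:-1]

found_in_container = contains_run(chunk_lengths(container), fingerprint)
found_in_unrelated = contains_run(chunk_lengths(unrelated), fingerprint)
print(found_in_container, found_in_unrelated)
```

The interior chunk lengths of the known file reappear as a contiguous run in the container's length sequence, because the cut points inside the file depend only on the file's own bytes. A match in an unrelated stream of this size would be extremely unlikely.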
We demonstrate the ineffectiveness of these mitigations by presenting efficient key recovery attacks that rely solely on a known-plaintext assumption. These attacks entirely circumvent all folklore mitigations except one, re-enabling fingerprinting attacks. To address this, we introduce a formal treatment of Keyed Content-Defined Chunking (KCDC) schemes and propose a provably secure construction that fulfills a strong notion of security. In doing so, we take a step towards making these real-world systems more resilient against leakage.
Coauthors
- Felix Günther (1)
- Simon Phillipp Merz (1)
- Kenny Paterson (1)
- Matteo Scarlata (1)
- Kien Tuong Truong (1)