Content-Defined Chunking for Data Deduplication


Content-defined chunking (CDC) algorithms are used in data deduplication to divide incoming data streams into chunks at boundaries determined by the data itself, which increases the probability of finding duplicate chunks. They directly affect both the space savings and the throughput of deduplication systems, because they run millions of times on the critical path. Our work covers several aspects of CDC algorithms: techniques to accelerate them, the impact of dataset characteristics on their behavior, and design improvements that reduce the penalties they incur.
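To make the idea of content-defined boundaries concrete, the following is a minimal sketch of a generic rolling-hash chunker written in Python. It is not the algorithm from any of the publications listed on this page; it only illustrates the general technique of cutting a chunk wherever a hash of the most recent bytes matches a fixed bit pattern. The names MIN_SIZE, MAX_SIZE, MASK, GEAR, chunk_boundaries, and split_chunks, as well as all parameter values, are assumptions chosen for illustration.

```python
import random
from typing import List

# Hypothetical parameters, chosen for illustration only.
MIN_SIZE = 64            # never cut a chunk shorter than this (except the last one)
MAX_SIZE = 1024          # always cut once a chunk reaches this length
MASK = (1 << 8) - 1      # cut when the low 8 hash bits are zero (about 1 in 256 positions)

# Fixed table of pseudo-random 32-bit values, one per byte value.
# A fixed seed keeps chunking deterministic across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk_boundaries(data: bytes) -> List[int]:
    """Return the end offset (exclusive) of every chunk in data."""
    boundaries = []
    start = 0
    n = len(data)
    while start < n:
        limit = min(start + MAX_SIZE, n)
        cut = limit          # fall back to the size limit or end of data
        h = 0
        for i in range(start, limit):
            # Shifting left ages out old bytes, so the low bits of h
            # depend only on the last few bytes seen.
            h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFF
            if i + 1 - start >= MIN_SIZE and (h & MASK) == 0:
                cut = i + 1
                break
        boundaries.append(cut)
        start = cut
    return boundaries


def split_chunks(data: bytes) -> List[bytes]:
    """Split data into chunks at the content-defined boundaries."""
    chunks = []
    prev = 0
    for end in chunk_boundaries(data):
        chunks.append(data[prev:end])
        prev = end
    return chunks
```

Because each cut point depends on nearby content rather than on absolute offsets, inserting a few bytes near the start of a stream tends to shift only the neighboring boundaries, so later chunks can still match previously stored ones. A deduplication system would typically fingerprint each chunk returned by split_chunks (for example with a cryptographic hash) and store only the chunks it has not seen before.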

Publications


[1] VectorCDC: Accelerating Data Deduplication with SSE/AVX Instructions
Sreeharsha Udayashankar, Abdelrahman Baba, Samer Al-Kiswany
USENIX Conference on File and Storage Technologies (FAST), Feb. 2025 [pdf] [slides]

[2] SeqCDC: Hashless Content-Defined Chunking for Data Deduplication
Sreeharsha Udayashankar, Abdelrahman Baba, Samer Al-Kiswany
ACM/IFIP International Middleware Conference (MIDDLEWARE), Dec. 2024 [pdf] [slides]

[3] The Impact of Low-Entropy on Chunking Techniques for Data Deduplication
Mu’men Al Jarah, Sreeharsha Udayashankar, Abdelrahman Baba, Samer Al-Kiswany
IEEE International Conference on Cloud Computing (CLOUD), Jul. 2024 [pdf]

[4] DedupBench: A Benchmarking Tool for Data Chunking Techniques
Alan Liu, Abdelrahman Baba, Sreeharsha Udayashankar, Samer Al-Kiswany
IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), 2023 [pdf]

Patents


[1] Systems and methods of secure deduplication of encrypted content
Samer Al-Kiswany, Sreeharsha Udayashankar, Abdelrahman Baba, Serg Bell, Stanislav Protasov
US Patent Office (US20250005171A1), Jun. 2023 [patent]

[2] Systems and methods for executing jump-based content-defined data chunking
Abdelrahman Baba, Sreeharsha Udayashankar, Samer Al-Kiswany, Serg Bell, Stanislav Protasov
US Patent Office (US20250110924A1), Sep. 2023 [patent]