APPLICATION-AWARE BIG DATA DEDUPLICATION IN CLOUD ENVIRONMENT
ABSTRACT
Deduplication has become a widely deployed technology in cloud data centers to improve the efficiency of IT resources. However, traditional techniques face a great challenge in big data deduplication: striking a sensible tradeoff between the conflicting goals of scalable deduplication throughput and a high duplicate elimination ratio. We propose AppDedupe, an application-aware scalable inline distributed deduplication framework for cloud environments, to meet this challenge by exploiting application awareness, data similarity, and locality to optimize distributed deduplication with inter-node two-tiered data routing and intra-node application-aware deduplication. It first dispenses application data at the file level with application-aware routing to preserve application locality, then assigns similar application data to the same storage node at the super-chunk granularity using a handprinting-based stateful data routing scheme to maintain high global deduplication efficiency while balancing the workload across nodes. AppDedupe builds application-aware similarity indices with super-chunk handprints to speed up the intra-node deduplication process with high efficiency. Our experimental evaluation of AppDedupe against state-of-the-art schemes, driven by real-world datasets, demonstrates that AppDedupe achieves the highest global deduplication efficiency: it attains higher global deduplication effectiveness than the high-overhead, poorly scalable traditional scheme, at an overhead only slightly higher than that of the scalable but low-duplicate-elimination-ratio approaches.
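The first routing tier described above, which keeps files of the same application together, can be sketched as a global application route table. The following is a minimal illustrative sketch, not the paper's implementation: the class name, the use of file extensions as application types, and the round-robin group assignment are all assumptions made for clarity.

```python
import os

class AppRouteTable:
    """Sketch of a global application route table: files of the same
    application type are always directed to the same storage node group,
    preserving application locality across the cluster."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.table = {}        # application type -> node group
        self.next_group = 0    # round-robin assignment for new types (assumed policy)

    def app_type(self, filename):
        # File extension stands in for the application category here.
        return os.path.splitext(filename)[1].lower()

    def route(self, filename):
        app = self.app_type(filename)
        if app not in self.table:
            # First file of this application type: assign a node group.
            self.table[app] = self.next_group % self.num_nodes
            self.next_group += 1
        return self.table[app]
```

With this table, two virtual machine images would land in the same node group, while an archive file would be routed elsewhere, so the second-tier super-chunk routing only ever compares data of the same application type.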
CONCLUSION
In this paper, we describe AppDedupe, an application-aware scalable inline distributed deduplication framework for big data management, which achieves a tradeoff between scalable performance and distributed deduplication effectiveness by exploiting application awareness, data similarity, and locality. It adopts a two-tiered data routing scheme to route data at the super-chunk granularity, reducing cross-node data redundancy with good load balance and low communication overhead, and employs application-aware similarity-index-based optimization to improve deduplication efficiency in each node with very low RAM usage. Our real-world trace-driven evaluation clearly demonstrates AppDedupe's significant advantages over the state-of-the-art distributed deduplication schemes for large clusters in the following two important ways. First, it outperforms the extremely costly and poorly scalable stateful tight-coupling scheme in the cluster-wide deduplication ratio, at only a slightly higher system overhead than the highly scalable loose-coupling schemes. Second, it significantly improves on the stateless loose-coupling schemes in the cluster-wide effective deduplication ratio while retaining the latter's high system scalability with low overhead.
EXISTING SYSTEM:
Unfortunately, this chunk-based inline distributed deduplication framework faces challenges at large scale in both inter-node and intra-node scenarios. First, in the inter-node scenario, unlike distributed deduplication schemes that pay a high overhead for global match queries, there is a challenge called the deduplication node information island: deduplication is performed only within individual nodes due to communication overhead considerations, leaving cross-node redundancy untouched. Second, in the intra-node scenario, the system suffers from the chunk index lookup disk bottleneck. The chunk index of a large dataset, which maps each chunk's fingerprint to where that chunk is stored on disk in order to identify replicated data, is generally too big to fit into the limited memory of a deduplication node; this causes the parallel deduplication performance of multiple data streams to degrade significantly due to frequent and random disk index I/Os.
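The chunk index at the heart of this bottleneck can be illustrated with a minimal sketch. This is a simplified assumption-laden example, not the system described above: it uses fixed-size chunking (real deduplication systems typically use content-defined chunking) and an in-memory dict standing in for the on-disk index whose lookups cause the bottleneck at scale.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity (assumed)

def deduplicate(data, chunk_index):
    """Split data into chunks and store only chunks whose fingerprint is new.

    chunk_index maps fingerprint -> stored chunk. In a real node this index
    is far too large for RAM, so each membership test below becomes a random
    disk I/O -- the chunk index lookup disk bottleneck.
    """
    recipe, stored = [], 0
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()   # chunk fingerprint
        if fp not in chunk_index:              # index lookup (disk-bound at scale)
            chunk_index[fp] = chunk            # store only the new chunk
            stored += 1
        recipe.append(fp)                      # file recipe references chunks by fingerprint
    return recipe, stored
```

A stream containing a repeated 4 KB block is stored only once, while the file recipe still references every occurrence, which is exactly why the index must be consulted for every incoming chunk.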
PROPOSED SYSTEM:
The proposed AppDedupe distributed deduplication system has the following salient features that distinguish it from state-of-the-art mechanisms:
─ To the best of our knowledge, AppDedupe is the first research work to leverage application awareness in the context of distributed deduplication.
─ It performs a two-tiered routing decision by exploiting application awareness, data similarity, and locality to direct data routing from clients to deduplication storage nodes, achieving a good tradeoff between the conflicting goals of high deduplication effectiveness and low system overhead.
─ It builds a global application route table and independent similarity indices with super-chunk handprints over the traditional chunk-fingerprint indexing scheme to alleviate the chunk lookup disk bottleneck for deduplication in each storage node.
─ Evaluation results show that it consistently and significantly outperforms the state-of-the-art schemes in distributed deduplication efficiency, achieving high global deduplication effectiveness with balanced storage usage across the nodes and high parallel deduplication throughput at a low inter-node communication overhead.
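The second-tier handprinting-based stateful routing can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the handprint size, the load-discounting score, and the index update rule are all hypothetical choices; the paper only specifies that a super-chunk's handprint (a similarity-preserving sample of its chunk fingerprints) is matched against per-node similarity indices with load balancing in mind.

```python
import hashlib

HANDPRINT_SIZE = 4  # k smallest chunk fingerprints form the handprint (assumed k)

def handprint(chunks):
    """Handprint of a super-chunk: the k smallest chunk fingerprints.
    Min-wise sampling makes two similar super-chunks likely to share
    handprint fingerprints."""
    fps = sorted(hashlib.sha1(c).hexdigest() for c in chunks)
    return set(fps[:HANDPRINT_SIZE])

def route_super_chunk(chunks, node_indices, node_loads):
    """Stateful routing sketch: send the super-chunk to the node whose
    similarity index overlaps its handprint the most, discounted by that
    node's current load (assumed weighting) to keep storage balanced."""
    hp = handprint(chunks)

    def score(node):
        overlap = len(hp & node_indices[node])     # shared handprint fingerprints
        return overlap / (1.0 + node_loads[node])  # load-aware discount

    target = max(node_indices, key=score)
    node_indices[target] |= hp          # record the handprint in that node's index
    node_loads[target] += len(chunks)   # track load in chunks (assumed metric)
    return target
```

Because routing is decided from handprints rather than full chunk fingerprints, the client exchanges only a small sample with the cluster, which is how stateful routing keeps inter-node communication overhead low.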