KBF: A BLOOM FILTER FOR KEY-VALUE STORAGE WITH AN APPLICATION ON APPROXIMATE STATE MACHINES
Abstract
Key-value (k-v) storage has been used as a crucial component for many network applications, such as social networks, online retailing, and cloud computing. Such storage usually provides support for operations on key-value pairs, and can be stored in memory to speed up responses to queries. So far, existing methods have been deterministic: they will faithfully return previously inserted key-value pairs. Providing such accuracy, however, comes at the cost of memory and CPU time. In contrast, in this paper, we present an approximate k-v storage that is more compact than existing methods. The tradeoff is that it may, theoretically, return a null value for a valid key with a low probability, or return a valid value for a key that was never inserted. Its design is based on the probabilistic data structure called the “Bloom Filter”, which was originally developed to test element membership in sets. In this paper, we extend the bloom filter concept to support key-value operations, and demonstrate that it still retains the compact nature of the original bloom filter. We call the resulting design as the kBF (key-value bloom filter), and systematically analyze its performance advantages and design tradeoffs. Finally, we apply the kBF to a practical problem of implementing a state machine in network intrusion detection to demonstrate how the kBF can be used as a building block for more complicated software infrastructures
PROPOSED SYSTEM:
The approach we present in this paper aims to address these problems by supporting k-v operations with predictable performance and accuracy. We call it the “key-value bloom filter (kBF)”. In particular, it has the following three key contributions: first, to address the challenge of arbitrary keyvalue pairs, we propose a method to encode the values into a special type of binary encodings that can fit into the cells of bloom filters easily. These encodings are designed to be resilient to collisions, i.e., insertions and queries can still be effectively handled when one or more collisions occur in a cell. In particular, the decoding allows using k hashed locations collaboratively, rather than using any single one of them, so that the successful decoding ratio can be greatly improved. Second, to address the challenge to handle a very large number k-v pairs, we design the kBF to be elastic, so that its capacity can grow and shrink as needed while ensuring that the desired query performance is achieved. To this end, we develop growth and compaction operations on the kBF to support its capacity changes. Third, to address the challenge to achieve predictable performance, we systematically analyze the capacity and decoding ratio of the kBF to demonstrate its performance limits. We derive closed form results to this end, so that we can closely monitor the runtime performance of the kBF to ensure a satisfactory quality of service. In summary, kBF represents a novel type of the bloom filter that supports key-value operations using compact memory storage. To further illustrate its effectiveness, we demonstrate through a specific application example: we use it as a building block to enforce TCP state transition rules by developing an approximate concurrent state machine (ACSM). Using ACSMs, a router can efficiently keep track of many regular expression matchings simultaneously to detect potential intrusions
EXISTING SYSTEM:
In this section, we describe related work in three parts: first we describe the original Bloom Filter design, then we describe its variants, and finally, we describe the related work to the network applications of the Bloom Filter. The bloom filter, originally developed by Burton H. Bloom [8], is a space efficient randomized data structure that answers the question about membership tests. Recently it has received great attention in the networking area [11], [12], [13]. Specifically, the bloom filter allows insertions and queries of elements in sets, by hashing an element to k different locations in a bit array of m bits. To add an element, each of the k bits is set to 1. To query it, each of the k bits is tested against 1, and any 0 found will tell that the element is not in the set. In this way, no false negatives will occur, but false positives are possible, since all k bits might have been set to 1 due to other elements have been hashed to the same positions. The bloom filter is highly compact: it needs 10 bits to store each element to achieve a false positive rate of 1%, independent of the number and size of the inserted elements. Therefore, in situations where only limited on-chip RAM is available, a bloom filter becomes particularly useful. After the original Bloom Filter was proposed, many variants followed [13], [14], [15]. One relevant work is the counting bloom filter [9], which has m counters along with m bits. This way, the CBF can support not only deletion operations, but also frequency queries. However, the CBF is not designed for key-value operations, hence, is also significantly different from our work
CONCLUSIONS
In this paper, by using the classic bloom filter as a base design, we extend it into a approximate key-value storage scheme called the kBF. We present a comprehensive investigation on the algorithm design of the kBF, analyze its performance in storing large datasets, and evaluate its performance in both synthetic workloads and a real application study. According to our experiment results, the kBF is highly compact, and supports insertion, query, update and deletion operations with adjustable error ratios. Compared to deterministic schemes, the kBF is more suitable to be implemented in devices with limited RAM space and timing constraints, as long as approximate results are tolerated by application semantics. We also demonstrate, through an application case study for detecting suspicious TCP flows, the kBF performs much better than the related approach in the literature in terms of error rates. Therefore, we believe that our study of the kBF can be beneficial for fast and low overhead key-value storage purposes in a wide range of applications
REFERENCES
[1] V. Vasudevan, M. Kaminsky, and D. G. Andersen, “Using Vector Interfaces To Deliver Millions Of Iops From A Networked Key-value Storage Server,” in Proceedings of the ACM SOCC, 2012.
[2] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, “Workload Analysis Of A Large-scale Key-value Store,” in Proceedings of the ACM SIGMETRICS, 2012.
[3] G. Decandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo : Amazon’s Highly Available Key-value Store,” in Proceedings of the ACM SOSP, 2007, pp. 205–220.
[4] Apache Foundation, “Cassandra Website,” http://cassandra.apache.org/.
[5] “Memcached Website,” http://www.memcached.org/.
[6] “Redis Website,” http://redis.io/.
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A Distributed Storage System for Structured Data,” in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.
[8] B. H. Bloom, “Space / Time Trade-offs in Hash Coding with Allowable Errors,” Communications of the ACM, vol. 13, no. 7, 1970.
[9] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary cache: a scalable wide-area web cache sharing protocol,” IEEE/ACM Transactions on Networking, vol. 8, no. 3, pp. 281–293, Jun. 2000. [Online]. Available: http://dx.doi.org/10.1109/90.851975
[10] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “Beyond Bloom Filters : From Approximate Membership Checks to Approximate State Machines,” in Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2006, pp. 315–326.
[11] A. Broder and M. Mitzenmacher, “Network Applications of Bloom Filters: A Survey,” Internet Mathematics, vol. 1, no. 4, pp. 485–509, 2003.