Commit
43f9bcc7
authored
6 years ago
by
Martin Mareš
Bloom filters: 1-band and k-band version
parent
512395ee
06-hash/hash.tex
+98
-0
98 additions, 0 deletions
@@ -678,4 +678,102 @@ is at most a~constant. This concludes the proof of the theorem.
TODO: Concentration inequalities and 5-independence.
\section{Bloom filters}

Bloom filters are a~family of data structures for approximate representation of sets
in a~small amount of memory. A~Bloom filter starts with an empty set. Then it supports
insertion of new elements and membership queries. Sometimes, the filter gives
a~\em{false positive} answer: it answers {\csc yes} even though the element is not in the set.
We will calculate the probability of false positives and decrease it at the expense of
making the structure slightly larger. False negatives will never occur.

\subsection{A trivial example}

We start with a~very simple filter. Let~$h$ be a~hash function from a~universe~$\cal U$
to $[m]$, picked at random from a~$c$-universal family. For simplicity, we will assume
that $c=1$. The output of the hash function will serve as an~index to an~array
$B[0\ldots m-1]$ of bits.

At the beginning, all bits of the array are zero.
When we insert an element~$x$, we simply set the bit $B[h(x)]$ to~1.
A~query for~$x$ tests the bit $B[h(x)]$ and answers {\csc yes} iff the bit is set to~1.
(We can imagine that we are hashing items to $m$~buckets, but we store only which
buckets are non-empty.)

Suppose that we have already inserted items $x_1,\ldots,x_n$. If we query the filter
for any~$x_i$, it always answers {\csc yes.} But if we ask for a~$y$ different from
all~$x_i$'s, we can get a~false positive answer if $y$~falls to the same bucket
as one of the $x_i$'s.

Let us calculate the probability of a~false positive answer.
For a~concrete~$i$, we have $\Pr_h[h(y) = h(x_i)] \le 1/m$ by 1-universality.
By the union bound, the probability that $h(y) = h(x_i)$ for at least one~$i$
is at most $n/m$.

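
The union-bound step can be written out explicitly. In this sketch, the event
names~$E_i$ are introduced only for illustration: let $E_i$ denote the event
$h(y) = h(x_i)$. Then
$$
\Pr\nolimits_h[E_1 \cup \ldots \cup E_n]
\le \sum_{i=1}^n \Pr\nolimits_h[E_i]
\le n\cdot (1/m) = n/m.
$$
No independence between the events is needed here. The bound is informative
only when $n < m$.
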
We can ask an~inverse question, too: how large a~filter do we need to keep the error
probability at most a~given $\varepsilon > 0$? By our calculation,
$\lceil n/\varepsilon\rceil$ bits suffice. It is interesting that this size does not
depend on the size of the universe: all previous data structures required at least
$\log\vert{\cal U}\vert$ bits per item.
On the other hand, the size scales badly with error probability: for example,
a~filter for $10^6$ items with $\varepsilon = 0.01$ requires 100\thinspace Mbit.

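
As a~quick check of this figure, we can substitute into the bound
$m = \lceil n/\varepsilon\rceil$ derived above. The conversion to bytes is
an~added illustration, assuming 1\thinspace MB $= 8\cdot 10^6$ bits:
$$
m = \lceil 10^6 / 0.01 \rceil = 10^8 \hbox{ bits} = 100 \hbox{ Mbit} = 12.5 \hbox{ MB}.
$$
Since $m$ is proportional to $1/\varepsilon$, halving the error probability
doubles the size of the filter.
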
\subsection{Multi-band filters}

To achieve the same error probability in smaller space, we can simply run
multiple filters in parallel. We choose $k$~hash functions $h_1,\ldots,h_k$,
where $h_i$~maps the universe to a~separate array~$B_i$ of $m$~bits. Each
pair $(B_i,h_i)$ is called a~\em{band} of the filter.

Insertion adds the new item to all bands. A~query asks all bands and it answers
{\csc yes} only if each band answered {\csc yes}.

We shall calculate the error probability of the $k$-band filter. Suppose that we set
$m=2n$, so that each band gives a~false positive with probability at most $1/2$.
The whole filter gives a~false positive only if all bands did, which happens with
probability at most $2^{-k}$ if the functions $h_1,\ldots,h_k$ were chosen independently.
This proves the following theorem.

\theorem{
Let $\varepsilon > 0$ be the desired error probability
and $n$~the maximum number of items in the set.
The $k$-band Bloom filter with $m=2n$ and $k = \lceil \log(1/\varepsilon)\rceil$
gives false positives with probability at most~$\varepsilon$.
It requires $\O(m\log(1/\varepsilon))$ bits of memory and both
\alg{Insert} and \alg{Lookup} run in time $\O(k)$.
}

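
The parameter choice in the theorem can be verified directly. This is only
a~sketch, assuming that $\log$ denotes the binary logarithm and that the
functions $h_1,\ldots,h_k$ are independent. For an~item~$y$ outside the set,
each band gives a~false positive with probability at most $n/m = 1/2$, so
$$
\Pr[\hbox{all $k$ bands give a~false positive for $y$}]
\le (n/m)^k = 2^{-k} \le 2^{-\log(1/\varepsilon)} = \varepsilon,
$$
where the last inequality uses $k \ge \log(1/\varepsilon)$.
The memory consists of $k$~arrays of $m$~bits each, which is
$mk = 2n\lceil\log(1/\varepsilon)\rceil$ bits in total.
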
In the example with $n=10^6$ and $\varepsilon=0.01$, we get $m=2\cdot 10^6$
and $k=7$, so the whole filter requires 14\thinspace Mbit. If we decrease
$\varepsilon$ to $0.001$, we have to increase~$k$ only to~10, so the memory
consumption reaches only 20\thinspace Mbit.

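
These figures follow from the theorem when $\log$ is read as the binary
logarithm. The comparison with the single-band bound $\lceil n/\varepsilon\rceil$
is added here for illustration. For $\varepsilon = 0.01$:
$$
k = \lceil \log 100 \rceil = \lceil 6.64\ldots \rceil = 7,
\qquad mk = 7\cdot 2\cdot 10^6 = 1.4\cdot 10^7 \hbox{ bits}
\qquad\hbox{(single band: $10^8$ bits).}
$$
For $\varepsilon = 0.001$:
$$
k = \lceil \log 1000 \rceil = \lceil 9.96\ldots \rceil = 10,
\qquad mk = 10\cdot 2\cdot 10^6 = 2\cdot 10^7 \hbox{ bits}
\qquad\hbox{(single band: $10^9$ bits).}
$$
The multi-band filter is therefore roughly 7~times smaller in the first case
and 50~times smaller in the second.
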
\subsection{Optimizing parameters}

% The multi-band filter works well, but it turns out that we can fine-tune its parameters
% to obtain even better results (although only by a~constant factor). We can view it as
% an~optimization problem: given a~memory budget of~$M$ bits, set the parameters $m$ and~$k$
% such that the filter fits in memory ($mk \le M$) and the error probability is minimized.
%
% Let us focus on a~single band first and calculate probability of false positives.
% This time, we will assume that all hash functions are perfectly random.
% A~concrete~$x_i$ maps to a~bucket~$j$ with probability $1/m$,
% so the probability that the bit~$B[j]$ is zero after all $n$ items are inserted
% is $(1-(1/m))^n$. This can be approximated by $p = \e^{-n/m}$.
% For an item~$y$ outside the set, we get a~false positive only if $B[h(y)]=1$,
% which happens with probability approximately $1-p$.
TODO

\subsection{Merged filters}

TODO

\subsection{Counting filters}

TODO

\subsection{Representing functions: the Bloomier filters}

TODO

\endchapter