Commit 2682e01d, authored 6 years ago by Martin Mareš

Bloom filters: Optimization cont'd

parent 07035907
Showing 1 changed file: 06-hash/hash.tex with 17 additions and 13 deletions.
@@ -813,7 +813,7 @@ Let $\varepsilon > 0$ be the desired error probability
 and $n$~the maximum number of items in the set.
 The $k$-band Bloom filter with $m = 2n$ and $k = \lceil \log(1/\varepsilon) \rceil$
 gives false positives with probability at most~$\varepsilon$.
-It requires $\O(m \log(1/\varepsilon))$ bits of memory and both
+It requires $2n \lceil \log(1/\varepsilon) \rceil$ bits of memory and both
 \alg{Insert} and \alg{Lookup} run in time $\O(k)$.
 }
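For concreteness, here is a minimal sketch (not from the notes) of the filter this theorem describes: $k$ independent bands of $m = 2n$ bits each, one hash function per band, where \alg{Insert} sets one bit per band and \alg{Lookup} answers "maybe present" only if all $k$ bits are set. The class name and the use of salted blake2b as a stand-in for the notes' perfectly random hash functions are assumptions of this sketch.

import hashlib
import math

class BandBloomFilter:
    """Sketch of a k-band Bloom filter: k bands of m = 2n bits, one hash per band."""

    def __init__(self, n, eps):
        self.k = math.ceil(math.log2(1 / eps))          # number of bands
        self.m = 2 * n                                  # bits per band
        self.bands = [bytearray((self.m + 7) // 8) for _ in range(self.k)]

    def _bit(self, band, item):
        # Bit position of `item` inside band number `band`; a salted blake2b
        # stands in for the perfectly random hash functions assumed in the notes.
        h = hashlib.blake2b(str(item).encode(), salt=band.to_bytes(8, "little"))
        return int.from_bytes(h.digest(), "little") % self.m

    def insert(self, item):
        for b in range(self.k):                         # O(k) time
            i = self._bit(b, item)
            self.bands[b][i // 8] |= 1 << (i % 8)

    def lookup(self, item):
        # "Maybe present" only if every band has a 1 at item's position;
        # a false positive needs all k independent bits set, probability <= eps.
        for b in range(self.k):
            i = self._bit(b, item)
            if not (self.bands[b][i // 8] >> (i % 8)) & 1:
                return False
        return True

For example, $n = 1000$ and $\varepsilon = 0.01$ give $k = 7$ bands of $2000$ bits each, 14000 bits in total, matching the $2n \lceil \log(1/\varepsilon) \rceil$ bound above.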
@@ -825,10 +825,11 @@ consumption reaches only 20\thinspace Mb.
 \subsection{Optimizing parameters}

 The multi-band filter works well, but it turns out that we can fine-tune its parameters
-to obtain even better results (although only by a~constant factor). We can view it as
-an~optimization problem: given a~memory budget of~$M$ bits, set the parameters $m$ and~$k$
-such that the filter fits in memory ($mk \le M$) and the error probability is minimized.
-We will assume that all hash functions are perfectly random.
+to improve memory consumption by a~constant factor. We can view it as
+an~optimization problem: given a~memory budget of~$M$ bits, set the parameters $m$ and~$k$
+such that the filter fits in memory ($mk \le M$) and the error
+probability is minimized. We will assume that all hash functions are perfectly
+random.

 Let us focus on a~single band first. If we select its size~$m$, we can easily
 calculate probability that a~given bit is zero. We have $n$~items, each of them hashed
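The single-band calculation that this hunk starts (the sentence is cut off by the diff) can be made concrete. Assuming, as in the construction above, that each of the $n$ items sets one bit in a band of $m$ bits, a fixed bit stays zero with probability $(1 - 1/m)^n \approx \e^{-n/m} = p$; inverting this is where the relation $m \approx -n/\ln p$ in the next hunk comes from. A quick numeric check in Python, with arbitrary example values (a sketch, not part of the notes):

import math

n = 10_000
for p in (0.3, 0.5, 0.7):
    m = round(-n / math.log(p))     # band size giving zero-bit probability p
    exact = (1 - 1 / m) ** n        # exact probability that a fixed bit stays 0
    approx = math.exp(-n / m)       # the e^(-n/m) approximation
    print(f"p = {p}: m = {m}, (1-1/m)^n = {exact:.4f}, e^(-n/m) = {approx:.4f}")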
@@ -840,26 +841,29 @@ is the probability of false positives. We will find~$p$ such that this probability is
 minimized.

 If we set~$p$, it follows that $m \approx -n/\ln p$. Since all bands must fit in $M$~bits
-of memory, we must have $k = \lfloor M/m \rfloor \approx -M/n \cdot \ln p$ bands. False
+of memory, we want to use $k = \lfloor M/m \rfloor \approx -M/n \cdot \ln p$ bands. False
 positives occur if we find~1 in all bands, which has probability
 $$(1-p)^k \approx \e^{k \ln(1-p)} \approx \e^{-M/n \cdot \ln p \cdot \ln(1-p)}.$$
-As $\e^x$ is an increasing function, it suffices to minimize $\ln p \cdot \ln(1-p)$
-for $p \in (0,1)$. By elementary calculus, the minimum is attained for $p = 1/2$. This
+As $\e^{-x}$ is a~decreasing function, it suffices to maximize $\ln p \cdot \ln(1-p)$
+for $p \in (0,1)$. By elementary calculus, the maximum is attained for $p = 1/2$. This
 leads to false positive probability $(1/2)^k = 2^{-k}$. If we want to push this under~$\varepsilon$,
-we want to set $k = \lceil\log(1/\varepsilon)\rceil$,
+we set $k = \lceil\log(1/\varepsilon)\rceil$,
 so $M = kn/\ln 2 \approx n \cdot \log(1/\varepsilon) \cdot (1/\ln 2) \doteq n \cdot \log(1/\varepsilon) \cdot 1.44$.

 % TODO: Plot ln(p)*ln(1-p)

-This improves the constant~2 from the previous construction to approximately 1.44
-(TODO).
+This improves the constant from the previous theorem from~2 to circa 1.44.

-TODO: Lower bound.
+\note{
+It is known that any approximate membership data structure with false positive
+probability~$\varepsilon$ and no false negatives must use at least $n \log(1/\varepsilon)$
+bits of memory. The optimized Bloom filter is therefore within a~factor of 1.44 from the
+optimum.}

-\subsection{Merged filters}
+\subsection{Single-table filters}

 TODO
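As a numeric cross-check of the optimization added in this commit (again a sketch, not part of the notes): $\ln p \cdot \ln(1-p)$ is indeed maximized at $p = 1/2$, and the resulting memory of roughly $1.44 \cdot n \log(1/\varepsilon)$ bits can be compared with the $2n \lceil \log(1/\varepsilon) \rceil$ bits of the previous theorem and the $n \log(1/\varepsilon)$ lower bound quoted in the \note. The helper name f and the sample values of $\varepsilon$ are chosen for illustration only.

import math

def f(p):
    # False positive probability is exp(-M/n * f(p)), so we want f as large as possible.
    return math.log(p) * math.log(1 - p)

best = max((i / 1000 for i in range(1, 1000)), key=f)
print(best, f(best))                        # 0.5, (ln 2)^2 ~ 0.4805

for eps in (0.01, 0.001):
    k = math.ceil(math.log2(1 / eps))       # number of bands
    optimized = k / math.log(2)             # bits per item with p = 1/2, about 1.44 * k
    previous = 2 * k                        # bits per item of the m = 2n construction
    lower = math.log2(1 / eps)              # information-theoretic lower bound
    print(f"eps = {eps}: optimized {optimized:.1f}, previous {previous}, "
          f"lower bound {lower:.2f} bits per item")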