datovky / ds2-notes, commit 319acc0d, authored 6 years ago by Martin Mareš
Bloom filters: more variants
parent ec53dced
Showing 2 changed files with 72 additions and 5 deletions:
06-hash/hash.tex (69 additions, 4 deletions)
tex/adsmac.tex (3 additions, 1 deletion)

06-hash/hash.tex, viewed at 319acc0d
...
...
@@ -865,14 +865,79 @@ optimum.}
\subsection{Single-table filters}
-TODO

It is also possible to construct a~Bloom filter in which multiple hash functions
point to bits in a~shared table. (In fact, this was the original construction by Bloom.)
Consider $k$~hash functions $h_1,\ldots,h_k$ mapping the universe to~$[m]$ and a~bit array
$B[0,\ldots,m-1]$. $\alg{Insert}(x)$ sets the bits $B[h_1(x)],\ldots,B[h_k(x)]$ to~1.
$\alg{Lookup}(x)$ returns {\csc yes} if all these bits are set.
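
The Insert and Lookup operations described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name, the one-byte-per-bit table, and the use of salted BLAKE2b digests as a stand-in for $k$ independent random hash functions are all choices made here, not part of the notes.

```python
import hashlib


class SingleTableBloomFilter:
    """Illustrative single-table Bloom filter with k hash functions over m bits."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        # One byte per bit keeps the sketch simple; a real table would pack bits.
        self.bits = bytearray(m)

    def _positions(self, x):
        # Assumption: salted BLAKE2b digests approximate k independent hash functions.
        data = repr(x).encode()
        for i in range(self.k):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def insert(self, x):
        # Set B[h_1(x)], ..., B[h_k(x)] to 1.
        for pos in self._positions(x):
            self.bits[pos] = 1

    def lookup(self, x):
        # Answer "yes" only if all k bits are set (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(x))
```

Inserted items are always reported as present; items that were never inserted are usually, but not always, reported as absent.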

This filter can be analysed similarly to the $k$-band version. We will assume that
all hash functions are perfectly random and mutually independent.

Insertion of $n$~elements sets $kn$ bits (not necessarily distinct), so the
probability that a~fixed bit $B[i]$ is still zero is $(1-1/m)^{nk}$, which is approximately
$p = \e^{-nk/m}$. We will find the optimum value of~$p$, for which the probability
of false positives is minimized. For fixed~$m$, we get $k = -m/n\cdot\ln p$.

We get a~false positive if all bits $B[h_i(x)]$ are set. This happens with probability
approximately\foot{We are cheating a~little bit here: the events $B[i]=1$ for different~$i$
are not mutually independent. However, further analysis shows that
they are only weakly correlated, so our approximation holds.}
$(1-p)^k = (1-p)^{-m/n\cdot\ln p} = \exp(-m/n\cdot\ln p\cdot\ln(1-p))$.
Again, this is minimized for $p = 1/2$. So for a~fixed error probability~$\varepsilon$,
we get $k = \lceil\log(1/\varepsilon)\rceil$ and $m = kn / \ln 2 \doteq 1.44\cdot n\cdot\lceil\log(1/\varepsilon)\rceil$.

We see that as far as our approximation can tell, single-table Bloom filters
achieve the same performance as the $k$-band version.
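
To make the parameter choice concrete, the following Python sketch computes $k$ and $m$ from $n$ and $\varepsilon$ and then evaluates the approximate false-positive probability $(1-p)^k$ with $p = \e^{-nk/m}$. The function names are hypothetical, the logarithm in the formula for $k$ is taken to be binary (as the optimum $p=1/2$ implies), and $m$ is rounded up to an integer.

```python
import math


def bloom_parameters(n, eps):
    """Return (k, m) for n items and target error eps, following k = ceil(log2(1/eps)), m = k*n/ln 2."""
    k = math.ceil(math.log2(1 / eps))
    m = math.ceil(k * n / math.log(2))  # rounded up so that p does not drop below 1/2
    return k, m


def false_positive_estimate(n, m, k):
    """Approximate false-positive probability (1 - p)^k with p = exp(-n*k/m)."""
    p = math.exp(-n * k / m)
    return (1 - p) ** k


k, m = bloom_parameters(1000, 0.01)
print(k, m, false_positive_estimate(1000, m, k))
```

For $n=1000$ and $\varepsilon=0.01$ the estimate stays below the target because $k$ is rounded up.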
% TODO
% \subsection{Set operations}
\subsection{Counting filters}
-TODO

An~ordinary Bloom filter does not support deletion: when we delete an~item, we do not
know if some of its bits are shared with other items. There is an~easy solution: instead
of bits, keep $b$-bit counters $C[0\ldots m-1]$. \alg{Insert} increments the counters, \alg{Delete} decrements
them, and \alg{Lookup} returns {\csc yes} if all counters are non-zero.
-\subsection{Representing functions: the Bloomier filters}

However, since the counters have limited range, they can overflow. We will handle overflows
by keeping the counter at the maximum allowed value $2^b-1$, which will not be changed by
subsequent insertions or deletions. We say that the counter is \em{stuck.} Obviously,
too many stuck counters will degrade the data structure. We will show that this happens
only with small probability.
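
The counting filter with stuck counters can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the class name, the list of Python integers standing in for $b$-bit counters, and the salted BLAKE2b digests used as hash functions are assumptions made here.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting filter: m counters of b bits each, k hash functions."""

    def __init__(self, m, k, b=4):
        self.m = m
        self.k = k
        self.max_value = (1 << b) - 1  # a counter at 2^b - 1 is stuck
        self.counters = [0] * m

    def _positions(self, x):
        # Assumption: salted BLAKE2b digests approximate k independent hash functions.
        data = repr(x).encode()
        for i in range(self.k):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def insert(self, x):
        for pos in self._positions(x):
            if self.counters[pos] < self.max_value:
                self.counters[pos] += 1  # may reach max_value and become stuck

    def delete(self, x):
        # Only meaningful for items that were inserted before.
        for pos in self._positions(x):
            if 0 < self.counters[pos] < self.max_value:
                self.counters[pos] -= 1  # stuck counters are never decremented

    def lookup(self, x):
        return all(self.counters[pos] > 0 for pos in self._positions(x))
```

A stuck counter can only cause additional false positives, never false negatives, which is why freezing it at the maximum is safe.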
-TODO

We will assume a~single-band filter with one fully random hash function and $m$~counters after
insertion of~$n$ items. For fixed counter value~$t$, we have
$$
\Pr[C[i]=t] = {n\choose t}\cdot \left(1\over m\right)^t \cdot \left(1 - {1\over m}\right)^{n-t},
$$
because for each of the $n\choose t$ $t$-tuples of items we have probability $(1/m)^t$ that the
tuple is hashed to~$i$ and probability $(1-1/m)^{n-t}$ that all other items are
hashed elsewhere.
If $C[i]\ge t$, there must exist a~$t$-tuple hashed to~$i$ and the remaining items
can be hashed anywhere. Therefore:
$$
\Pr[C[i]\ge t] \le {n\choose t}\cdot \left(1\over m\right)^t.
$$
Since ${n\choose t} \le (n\e/t)^t$, we have
$$
\Pr[C[i]\ge t] \le \left( n\e \over t \right)^t \cdot \left( 1\over m \right)^t = \left( n\e \over mt \right)^t.
$$
As we already know that the optimum~$m$ is approximately $n / \ln 2$, the probability is
at most $(\e\ln 2 / t)^t$.
By the union bound, the probability that there exists a~stuck counter is at most $m$~times larger.
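
As a sanity check of the derivation, the following Python sketch compares the exact binomial tail $\Pr[C[i]\ge t]$ with the bound $(n\e/mt)^t$ for a small table. The function names and the choice $n=1000$ are illustrative assumptions; $m$ is taken as $n/\ln 2$ rounded to the nearest integer.

```python
import math


def exact_tail(n, m, t):
    """Exact Pr[C[i] >= t] when n items are hashed uniformly into m counters."""
    q = 1.0 / m
    return sum(
        math.comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(t, n + 1)
    )


def tail_bound(n, m, t):
    """Upper bound (n*e / (m*t))^t from the derivation above."""
    return (n * math.e / (m * t)) ** t


n = 1000
m = round(n / math.log(2))  # approximately optimal table size for one band
for t in (1, 2, 4, 8, 15):
    print(t, exact_tail(n, m, t), tail_bound(n, m, t))
```

The bound is loose for small~$t$ (it even exceeds~1 for $t=1$), but it decreases faster than exponentially in~$t$, which is all the argument needs.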
\example{
A~4-bit counter is stuck when it reaches $t=15$, which by our bound happens with probability at most
$3.06\cdot 10^{-14}$.
If we have $m = 10^9$ counters, the probability that any is stuck is at most $3.06\cdot 10^{-5}$.
So for any reasonably large table, 4-bit counters are sufficient and they seldom get stuck.
Of course, for a~very long sequence of operations, stuck counters eventually accumulate,
so we should preferably rebuild the structure occasionally.
}
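
The numbers in this example can be reproduced with a few lines of Python. The variable names are illustrative; the computation simply evaluates $(\e\ln 2/t)^t$ for $t=15$ and multiplies it by $m=10^9$ for the union bound.

```python
import math

t = 15                      # a 4-bit counter is stuck at 2^4 - 1 = 15
per_counter = (math.e * math.log(2) / t) ** t
m = 10**9                   # number of counters in the example
any_stuck = m * per_counter  # union bound over all counters

print(f"per-counter bound: {per_counter:.3g}")
print(f"bound for any of m counters: {any_stuck:.3g}")
```

Trying other widths in the same way, for instance $t=7$ for 3-bit counters, shows how quickly the bound weakens as the counters get narrower.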
% TODO
% \subsection{Representing functions: the Bloomier filters}
\endchapter
tex/adsmac.tex, viewed at 319acc0d (3 additions, 1 deletion)
...
...
@@ -214,7 +214,9 @@
% Footnotes
\newcount\footcnt
\footcnt=0
-\def\foot#1{\global\advance\footcnt by 1 \footmark{\the\footcnt}%
\def\foot#1{%
\nobreak\hskip 0pt  % Allow hyphenation of the preceding word
\global\advance\footcnt by 1 \footmark{\the\footcnt}%
\insert\footins{\interlinepenalty=\interfootnotelinepenalty
\splittopskip=\ht\strutbox
...
...