Commit 8d2da197 authored by Martin Mareš

Hashing: Proof for the scalar product family

parent a5fa737c
@@ -21,7 +21,7 @@
 We say that the family is \em{$c$-universal} for some $c>0$ if
 for every pair $x,y\in {\cal U}$ of distinct items we have
 $$
-\Pr_{h\in{\cal H}} [h(x) = h(y)] \le {c\over m}.
+\Prsub{h\in{\cal H}} [h(x) = h(y)] \le {c\over m}.
 $$
 In other words, if we pick a~hash function~$h$ uniformly at random from~$\cal H$,
 the probability that $x$ and~$y$ collide is at most $c$-times more than for
@@ -90,7 +90,7 @@ The family is \em{$(k,c)$-independent} for integer $k\ge 1$ and real $c>0$ iff
 for every $k$-tuple $x_1,\ldots,x_k$ of distinct items of~$\cal U$
 and every $k$-tuple $a_1,\ldots,a_k$ of buckets in~$[m]$, we have
 $$
-\Pr_{h\in{\cal H}} [h(x_1) = a_1 \land \ldots \land h(x_k) = a_k] \le {c\over m^k}.
+\Prsub{h\in{\cal H}} [h(x_1) = a_1 \land \ldots \land h(x_k) = a_k] \le {c\over m^k}.
 $$
 That is, if we pick a~hash function~$h$ uniformly at random from~$\cal H$,
 the probability that the given items are mapped to the given buckets is
@@ -193,7 +193,7 @@ is $2c$-universal and $(2,4)$-independent.
 \proof
 Consider universality first. For two given items $x_1\ne x_2$, we should show that
-$\Pr_{h\in{\cal H}}[h(x_1) = h(x_2)] \le 2c/m$. The event $h(x_1) = h(x_2)$ can be
+$\Prsub{h\in{\cal H}}[h(x_1) = h(x_2)] \le 2c/m$. The event $h(x_1) = h(x_2)$ can be
 written as a~union of disjoint events $h(x_1)=i_1 \land h(x_2)=i_2$ over all
 pairs $(i_1,i_2)$ such that $i_1$ is congruent to~$i_2$ modulo~$m$. So we have
 $$
@@ -208,7 +208,7 @@ needed.
 For 2-independence, we proceed in a~similar way. We are given two items $x_1\ne x_2$
 and two buckets $j_1,j_2\in [m]$. We are bounding
 $$
-\Pr_h[h(x_1) \bmod m = j_1 \land h(x_2) \bmod m = j_2] =
+\Prsub{h}[h(x_1) \bmod m = j_1 \land h(x_2) \bmod m = j_2] =
 \sum_{i_1\equiv j_1\atop i_2\equiv j_2} \Pr[h(x_1) = i_1 \land h(x_2) = i_2].
 $$
 Again, each term of the sum is at most $c/r^2$. There are at most $\lceil r/m\rceil \le (r+m-1)/m$
@@ -255,8 +255,8 @@ is $(2,c')$-independent for $c' = (cm/r+1)d$.
 \proof
 Given distinct $x_1, x_2\in {\cal U}$ and $i_1,i_2\in [m]$, we should bound
 $$
-\Pr_{h\in{\cal H}} [h(x_1) = i_1 \land h(x_2) = i_2] =
-\Pr_{f\in{\cal F}, g\in{\cal G}} \; [g(f(x_1)) = i_1 \land g(f(x_2)) = i_2].
+\Prsub{h\in{\cal H}} [h(x_1) = i_1 \land h(x_2) = i_2] =
+\Prsub{f\in{\cal F}, g\in{\cal G}} \; [g(f(x_1)) = i_1 \land g(f(x_2)) = i_2].
 $$
 It is tempting to apply 2-independence of~$\cal G$ on the intermediate results $f(x_1)$
 and $f(x_2)$, but unfortunately we cannot be sure that they are distinct. Fortunately,
@@ -294,14 +294,14 @@ so $(1 + cm/r)d \le (1+c)d$.
 So far, we constructed $k$-independent families only for $k=2$. Families with
 higher independence can be obtained from polynomials of degree less than~$k$ over a~field.
-\defn{For any field $\Z_p$ and any $k\ge 1$, we define the family of polynomial
-hash functions ${\cal P}_k = \{ h_{\bf a} \mid {\bf a} \in \Z_p^k \}$ from $\Z_p$ to~$\Z_p$,
-where $h_{\bf a}(x) = \sum_{i=0}^{k-1} a_ix^i$.}
+\defn{For any field $\Zp$ and any $k\ge 1$, we define the family of polynomial
+hash functions ${\cal P}_k = \{ h_\a \mid \a \in \Zp^k \}$ from $\Zp$ to~$\Zp$,
+where $h_\a(x) = \sum_{i=0}^{k-1} a_ix^i$.}
 \lemma{The family ${\cal P}_k$ is $(k,1)$-independent.}
 \proof
-Let $x_1,\ldots,x_k\in\Z_p$ be distinct items and $a_1,\ldots,a_k\in\Z_p$ buckets.
+Let $x_1,\ldots,x_k\in\Zp$ be distinct items and $a_1,\ldots,a_k\in\Zp$ buckets.
 By standard results on polynomials, there is exactly one polynomial~$h$ of degree less than~$k$
 such that $h(x_i) = a_i$ for every~$i$. Hence the probability that a~random polynomial
 of degree less than~$k$ satisfies this property is $1/p^k$.
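The interpolation argument in the hunk above can be checked by brute force for a small field. The following is only an illustrative sketch, not part of the commit: the helper names `poly_hash` and `count_matching` and the choice $p=5$, $k=3$ are assumptions made for the example. For every assignment of buckets it counts the coefficient vectors that hit all of them and expects the count to be exactly one.

```python
from itertools import product


def poly_hash(coeffs, x, p):
    """Evaluate sum(coeffs[i] * x**i) modulo the prime p by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


def count_matching(p, xs, targets):
    """Count coefficient vectors in Z_p^k mapping every xs[i] to targets[i]."""
    k = len(xs)
    return sum(
        all(poly_hash(coeffs, x, p) == t for x, t in zip(xs, targets))
        for coeffs in product(range(p), repeat=k)
    )


# Hypothetical small instance: p = 5, three distinct items, so k = 3.
p, xs = 5, (0, 2, 3)
counts = {
    count_matching(p, xs, targets)
    for targets in product(range(p), repeat=len(xs))
}
```

Since there are $p^k$ coefficient vectors in total and exactly one of them matches any given bucket assignment, the probability is $1/p^k$, as the lemma states.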
@@ -398,38 +398,52 @@ a~random vector.
 \defn{For a~prime~$p$ and vector size $d\ge 1$, we define the family of
 scalar product hash functions
-${\cal S} = \{ h_{\bf a} \mid {\bf a} \in \Z_p^d \}$ from~$\Z_p^d$ to~$\Z_p$, where
-$h_{\bf a}({\bf x}) = {\bf a} \cdot {\bf x}$.
+${\cal S} = \{ h_\a \mid \a \in \Zp^d \}$ from~$\Zp^d$ to~$\Zp$, where
+$h_\a(\x) = \a \cdot \x$.
 }
 \theorem{The family $\cal S$ is 1-universal. A~function can be picked at random
 from~$\cal S$ in time $\Theta(d)$ and evaluated in the same time.}
 \proof
-TODO
+Consider two distinct vectors $\x, \y \in \Zp^d$. Let $j$ be a~coordinate
+for which $\x_j \ne \y_j$. As the scalar product does not depend on the ordering
+of components, we can renumber the components so that $j=d$.
+For a~random choice of the parameter~$\bf a$, we have (in~$\Zp$):
+$$\eqalign{
+&\Prsub{\a\in\Zp^d} [ h_\a(\x) = h_\a(\y) ] =
+\Pr [ \x\cdot\a = \y\cdot\a ] =
+\Pr [ (\x-\y)\cdot\a = 0 ] = \cr
+&= \Pr \left[ \sum_{i=1}^d (\x_i-\y_i)\a_i = 0 \right] =
+\Pr \left[ (\x_d-\y_d)\a_d = -\sum_{i=1}^{d-1} (\x_i-\y_i)\a_i \right]. \cr
+}$$
+For every choice of $\a_1,\ldots,\a_{d-1}$, there exists exactly one
+value of~$\a_d$ for which the last equality holds, since $\x_d-\y_d\ne 0$ is invertible in~$\Zp$. Therefore it holds
+with probability $1/p$.
 \qed
 As usual, we can reduce the result modulo~$m<p$. By Lemma~\xx{M}, a~family
-${\cal S}\bmod m$ from~$\Z_p^d$ to $[m]$ is 2-universal.
+${\cal S}\bmod m$ from~$\Zp^d$ to $[m]$ is 2-universal.
 To obtain 2-independence, we simply compose ${\cal S}$ with the $(2,4)$-independent
 family~${\cal L}'$. By Lemma~\xx{G}, the result will be a~$(2,8)$-independent family,
 or even $(2,5)$-independent if $p\ge 4m$.
 The compound hash functions can be written as
-$(\alpha({\bf a}\cdot {\bf x}) + \beta) \bmod m$, where
-${\bf a}$ is a~vector parameter, and $\alpha$ and~$\beta$ are scalar parameters.
-However, $\alpha({\bf a} \cdot {\bf x})$ can be written as ${\bf a}' \cdot {\bf x}$
+$(\alpha(\a\cdot \x) + \beta) \bmod m$, where
+$\a$ is a~vector parameter, and $\alpha$ and~$\beta$ are scalar parameters.
+However, $\alpha(\a \cdot \x)$ can be written as $\a' \cdot \x$
 for some vector $\bf a'$ and if $\bf a$ and $\alpha$ were uniformly
 distributed, so is~$\bf a'$. So we can define the compound family in a~more
 compact way:
 \defn{For a~prime~$p$, vector size $d\ge 1$, and the number of buckets~$m$,
 we define the family of scalar product hash functions
-${\cal S}' = \{ h_{{\bf a},\beta} \mid {\bf a}\in \Z_p^d, \beta\in\Z_p \}$
-from~$\Z_p^d$ to $[m]$, where
-$h_{{\bf a},\beta}({\bf x}) = ({\bf a}\cdot {\bf x} + \beta) \bmod m$.
-(The operations in parentheses are performed in the field~$\Z_p$.)
+${\cal S}' = \{ h_{\a,\beta} \mid \a\in \Zp^d, \beta\in\Zp \}$
+from~$\Zp^d$ to $[m]$, where
+$h_{\a,\beta}(\x) = (\a\cdot \x + \beta) \bmod m$.
+(The operations in parentheses are performed in the field~$\Zp$.)
 }
 \theorem{If $p\ge 4m$, the family ${\cal S}'$ is $(2,5)$-independent.
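As a rough illustration of the scalar product families in the hunk above, the sketch below picks a random function of the form $(\a\cdot\x + \beta) \bmod m$ and, separately, counts by brute force how many parameter vectors make two given vectors collide in~$\cal S$. The function names, the use of Python's `random` module, and the small parameters are assumptions for the example only.

```python
import random
from itertools import product


def pick_scalar_product_hash(p, d, m, rng=random):
    """Pick h(x) = ((a . x + beta) mod p) mod m with a, beta uniform in Z_p."""
    a = [rng.randrange(p) for _ in range(d)]
    beta = rng.randrange(p)

    def h(x):
        # The scalar product and the shift are computed in the field Z_p,
        # only the final result is reduced modulo the number of buckets m.
        return (sum(ai * xi for ai, xi in zip(a, x)) + beta) % p % m

    return h


def colliding_parameters(p, x, y):
    """Count vectors a in Z_p^d with a . x == a . y in Z_p (brute force)."""
    d = len(x)
    return sum(
        sum(ai * (xi - yi) for ai, xi, yi in zip(a, x, y)) % p == 0
        for a in product(range(p), repeat=d)
    )
```

For distinct vectors, the proof predicts that exactly $p^{d-1}$ of the $p^d$ parameter vectors collide, that is a~fraction of $1/p$. With the hypothetical values $p=5$ and $d=3$, `colliding_parameters(5, (1, 2, 3), (1, 4, 3))` counts 25 such vectors.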
@@ -444,8 +458,8 @@ the polynomial is evaluated is chosen randomly.
 \defn{
 For a~prime~$p$ and vector size~$d$, we define the family of polynomial hash functions
-${\cal R} = \{ h_a \mid a\in\Z_p \}$ from $\Z_p^d$ to~$\Z_p$, where
-$h_a({\bf x}) = \sum_{i=0}^{d-1} {\bf x}_{i+1} \cdot a^i$.
+${\cal R} = \{ h_a \mid a\in\Zp \}$ from $\Zp^d$ to~$\Zp$, where
+$h_a(\x) = \sum_{i=0}^{d-1} \x_{i+1} \cdot a^i$.
 }
 \theorem{The family~$\cal R$ is $d$-universal.
@@ -453,10 +467,10 @@ A~function can be picked from~$\cal R$ at random in constant time
 and evaluated on a~given vector in $\Theta(d)$ time.}
 \proof
-Consider two vectors ${\bf x} \ne {\bf y}$ and a~hash function~$h_a$ chosen at random
-from~$\cal R$. A~collision happens whenever $\sum_i {\bf x}_{i+1} a^i = \sum_i {\bf y}_{i+1}
-a^i$. This is the same condition as $\sum_i ({\bf x}-{\bf y})_{i+1} a^i = 0$, that is if
-the number~$a$ is a~root of the polynomial ${\bf x} - {\bf y}$. Since a~polynomial
+Consider two vectors $\x \ne \y$ and a~hash function~$h_a$ chosen at random
+from~$\cal R$. A~collision happens whenever $\sum_i \x_{i+1} a^i = \sum_i \y_{i+1}
+a^i$. This is the same condition as $\sum_i (\x-\y)_{i+1} a^i = 0$, that is if
+the number~$a$ is a~root of the polynomial $\x - \y$. Since a~polynomial
 of degree at most~$d$ can have at most~$d$ roots (unless it is identically zero),
 the probability that $a$~is a~root is at most $d/p$. This implies $d$-universality.
 \qed
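The root-counting argument in the hunk above can be illustrated with a small sketch. It evaluates $h_a(\x) = \sum_i \x_{i+1} a^i$ by Horner's rule and lists the evaluation points at which two vectors collide. The helper names and the tiny parameters are assumptions for the example, and 0-based Python indexing replaces the 1-based indexing of the text.

```python
def poly_vector_hash(a, x, p):
    """h_a(x) = sum(x[i] * a**i) mod p, evaluated by Horner's rule in O(d)."""
    acc = 0
    for xi in reversed(x):
        acc = (acc * a + xi) % p
    return acc


def colliding_points(x, y, p):
    """All a in Z_p with h_a(x) == h_a(y), i.e. the roots of the polynomial x - y."""
    return [
        a for a in range(p)
        if poly_vector_hash(a, x, p) == poly_vector_hash(a, y, p)
    ]
```

For instance, with $p=7$ the vectors $(6,0,1)$ and $(0,0,0)$ differ by the polynomial $a^2+6 = a^2-1$, so `colliding_points((6, 0, 1), (0, 0, 0), 7)` returns `[1, 6]`: two roots, which is within the bound of $d=3$.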
@@ -486,7 +500,7 @@ $$\eqalign{
 H_{j+1} = h(\sigma[j+1],\ldots,\sigma[j+d]) &= \sigma[j+1]a^{d-1} + \sigma[j+2]a^{d-2} + \ldots + \sigma[j+d]a^0, \cr
 }$$
 We can observe that $H_{j+1} = aH_j - \sigma[j]a^d + \sigma[j+d]$. (Everything calculated
-in the field~$\Z_p$.)
+in the field~$\Zp$.)
 \subsection{Hashing strings}
......
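The recurrence $H_{j+1} = aH_j - \sigma[j]a^d + \sigma[j+d]$ from the last hunk above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the sequence is assumed to consist of integers (for example a `bytes` object), and indexing is 0-based.

```python
def window_hash(window, a, p):
    """Hash of one window: window[0]*a^(d-1) + ... + window[d-1]*a^0 in Z_p."""
    acc = 0
    for s in window:
        acc = (acc * a + s) % p
    return acc


def rolling_hashes(sigma, d, a, p):
    """Hashes of all length-d windows of sigma, constant time per shift."""
    if len(sigma) < d:
        return []
    a_d = pow(a, d, p)  # a^d mod p, computed once
    h = window_hash(sigma[:d], a, p)
    hashes = [h]
    for j in range(len(sigma) - d):
        # Drop sigma[j] (its weight after multiplying by a is a^d), append sigma[j+d].
        h = (a * h - sigma[j] * a_d + sigma[j + d]) % p
        hashes.append(h)
    return hashes
```

Each shift costs a constant number of operations in~$\Zp$, compared with $\Theta(d)$ for rehashing the window from scratch.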
@@ -178,10 +178,11 @@
 \def\sk#1{{\bf s}^{#1}}
 \def\ck#1{{\bf c}^{#1}}
 \def\ek#1{{\bf e}^{#1}}
+\def\a{{\bf a}}
+\def\t{{\bf t}}
 \def\x{{\bf x}}
 \def\y{{\bf y}}
 \def\z{{\bf z}}
-\def\t{{\bf t}}
 \def\OO{{\bf\Omega}}
 % Matrix transpose
......