Hashing: Proof for the scalar product family

Commit 8d2da197 authored 6 years ago by Martin Mareš
--- a/06-hash/hash.tex
+++ b/06-hash/hash.tex
@@ -21,7 +21,7 @@
 We say that the family is \em{$c$-universal} for some $c>0$ if
 for every pair $x,y\in {\cal U}$ of distinct items we have
 $$
-	\Pr_{h\in{\cal H}} [h(x) = h(y)] \le {c\over m}.
+	\Prsub{h\in{\cal H}} [h(x) = h(y)] \le {c\over m}.
 $$
 In other words, if we pick a~hash function~$h$ uniformly at random from~$\cal H$,
 the probability that $x$ and~$y$ collide is at most $c$-times more than for
@@ -90,7 +90,7 @@ The family is \em{$(k,c)$-independent} for integer $k\ge 1$ and real $c>0$ iff
 for every $k$-tuple $x_1,\ldots,x_k$ of distinct items of~$\cal U$
 and every $k$-tuple $a_1,\ldots,a_k$ of buckets in~$[m]$, we have
 $$
-	\Pr_{h\in{\cal H}} [h(x_1) = a_1 \land \ldots \land h(x_k) = a_k] \le {c\over m^k}.
+	\Prsub{h\in{\cal H}} [h(x_1) = a_1 \land \ldots \land h(x_k) = a_k] \le {c\over m^k}.
 $$
 That is, if we pick a~hash function~$h$ uniformly at random from~$\cal H$,
 the probability that the given items are mapped to the given buckets is
@@ -193,7 +193,7 @@ is $2c$-universal and $(2,4)$-independent.
 \proof
 Consider universality first. For two given items $x_1\ne x_2$, we should show that
-$\Pr_{h\in{\cal H}}[h(x_1) = h(x_2)] \le 2c/m$. The event $h(x_1) = h(x_2)$ can be
+$\Prsub{h\in{\cal H}}[h(x_1) = h(x_2)] \le 2c/m$. The event $h(x_1) = h(x_2)$ can be
 written as a~union of disjoint events $h(x_1)=i_1 \land h(x_2)=i_2$ over all
 pairs $(i_1,i_2)$ such that $i_1$ is congruent to~$i_2$ modulo~$m$. So we have
 $$
@@ -208,7 +208,7 @@ needed.
 For 2-independence, we proceed in a~similar way. We are given two items $x_1\ne x_2$
 and two buckets $j_1,j_2\in [m]$. We are bounding
 $$
-	\Pr_h[h(x_1) \bmod m = j_1 \land h(x_2) \bmod m = j_2] =
+	\Prsub{h}[h(x_1) \bmod m = j_1 \land h(x_2) \bmod m = j_2] =
 	\sum_{i_1\equiv j_1\atop i_2\equiv j_2} \Pr[h(x_1) = i_1 \land h(x_2) = i_2].
 $$
 Again, each term of the sum is at most $c/r^2$. There are at most $\lceil r/m\rceil \le (r+m-1)/m$
@@ -255,8 +255,8 @@ is $(2,c')$-independent for $c' = (cm/r+1)d$.
 \proof
 Given distinct $x_1, x_2\in {\cal U}$ and $i_1,i_2\in [m]$, we should bound
 $$
-	\Pr_{h\in{\cal H}} [h(x_1) = i_1 \land h(x_2) = i_2] =
+	\Prsub{h\in{\cal H}} [h(x_1) = i_1 \land h(x_2) = i_2] =
-	\Pr_{f\in{\cal F}, g\in{\cal G}} \; [g(f(x_1)) = i_1 \land g(f(x_2)) = i_2].
+	\Prsub{f\in{\cal F}, g\in{\cal G}} \; [g(f(x_1)) = i_1 \land g(f(x_2)) = i_2].
 $$
 It is tempting to apply 2-independence of~$\cal G$ on the intermediate results $f(x_1)$
 and $f(x_2)$, but unfortunately we cannot be sure that they are distinct. Fortunately,
@@ -294,14 +294,14 @@ so $(1 + cm/r)d \le (1+c)d$.
 So far, we constructed $k$-independent families only for $k=2$. Families with
 higher independence can be obtained from polynomials of degree less than~$k$ over a~field.
-\defn{For any field $\Z_p$ and any $k\ge 1$, we define the family of polynomial
+\defn{For any field $\Zp$ and any $k\ge 1$, we define the family of polynomial
-hash functions ${\cal P}_k = \{ h_{\bf a} \mid {\bf a} \in \Z_p^k \}$ from $\Z_p$ to~$\Z_p$,
+hash functions ${\cal P}_k = \{ h_\a \mid \a \in \Zp^k \}$ from $\Zp$ to~$\Zp$,
-where $h_{\bf a}(x) = \sum_{i=0}^{k-1} a_ix^i$.}
+where $h_\a(x) = \sum_{i=0}^{k-1} a_ix^i$.}
 \lemma{The family ${\cal P}_k$ is $(k,1)$-independent.}
 \proof
-Let $x_1,\ldots,x_k\in\Z_p$ be distinct items and $a_1,\ldots,a_n\in Z_p$ buckets.
+Let $x_1,\ldots,x_k\in\Zp$ be distinct items and $a_1,\ldots,a_k\in\Zp$ buckets.
 By standard results on polynomials, there is exactly one polynomial~$h$ of degree less than~$k$
 such that $h(x_i) = a_i$ for every~$i$. Hence the probability that a~random polynomial
 of degree less than~$k$ satisfies this property is $1/p^k$.
@@ -398,38 +398,52 @@ a~random vector.
 \defn{For a~prime~$p$ and vector size $d\ge 1$, we define the family of
 scalar product hash functions
-${\cal S} = \{ h_{\bf a} \mid {\bf a} \in \Z_p^d \}$ from~$\Z_p^d$ to~$\Z_p$, where
+${\cal S} = \{ h_\a \mid \a \in \Zp^d \}$ from~$\Zp^d$ to~$\Zp$, where
-$h_{\bf a}({\bf x}) = {\bf a} \cdot {\bf x}$.
+$h_\a(\x) = \a \cdot \x$.
 }
 \theorem{The family $\cal S$ is 1-universal. A~function can be picked at random
 from~$\cal S$ in time $\Theta(d)$ and evaluated in the same time.}
 \proof
-TODO
+Consider two distinct vectors $\x, \y \in \Zp^d$. Let $i$ be a~coordinate
+for which $\x_i \ne \y_i$. As the scalar product does not depend on the order
+of the components, we can renumber the components so that $i=d$.
+For a~random choice of the parameter~$\bf a$, we have (in~$\Zp$):
+$$\eqalign{
+&\Prsub{\a\in\Zp^d} [ h_\a(\x) = h_\a(\y) ] =
+\Pr [ \x\cdot\a = \y\cdot\a ] =
+\Pr [ (\x-\y)\cdot\a = 0 ] = \cr
+&= \Pr \left[ \sum_{i=1}^d (\x_i-\y_i)\a_i = 0 \right] =
+\Pr \left[ (\x_d-\y_d)\a_d = -\sum_{i=1}^{d-1} (\x_i-\y_i)\a_i \right]. \cr
+}$$
+For every choice of $\a_1,\ldots,\a_{d-1}$, there exists exactly one
+value of~$\a_d$ for which the last equality holds, because $\x_d-\y_d$
+is non-zero and hence invertible in the field~$\Zp$. Therefore the equality
+holds with probability $1/p$.
 \qed
 As usual, we can reduce the result modulo~$m<p$. By Lemma~\xx{M}, the family
-${\cal S}\bmod m$ from~$\Z_p^k$ to $[m]$ is 2-universal.
+${\cal S}\bmod m$ from~$\Zp^d$ to $[m]$ is 2-universal.
 To obtain 2-independence, we simply compose ${\cal S}$ with the $(2,4)$-independent
 family~${\cal L}'$. By Lemma~\xx{G}, the result will be a~$(2,8)$-independent family,
 or even $(2,5)$-independent if $p\ge 4m$.
 The compound hash functions can be written as
-$(\alpha({\bf a}\cdot {\bf x}) + \beta) \bmod m$, where
+$(\alpha(\a\cdot \x) + \beta) \bmod m$, where
-${\bf a}$ is a~vector parameter, and $\alpha$ and~$\beta$ are scalar parameters.
+$\a$ is a~vector parameter, and $\alpha$ and~$\beta$ are scalar parameters.
-However, $\alpha({\bf a} \cdot {\bf x})$ can be written as ${\bf a}' \cdot {\bf x}$
+However, $\alpha(\a \cdot \x)$ can be written as $\a' \cdot \x$
 for some vector $\bf a'$ and if $\bf a$ and $\alpha$ were uniformly
 distributed, so is~$\bf a'$. So we can define the compound family in a~more
 compact way:
 \defn{For a~prime~$p$, vector size $d\ge 1$, and the number of buckets~$m$,
 we define the family of scalar product hash functions
-${\cal S}' = \{ h_{{\bf a},\beta} \mid {\bf a}\in \Z_p^d, \beta\in\Z_p \}$
+${\cal S}' = \{ h_{\a,\beta} \mid \a\in \Zp^d, \beta\in\Zp \}$
-from~$\Z_p^d$ to $[m]$, where
+from~$\Zp^d$ to $[m]$, where
-$h_{{\bf a},\beta}(x) = ({\bf a}\cdot {\bf x} + \beta) \bmod m$.
+$h_{\a,\beta}(\x) = (\a\cdot \x + \beta) \bmod m$.
-(The operations in parentheses are performed in the field~$\Z_p$.)
+(The operations in parentheses are performed in the field~$\Zp$.)
 }
 \theorem{If $p\ge 4m$, the family ${\cal S}'$ is $(2,5)$-independent.
@@ -444,8 +458,8 @@ the polynomial is evaluated is chosen randomly.
 \defn{
 For a~prime~$p$ and vector size~$d$, we define the family of polynomial hash functions
-${\cal R} = \{ h_a \mid a\in\Z_p \}$ from $\Z_p^d$ to~$\Z_p$, where
+${\cal R} = \{ h_a \mid a\in\Zp \}$ from $\Zp^d$ to~$\Zp$, where
-$h_a({\bf x}) = \sum_{i=0}^{d-1} {\bf x}_{i+1} \cdot a^i$.
+$h_a(\x) = \sum_{i=0}^{d-1} \x_{i+1} \cdot a^i$.
 }
 \theorem{The family~$\cal R$ is $d$-universal.
@@ -453,10 +467,10 @@ A~function can be picked from~$\cal R$ at random in constant time
 and evaluated on a~given vector in $\Theta(d)$ time.}
 \proof
-Consider two vectors ${\bf x} \ne {\bf y}$ and a~hash function~$h_a$ chosen at random
+Consider two vectors $\x \ne \y$ and a~hash function~$h_a$ chosen at random
-from~$\cal R$. A~collision happens whenever $\sum_i {\bf x}_{i+1} a^i = \sum_i {\bf y}_{i+1}
+from~$\cal R$. A~collision happens whenever $\sum_i \x_{i+1} a^i = \sum_i \y_{i+1}
-a^i$. This is the same condition as $\sum_i ({\bf x}-{\bf y})_{i+1} a^i = 0$, that is if
+a^i$. This is the same condition as $\sum_i (\x-\y)_{i+1} a^i = 0$, that is if
-the number~$a$ is a~root of the polynomial ${\bf x} - {\bf y}$. Since a~polynomial
+the number~$a$ is a~root of the polynomial $\x - \y$. Since a~polynomial
 of degree at most~$d$ can have at most~$d$ roots (unless it is identically zero),
 the probability that $a$~is a~root is at most $d/p$. This implies $d$-universality.
 \qed
@@ -486,7 +500,7 @@ $$\eqalign{
 	H_{j+1} = h(\sigma[j+1],\ldots,\sigma[j+d]) &= \sigma[j+1]a^{d-1} + \sigma[j+2]a^{d-2} + \ldots + \sigma[j+d]a^0, \cr
 }$$
 We can observe that $H_{j+1} = aH_j - \sigma[j]a^d + \sigma[j+d]$. (Everything calculated
-in the field~$\Z_p$.)
+in the field~$\Zp$.)
 \subsection{Hashing strings}
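
The rolling-hash recurrence $H_{j+1} = aH_j - \sigma[j]a^d + \sigma[j+d]$ shown in the last hunk can be illustrated by a short Python sketch. It is not part of the commit: the function names `poly_hash` and `rolling_hashes` are hypothetical, and indices are 0-based here, so window $j$ covers `sigma[j]` through `sigma[j+d-1]`.

```python
def poly_hash(window, a, p):
    """Evaluate window[0]*a^(d-1) + ... + window[d-1]*a^0 in Z_p (Horner's rule)."""
    h = 0
    for s in window:
        h = (h * a + s) % p
    return h


def rolling_hashes(sigma, d, a, p):
    """Hashes of all length-d windows of sigma; each update takes O(1) time."""
    if len(sigma) < d:
        return []
    a_d = pow(a, d, p)                 # a^d mod p, computed once
    h = poly_hash(sigma[:d], a, p)     # H_0, computed directly in O(d)
    hashes = [h]
    for j in range(len(sigma) - d):
        # H_{j+1} = a*H_j - sigma[j]*a^d + sigma[j+d], everything in Z_p
        h = (a * h - sigma[j] * a_d + sigma[j + d]) % p
        hashes.append(h)
    return hashes
```

Python's `%` returns a non-negative result for a positive modulus, so the subtraction needs no extra correction step.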

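The counting argument in the new proof for the scalar product family ${\cal S}$ can be checked by brute force for tiny parameters. The following Python sketch is an illustration only and is not part of the commit; the names `scalar_hash`, `collision_count`, and `check_one_universal` are hypothetical. It enumerates every parameter vector $\a$ and confirms that each pair of distinct vectors collides for exactly $p^{d-1}$ of the $p^d$ choices, which is probability $1/p$.

```python
from itertools import product


def scalar_hash(a, x, p):
    """h_a(x) = a . x computed in Z_p."""
    return sum(ai * xi for ai, xi in zip(a, x)) % p


def collision_count(x, y, p):
    """Number of parameter vectors a in Z_p^d with h_a(x) == h_a(y)."""
    d = len(x)
    return sum(
        1
        for a in product(range(p), repeat=d)
        if scalar_hash(a, x, p) == scalar_hash(a, y, p)
    )


def check_one_universal(p, d):
    """True iff every pair of distinct vectors collides for exactly p^(d-1) choices of a."""
    vectors = list(product(range(p), repeat=d))
    return all(
        collision_count(x, y, p) == p ** (d - 1)
        for x in vectors
        for y in vectors
        if x != y
    )
```

The check is exponential in $d$, so it is only meant for values such as `check_one_universal(3, 2)`. With a composite modulus such as 4 the check fails, which matches the proof's reliance on $\x_d-\y_d$ being invertible in a field.
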
--- a/tex/adsmac.tex
+++ b/tex/adsmac.tex
@@ -178,10 +178,11 @@
 \def\sk#1{{\bf s}^{#1}}
 \def\ck#1{{\bf c}^{#1}}
 \def\ek#1{{\bf e}^{#1}}
+\def\a{{\bf a}}
+\def\t{{\bf t}}
 \def\x{{\bf x}}
 \def\y{{\bf y}}
 \def\z{{\bf z}}
-\def\t{{\bf t}}
 \def\OO{{\bf\Omega}}
 % Matrix transposition