Universal hashing
In mathematics and computing universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.
Introduction
Assume we want to map keys from some universe 
 into 
 bins (labelled 
). The algorithm will have to handle some data set 
 of 
 keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from 
 that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if the size of 
 is greater than 
, since the adversary may choose 
 to be precisely the preimage of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.
The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions 
 is called a universal family if, 
.
In other words, any two keys of the universe collide with probability at most 
 when the hash function 
 is drawn randomly from 
. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key. Sometimes, the definition is relaxed to allow collision probability 
. This concept was introduced by Carter and Wegman[1] in 1977, and has found numerous applications in computer science (see, for example [2]). If we have an upper bound of 
 on the collision probability, we say that we have 
-almost universality.
Many, but not all, universal families have the following stronger uniform difference property:
-  
, when 
 is drawn randomly from the family 
, the difference 
 is uniformly distributed in 
. 
Note that the definition of universality is only concerned with whether 
, which counts collisions. The uniform difference property is stronger.
(Similarly, a universal family can be XOR universal if 
, the value 
 is uniformly distributed in 
 where 
 is the bitwise exclusive or operation. This is only possible if 
 is a power of two.)
An even stronger condition is pairwise independence: we have this property when  
 we have the probability that 
 will hash to any pair of hash values 
 is as if they were perfectly random: 
. Pairwise independence is sometimes called strong universality.
Another property is uniformity. We say that a family is uniform if all hash values are equally likely: 
 for any hash value 
. Universality does not imply uniformity. However, strong  universality does imply uniformity.
Given a family with the uniform distance property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in 
 to the hash functions. (Similarly, if 
 is a power of two, we can achieve pairwise independence from an XOR universal hash family by doing an exclusive or with a uniformly distributed random constant.) Since a shift by a constant is sometimes irrelevant in applications (e.g. hash tables), a careful distinction between the uniform distance property and pairwise independent is sometimes not made.[3]
For some applications (such as hash tables), it is important for the least significant bits of the hash values to be also universal. When a family is strongly universal, this is guaranteed: if 
 is a strongly universal family with 
, then the family made of the functions 
 for all 
 is also strongly universal for 
. Unfortunately, the same is not true of (merely) universal families. For example, the family made of the identity function 
 is clearly universal, but the family made of the function 
 fails to be universal.
UMAC and Poly1305-AES and several other message authentication code algorithms are based on universal hashing.[4][5] In such applications, the software chooses a new hash function for every message, based on a unique nonce for that message.
Several hash table implementations are based on universal hashing. In such applications, typically the software chooses a new hash function only after it notices that "too many" keys have collided; until then, the same hash function continues to be used over and over. (Some collision resolution schemes, such as dynamic perfect hashing, pick a new hash function every time there is a collision. Other collision resolution schemes, such as cuckoo hashing and 2-choice hashing, allow a number of collisions before picking a new hash function).
Mathematical guarantees
For any fixed set 
 of 
 keys, using a universal family guarantees the following properties.
-  For any fixed 
 in 
, the expected number of keys in the bin 
 is 
. When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the key 
 (for example a query, insertion or deletion). -  The expected number of pairs of keys 
 in 
 with 
 that collide (
) is bounded above by 
, which is of order 
. When the number of bins, 
, is 
, the expected number of collisions is 
. When hashing into 
 bins, there are no collisions at all with probability at least a half. -  The expected number of keys in bins with at least 
 keys in them is bounded above by 
.[6] Thus, if the capacity of each bin is capped to three times the average size (
), the total number of keys in overflowing bins is at most 
. This only holds with a hash family whose collision probability is bounded above by 
. If a weaker definition is used, bounding it by 
, this result is no longer true.[6] 
As the above guarantees hold for any fixed set 
, they hold if the data set is chosen by an adversary. However, the adversary has to make this choice before (or independent of) the algorithm's random choice of a hash function. If the adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the same as deterministic hashing.
The second and third guarantee are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some 
 number of collisions. If it observes too many collisions, it chooses another random 
 from the family and repeats. Universality guarantees that the number of repetitions is a geometric random variable.
Constructions
Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words ("integers"); fixed-length vectors of machine words; and variable-length vectors ("strings").
Hashing integers
This section refers to the case of hashing integers that fit in machines words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be 
.
The original proposal of Carter and Wegman[1] was to pick a prime 
 and define
where 
 are randomly chosen integers modulo 
 with 
.  (This is a single iteration of a linear congruential generator.)
To see that 
 is a universal family, note that 
 only holds when
for some integer 
 between 
 and 
. If 
, their difference, 
 is nonzero and has an inverse modulo 
. Solving for 
 yields
-  
. 
There are 
 possible choices for 
 (since 
 is excluded) and, varying 
 in the allowed range, 
 possible non-zero values for the right hand side. Thus the collision probability is
-  
. 
Another way to see 
 is a universal family is via the notion of statistical distance. Write the difference 
 as
-  
. 
Since 
 is nonzero and 
 is uniformly distributed in 
, it follows that 
 modulo 
 is also uniformly distributed in 
. The distribution of 
 is thus almost uniform, up to a difference in probability of 
 between the samples. As a result, the statistical distance to a uniform family is 
, which becomes negligible when 
.
The family of simpler hash functions
is only approximately universal: 
 for all 
.[1]  Moreover, this analysis is nearly tight; Carter and Wegman [1] show that 
 whenever 
.
Avoiding modular arithmetic
The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997.[7] By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least a factor of four[8]). The scheme assumes the number of bins is a power of two, 
. Let 
 be the number of bits in a machine word. Then the hash functions are parametrised over odd positive integers 
 (that fit in a word of 
 bits). To evaluate 
, multiply 
 by 
 modulo 
 and then keep the high order 
 bits as the hash code. In mathematical notation, this is
and it can be implemented in C-like programming languages by
-  
 (unsigned) (a*x) >> (w-M) 
This scheme does not satisfy the uniform difference property and is only 
-almost-universal; for any 
, 
.
To understand the behavior of the hash function, 
notice that, if 
 and 
 have the same highest-order 'M' bits, then 
 has either all 1's or all 0's as its highest order M bits (depending on whether 
 or 
 is larger.
Assume that the least significant set bit of 
 appears on position 
. Since 
 is a random odd integer and odd integers have inverses in the ring 
, it follows that 
 will be uniformly distributed among 
-bit integers with the least significant set bit on position 
.  The probability that these bits are all 0's or all 1's is therefore at most 
.
On the other hand, if 
, then higher-order M bits of 
 contain both 0's and 1's, so 
it is certain that 
.  Finally, if 
 then bit 
 of 
 is 1 and 
 if and only if bits 
 are also 1, which happens with probability 
.
This analysis is tight, as can be shown with the example 
 and 
.  To obtain a truly 'universal' hash function, one can use the multiply-add-shift scheme
which can be implemented in C-like programming languages by
-  
 (unsigned) (a*x+b) >> (w-M) 
where 
 is a random odd positive integer with 
 and 
 is a random non-negative integer with 
.  With these choices of 
 and 
, 
 for all 
.[9] This differs slightly but importantly from the mistranslation in the English paper.[10]
Hashing vectors
This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector 
 of 
 machine words (integers of 
 bits each). If 
 is a universal family with the uniform difference property, the following family (dating back to Carter and Wegman[1]) also has the uniform difference property (and hence is universal):
-  
, where each 
 is chosen independently at random. 
If 
 is a power of two, one may replace summation by exclusive or.[11]
In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family of.[12] Initialize the hash function with a vector 
 of random odd integers on 
 bits each. Then if the number of bins is 
 for 
:
-  
. 
It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.[11] Initialize the hash function with a vector 
 of random odd integers on 
 bits each. The following hash family is universal:[13]
-  
. 
If double-precision operations are not available, one can interpret the input as a vector of half-words (
-bit integers). The algorithm will then use 
 multiplications, where 
 was the number of half-words in the vector. Thus, the algorithm runs at a "rate" of one multiplication per word of input.
The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.[14]
Strong universality at high speed is also possible.[15] Initialize the hash function with a vector 
 of random  integers on 
 bits. Compute
-  
. 
The result is strongly universal on 
 bits. Experimentally, it was found to run at  0.2 CPU cycle per byte on recent Intel processors for 
.
Hashing strings
This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate 
 is just the length of 
. As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality[11]). Note that if zeroes are allowed in the string, then it might be best to append a fictitious non-zero (e.g., 1) character to all strings prior to padding: this will ensure that universality is not affected.[15]
Now assume we want to hash 
, where a good bound on 
 is not known a priori. A universal family proposed by [12] 
treats the string 
 as the coefficients of a polynomial modulo a large prime. If 
, let 
 be a prime and define:
, where 
 is uniformly random and 
 is chosen randomly from a universal family mapping integer domain 
.
Using properties of modular arithmetic, above can be computed without producing large numbers for large strings as follows:[16]
uint hash(String x, int a, int p)
	uint h = INITIAL_VALUE
	for (uint i=0 ; i < x.length ; ++i)
		h = ((h*a) + x[i]) mod p
	return h
Above algorithm is also known as Multiplicative hash function.[17] In practice, the mod operator and the parameter p can be avoided altogether by simply allowing integer to overflow because it is equivalent to mod (Max-Int-Value + 1) in many programming languages. Below table shows values chosen to initialize h and a for some of the popular implementations.
| Implementation | INITIAL_VALUE | a | 
|---|---|---|
| Bernstein's hash function djb2[18] | 5381 | 33 | 
| STLPort 4.6.2 | 0 | 5 | 
| Kernighan and Ritchie's hash function[19] | 0 | 31 | 
java.lang.String.hashCode()[20]  | 
0 | 31 | 
Consider two strings 
 and let 
 be length of the longer one; for the analysis, the shorter string is conceptually padded with zeros up to length 
. A collision before applying 
 implies that 
 is a root of the polynomial with coefficients 
. This polynomial has at most 
 roots modulo 
, so the collision probability is at most 
. The probability of collision through the random 
 brings the total collision probability to 
. Thus, if the prime 
 is sufficiently large compared to the length of strings hashed, the family is very close to universal (in statistical distance).
Avoiding modular arithmetic
To mitigate the computational penalty of modular arithmetic, two tricks are used in practice:[11]
-  One chooses the prime 
 to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulo 
 to be implemented without division (using faster operations like addition and shifts). For instance, on modern architectures one can work with 
, while 
's are 32-bit values. -  One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the 
 results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing. -  One chooses a power-of-two as the divisor, allowing arithmetic modulo 
 to be implemented without division (using faster operations of bit masking). The NH hash-function family takes this approach. 
See also
- K-independent hashing
 - Rolling hashing
 - Tabulation hashing
 - Min-wise independence
 - Universal one-way hash function
 - Low-discrepancy sequence
 - Perfect hashing
 
References
- 1 2 3 4 5 Carter, Larry; Wegman, Mark N. (1979). "Universal Classes of Hash Functions". Journal of Computer and System Sciences 18 (2): 143–154. doi:10.1016/0022-0000(79)90044-8. Conference version in STOC'77.
 - ↑ Miltersen, Peter Bro. "Universal Hashing". Archived from the original (PDF) on 24 June 2009.
 - ↑ Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.
 - ↑ David Wagner, ed. "Advances in Cryptology - CRYPTO 2008". p. 145.
 - ↑ Jean-Philippe Aumasson, Willi Meier, Raphael Phan, Luca Henzen. "The Hash Function BLAKE". 2014. p. 10.
 - 1 2 Baran, Ilya; Demaine, Erik D.; Pătraşcu, Mihai (2008). "Subquadratic Algorithms for 3SUM" (PDF). Algorithmica 50 (4): 584–596. doi:10.1007/s00453-007-9036-3.
 - ↑ Dietzfelbinger, Martin; Hagerup, Torben; Katajainen, Jyrki; Penttonen, Martti (1997). "A Reliable Randomized Algorithm for the Closest-Pair Problem" (Postscript). Journal of Algorithms 25 (1): 19–51. doi:10.1006/jagm.1997.0873. Retrieved 10 February 2011.
 - ↑ Thorup, Mikkel. "Text-book algorithms at SODA".
 - ↑ Woelfel, Philipp (2003). Über die Komplexität der Multiplikation in eingeschränkten Branchingprogrammmodellen (PDF) (Ph.D.). Universität Dortmund. Retrieved 18 September 2012.
 - ↑ Woelfel, Philipp (1999). Efficient Strongly Universal and Optimally Universal Hashing (PDF). Mathematical Foundations of Computer Science 1999. LNCS. pp. 262–272. doi:10.1007/3-540-48340-3_24. Retrieved 17 May 2011.
 - 1 2 3 4 Thorup, Mikkel (2009). String hashing for linear probing. Proc. 20th ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 655–664. doi:10.1137/1.9781611973068.72. Archived (PDF) from the original on 2013-10-12., section 5.3
 - 1 2 Dietzfelbinger, Martin; Gil, Joseph; Matias, Yossi; Pippenger, Nicholas (1992). Polynomial Hash Functions Are Reliable (Extended Abstract). Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP). pp. 235–246.
 - ↑ Black, J.; Halevi, S.; Krawczyk, H.; Krovetz, T. (1999). UMAC: Fast and Secure Message Authentication (PDF). Advances in Cryptology (CRYPTO '99)., Equation 1
 - ↑ Pătraşcu, Mihai; Thorup, Mikkel (2011). The power of simple tabulation hashing. Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11). pp. 1–10. arXiv:1011.5200. doi:10.1145/1993636.1993638.
 - 1 2 Kaser, Owen; Lemire, Daniel (2013). "Strongly universal string hashing is fast". Computer Journal (Oxford University Press). arXiv:1202.4961. doi:10.1093/comjnl/bxt070.
 - ↑ "Hebrew University Course Slides" (PDF).
 - ↑ Kankowsk, Peter. "Hash functions: An empirical comparison".
 - ↑ Yigit, Ozan. "String hash functions".
 - ↑ Kernighan; Ritchie (1988). "6". The C Programming Language (2nd ed.). p. 118. ISBN 0-13-110362-8.
 - ↑ "String (Java Platform SE 6)". docs.oracle.com. Retrieved 2015-06-10.
 
Further reading
- Knuth, Donald Ervin (1998). The Art of Computer Programming, Vol. III: Sorting and Searching (3rd ed.). Reading, Mass; London: Addison-Wesley. ISBN 0-201-89685-0.
 




