HyperLogLog

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.[1]

Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of >10^9 with a typical error rate of 2%, using 1.5kB of memory.[1] HyperLogLog is an extension of the earlier LogLog algorithm.[2]

Algorithm

The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2^n.[1]

In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset, to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the algorithm above.

The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set.

References

  1. 1 2 3 Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" (PDF). AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.
  2. Durand, M.; Flajolet, P. (2003). "LogLog counting of large cardinalities." (PDF). In G. Di Battista and U. Zwick. Lecture Notes in Computer Science. Annual European Symposium on Algorithms (ESA03). Springer. pp. 605–617.
This article is issued from Wikipedia - version of the Thursday, March 10, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.