Information distance

Information distance is the distance between two finite objects (represented as computer files) expressed as the number of bits in the shortest program which transforms one object into the other one or vice versa on a universal computer. This is an extension of Kolmogorov complexity.^[1] The Kolmogorov complexity of a single finite object is the information in that object; the information distance between a pair of finite objects is the minimum information required to go from one object to the other or vice versa. Information distance was first defined and investigated in ^[2] based on thermodynamic principles, see also.^[3] Subsequently it achieved final form in.^[4] It is applied in the normalized compression distance and the normalized Google distance.

Properties

Formally the information distance $ID(x,y)$ between $x$ and $y$ is defined by

ID(x,y) = \min \{|p|: p(x)=y \; \& \;p(y) =x \},

with $p$ a finite binary program for the fixed universal computer with as inputs finite binary strings $x,y$ . In ^[4] it is proven that $ID(x,y) = E(x,y)+O(\log \cdot \max \{K(x\mid y), K(y\mid x)\} )$ with

E(x,y) = \max \{K(x\mid y), K(y\mid x)\},

where $K(\cdot \mid \cdot)$ is the Kolmogorov complexity defined by ^[1] of the prefix type.^[5] This $E(x,y)$ is the important quantity.

Universality

Let $\Delta$ be the class of upper semicomputable distances $D(x,y)$ that satisfy the density condition

\sum_{x:x \neq y} 2^{-D(x,y)} \leq 1 , \; \sum_{y:y \neq x} 2^{-D(x,y)} \leq 1,

This excludes irrelevant distances such as $D(x,y)= \frac{1}{2}$ for $x\neq y$ ; it takes care that if the distance growth then the number of objects within that distance of a geven object grows. If $D \in \Delta$ then $E(x,y) \leq D(x,y)$ up to a constant additive term.^[4]

Metricity

The distance $E(x,y)$ is a metric up to an additive $O(\log .\max \{K(x\mid y), K(y\mid x)\} )$ term in the metric (in)equalities.^[4]

Maximum overlap

If $E(x,y) = K(x\mid y)$ , then there is a program $p$ of length $K(x\mid y)$ that converts $y$ to $x$ , and a program $q$ of length $K(y\mid x)-K(x\mid y)$ such that the program $qp$ converts $x$ to $y$ . (The programs are of the self-delimiting format which means that one can decide where one program ends and the other begins in concatenation of the programs.) That is, the shortest programs to convert between two objects can be made maximally overlapping: For $K(x\mid y) \leq K(y\mid x)$ it can be divided into a program that converts object $x$ to object $y$ , and another program which concatenated with the first converts $y$ to $x$ while the concatenation of these two programs is a shortest program to convert between these objects.^[4]

Minimum overlap

The programs to convert between objacts $x$ and $y$ can also be made minimal overlapping. There exists a program $p$ of length $K(x\mid y)$ up to an additive term of $O(\log (\max \{K(x\mid y), K(y\mid x)\}) )$ that maps $y$ to $x$ and has small complexity when $x$ is known ( $K(p\mid x)\approx 0$ ). Interchanging the two objects we have the other program^[6] Having in mind the parallelism between Shannon information theory and Kolmogorov complexity theory, one can say that this result is parallel to the Slepian-Wolf and Körner–Imre Csiszár–Marton theorems.

Applications

Theoretical

The result of An.A. Muchnik on minimum overlap above is an important theoretical application showing that certain codes exist: to go to finite target object from any object there is a program which almost only depends on the target object! This result is fairly precise and the error term cannot be significantly improved.^[7] Information distance was material in the textbook,^[8] it occurs in the Encyclopedia on Distances.^[9]

Practical

To determine the similarity of objects such as genomes, languages, music, internet attacks and worms, software programs, and so on, information distance is normalized and the Kolmogorov complexity terms approximated by real-world compressors (the Kolmogorov complexity is a lower bound to the length in bits of a compressed version of the object). The result is the normalized compression distance (NCD) between the objects. This pertains to objects given as computer files like the genome of a mouse or text of a book. If the objects are just given by name such as `Einstein' or `table' or the name of a book or the name `mouse', compression does not make sense. We need outside information about what the name means. Using a data base (such as the internet) and a means to search the database (such as a search engine like Google) provides this information. Every search engine on a data base that provides aggregate page counts can be used in the normalized Google distance (NGD).

References

Related literature

Arkhangel'skii, A. V.; Pontryagin, L. S. (1990), General Topology I: Basic Concepts and Constructions Dimension Theory, Encyclopaedia of Mathematical Sciences, Springer, ISBN 3-540-18178-4

This article is issued from Wikipedia - version of the Tuesday, January 06, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.