Divergence (statistics)

In statistics and information geometry, a divergence or contrast function is a function that measures the "distance" from one probability distribution to another on a statistical manifold. A divergence is a weaker notion than a distance: in particular, it need not be symmetric (in general, the divergence from p to q is not equal to the divergence from q to p), and it need not satisfy the triangle inequality.

Definition

Suppose S is a space of all probability distributions with common support. Then a divergence on S is a function D(· || ·): S×S → R satisfying ^[1]

D(p || q) ≥ 0 for all p, q ∈ S,
D(p || q) = 0 if and only if p = q.

The dual divergence D* is defined as

D^*(p \parallel q) = D(q \parallel p).
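The two conditions and the dual construction can be checked numerically on a small example. The following is a minimal sketch, assuming discrete distributions on a common finite support and using the Kullback–Leibler divergence as the divergence D; the helper names `kl_divergence` and `dual` are illustrative and do not come from the cited sources.

```python
import math

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence D(p || q) = sum_x p(x) ln(p(x)/q(x)).

    Assumption: p and q are sequences of strictly positive probabilities
    over the same finite support (the "common support" condition).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def dual(divergence):
    """Return the dual divergence D*(p || q) = D(q || p)."""
    return lambda p, q: divergence(q, p)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
kl_dual = dual(kl_divergence)

print("D(p || q)  =", kl_divergence(p, q))   # non-negative
print("D(q || p)  =", kl_divergence(q, p))   # differs from D(p || q)
print("D*(p || q) =", kl_dual(p, q))         # equals D(q || p)
print("D(p || p)  =", kl_divergence(p, p))   # zero on the diagonal
```

The asymmetry visible here is what makes the dual divergence a genuinely different function rather than a relabelling.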

Geometrical properties

Many properties of divergences can be derived if we restrict S to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system θ, so that for a distribution p ∈ S we can write p = p(θ).

For a pair of points p, q ∈ S with coordinates θ_p and θ_q, denote the partial derivatives of D(p || q) as

\begin{align} D((\partial_i)_p \parallel q) \ \ &\stackrel{\mathrm{def}}{=}\ \ \tfrac{\partial}{\partial\theta^i_p} D(p \parallel q), \\ D((\partial_i\partial_j)_p \parallel (\partial_k)_q) \ \ &\stackrel{\mathrm{def}}{=}\ \ \tfrac{\partial}{\partial\theta^i_p} \tfrac{\partial}{\partial\theta^j_p}\tfrac{\partial}{\partial\theta^k_q}D(p \parallel q), \ \ \mathrm{etc.} \end{align}

Now we restrict these functions to the diagonal p = q, and denote ^[2]

\begin{align} D[\partial_i\parallel\cdot]\ &:\ p \mapsto D((\partial_i)_p \parallel p), \\ D[\partial_i\parallel\partial_j]\ &:\ p \mapsto D((\partial_i)_p \parallel (\partial_j)_p),\ \ \mathrm{etc.} \end{align}

By definition, the function D(p || q) is minimized at p = q, and therefore

\begin{align} & D[\partial_i\parallel\cdot] = D[\cdot\parallel\partial_i] = 0, \\ & D[\partial_i\partial_j\parallel\cdot] = D[\cdot\parallel\partial_i\partial_j] = -D[\partial_i\parallel\partial_j] \ \equiv\ g_{ij}^{(D)}, \end{align}

where the matrix g^(D) is positive semi-definite and, provided it is positive definite, defines a unique Riemannian metric on the manifold S.
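These identities can be illustrated with finite differences. The sketch below assumes the one-parameter Bernoulli family with the success probability θ as coordinate and the Kullback–Leibler divergence as D; the function names and the step size `h` are choices made for this illustration only. For this family the Fisher information is 1/(θ(1 − θ)), so both estimates of g should agree with that value up to discretisation error.

```python
import math

def kl_bernoulli(tp, tq):
    """D(p || q) for Bernoulli distributions with success probabilities tp, tq."""
    return tp * math.log(tp / tq) + (1 - tp) * math.log((1 - tp) / (1 - tq))

def metric_mixed(D, theta, h=1e-4):
    """Estimate g = -D[d || d]: minus the mixed derivative in theta_p and
    theta_q, evaluated on the diagonal with a central difference."""
    mixed = (D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h * h)
    return -mixed

def metric_second(D, theta, h=1e-4):
    """Estimate g = D[dd || .]: the second derivative in theta_p alone,
    evaluated on the diagonal with a central difference."""
    return (D(theta + h, theta) - 2 * D(theta, theta) + D(theta - h, theta)) / (h * h)

theta = 0.3
g_mixed = metric_mixed(kl_bernoulli, theta)
g_second = metric_second(kl_bernoulli, theta)
g_fisher = 1.0 / (theta * (1 - theta))

print(g_mixed, g_second, g_fisher)
```

Central differences are used because the first-order terms vanish on the diagonal, which leaves the second-order term g as the leading contribution.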

The divergence D(· || ·) also defines a unique torsion-free affine connection ∇^(D) with coefficients

\Gamma_{ij,k}^{(D)} = -D[\partial_i\partial_j\parallel\partial_k],

and the dual to this connection ∇* is generated by the dual divergence D*.
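As a worked illustration (not taken from the cited sources), consider the one-parameter Bernoulli family with the success probability θ as coordinate, and let D be the Kullback–Leibler divergence. With a single coordinate the indices can be dropped, and the definitions above give:

```latex
\begin{align}
D(p \parallel q) &= \theta_p \ln\frac{\theta_p}{\theta_q} + (1-\theta_p)\ln\frac{1-\theta_p}{1-\theta_q}, \\
\tfrac{\partial}{\partial\theta_q} D(p \parallel q) &= -\frac{\theta_p}{\theta_q} + \frac{1-\theta_p}{1-\theta_q}, \\
g^{(D)}(\theta) &= -D[\partial\parallel\partial] = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}, \\
\Gamma^{(D)}(\theta) &= -D[\partial\partial\parallel\partial] = 0, \\
\Gamma^{(D^*)}(\theta) &= -D^*[\partial\partial\parallel\partial] = \frac{2\theta-1}{\theta^2(1-\theta)^2} = \tfrac{d}{d\theta}\, g^{(D)}(\theta).
\end{align}
```

The coefficient of ∇^(D) vanishes because ∂D/∂θ_q is linear in θ_p, so θ is an affine coordinate for this connection. The last equality is the one-dimensional form of the usual compatibility condition ∂_k g_ij = Γ_ki,j + Γ*_kj,i between a connection and its dual.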

Thus, a divergence D(· || ·) generates on a statistical manifold a unique dualistic structure (g^(D), ∇^(D), ∇^(D*)). The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced from some globally defined divergence function (which however need not be unique).^[3]

For example, when D is an f-divergence for some function f(·), it generates the metric g^(D_f) = c·g and the connection ∇^(D_f) = ∇^(α), where g is the canonical Fisher information metric, ∇^(α) is the α-connection, c = f′′(1), and α = 3 + 2f′′′(1)/f′′(1).
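The formulas for c and α can be sanity-checked on two generators. The following worked computation is a sketch under the convention D_f(p || q) = ∫ p f(q/p) dx: it takes f(u) = −ln u, which generates the Kullback–Leibler divergence, and the generator of Chernoff's α-divergence, written here with parameter a to avoid a clash with the α being computed.

```latex
\begin{align}
f(u) = -\ln u:\quad & f''(1) = 1,\quad f'''(1) = -2
  \;\Longrightarrow\; c = 1,\quad \alpha = 3 + 2\cdot\frac{-2}{1} = -1, \\
f_a(u) = \frac{4}{1-a^2}\Big(1 - u^{\frac{1+a}{2}}\Big):\quad
  & f_a''(u) = u^{\frac{a-3}{2}},\quad f_a'''(u) = \frac{a-3}{2}\,u^{\frac{a-5}{2}} \\
  & \Longrightarrow\; c = f_a''(1) = 1,\quad \alpha = 3 + 2\cdot\frac{a-3}{2} = a.
\end{align}
```

Under this convention the Kullback–Leibler divergence induces the (−1)-connection, and Chernoff's divergence with parameter a induces the a-connection with the Fisher metric itself (c = 1).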

Examples

The largest and most frequently used class of divergences is the class of so-called f-divergences; however, other types of divergence functions are also encountered in the literature.

f-divergences

Main article: f-divergence

This family of divergences is generated by functions f(u) that are convex on u > 0 and satisfy f(1) = 0. An f-divergence is then defined as

D_f(p\parallel q) = \int p(x)f\bigg(\frac{q(x)}{p(x)}\bigg) dx

Kullback–Leibler divergence:	$D_\mathrm{KL}(p \parallel q) = \int p(x)\ln\left( \frac{p(x)}{q(x)}\right) dx$
squared Hellinger distance:	$H^2(p,\, q) = 2 \int \Big( \sqrt{p(x)} - \sqrt{q(x)}\, \Big)^2 dx$
Jeffreys divergence:	$D_J(p \parallel q) = \int (p(x) - q(x))\big( \ln p(x) - \ln q(x) \big) dx$
Chernoff's α-divergence:	$D^{(\alpha)}(p \parallel q) = \frac{4}{1-\alpha^2}\bigg(1 - \int p(x)^\frac{1-\alpha}{2} q(x)^\frac{1+\alpha}{2} dx \bigg)$
exponential divergence:	$D_e(p \parallel q) = \int p(x)\big( \ln p(x) - \ln q(x) \big)^2 dx$
Kagan's divergence:	$D_{\chi^2}(p \parallel q) = \frac12 \int \frac{(p(x) - q(x))^2}{p(x)} dx$
(α,β)-product divergence:	$D_{\alpha,\beta}(p \parallel q) = \frac{2}{(1-\alpha)(1-\beta)} \int \Big(1 - \Big(\tfrac{q(x)}{p(x)}\Big)^{\!\!\frac{1-\alpha}{2}} \Big) \Big(1 - \Big(\tfrac{q(x)}{p(x)}\Big)^{\!\!\frac{1-\beta}{2}} \Big) p(x) dx$
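The way a single generator f produces each closed form can be illustrated in the discrete case, where the integral becomes a sum over a finite support. The sketch below is an assumption-laden illustration rather than a reference implementation: the function `f_divergence`, the dictionary keys and the example vectors are made up for this purpose, and the generators shown were obtained by matching p·f(q/p) against the formulas listed above.

```python
import math

def f_divergence(f, p, q):
    """Discrete analogue of D_f(p || q) = sum_x p(x) f(q(x)/p(x)).

    Assumption: p and q are strictly positive and share the same support.
    """
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

# Generators f(u), each with f(1) = 0.
generators = {
    "kl":         lambda u: -math.log(u),
    "hellinger2": lambda u: 2.0 * (1.0 - math.sqrt(u)) ** 2,
    "jeffreys":   lambda u: (u - 1.0) * math.log(u),
    "kagan":      lambda u: 0.5 * (1.0 - u) ** 2,
}

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

# The same divergences written out directly from their closed forms.
closed_forms = {
    "kl":         sum(a * math.log(a / b) for a, b in zip(p, q)),
    "hellinger2": 2.0 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)),
    "jeffreys":   sum((a - b) * (math.log(a) - math.log(b)) for a, b in zip(p, q)),
    "kagan":      0.5 * sum((a - b) ** 2 / a for a, b in zip(p, q)),
}

for name, f in generators.items():
    print(name, f_divergence(f, p, q), closed_forms[name])
```

Each pair of printed values should agree up to floating-point rounding, since multiplying f(q/p) by p reproduces the corresponding integrand term by term.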

References

Amari, Shun-ichi; Nagaoka, Hiroshi (2000). Methods of Information Geometry. Oxford University Press. ISBN 0-8218-0531-2.
Eguchi, Shinto (1985). "A differential geometric approach to statistical inference on the basis of contrast functionals". Hiroshima Mathematical Journal 15 (2): 341–391.
Eguchi, Shinto (1992). "Geometry of minimum contrast". Hiroshima Mathematical Journal 22 (3): 631–647.
Matumoto, Takao (1993). "Any statistical manifold has a contrast function: on the C³-functions taking the minimum at the diagonal of the product manifold". Hiroshima Mathematical Journal 23 (2): 327–332.

This article is issued from Wikipedia - version of the Thursday, May 21, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.