Representer theorem

In statistical learning theory, a representer theorem is any of several related results stating that a minimizer f^{*} of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set.

Formal Statement

The following Representer Theorem and its proof are due to Schölkopf, Herbrich, and Smola:

Theorem: Let \mathcal{X} be a nonempty set and k a positive-definite real-valued kernel on \mathcal{X} \times \mathcal{X} with corresponding reproducing kernel Hilbert space H_k. Suppose we are given a training sample (x_1, y_1), \dotsc, (x_n, y_n) \in \mathcal{X} \times \R, a strictly monotonically increasing real-valued function g \colon [0, \infty) \to \R, and an arbitrary empirical risk function E \colon (\mathcal{X} \times \R^2)^n \to \R \cup \lbrace \infty \rbrace. Then for any f^{*} \in H_k satisfying


 f^{*} = \operatorname{arg min}_{f \in H_k} \left\lbrace E\left( (x_1, y_1, f(x_1)), ..., (x_n, y_n, f(x_n)) \right) + g\left( \lVert f \rVert \right) \right \rbrace, \quad (*)

f^{*} admits a representation of the form:


 f^{*}(\cdot) = \sum_{i = 1}^n \alpha_i k(\cdot, x_i),

where \alpha_i \in \R for all 1 \le i \le n.

Proof: Define a mapping


\begin{align}
 \varphi \colon \mathcal{X} &\to \R^{\mathcal{X}} \\
\varphi(x) &= k(\cdot, x)
\end{align}

(so that \varphi(x) = k(\cdot, x) is itself a map \mathcal{X} \to \R). Since k is a reproducing kernel,


 \varphi(x)(x') = k(x', x) = \langle \varphi(x'), \varphi(x) \rangle,

where \langle \cdot, \cdot \rangle is the inner product on H_k.

Given any x_1, ..., x_n, one can use orthogonal projection to decompose any f \in H_k into a sum of two functions, one lying in \operatorname{span} \left \lbrace \varphi(x_1), ..., \varphi(x_n) \right \rbrace, and the other lying in the orthogonal complement:


 f = \sum_{i = 1}^n \alpha_i \varphi(x_i) + v,

where \langle v, \varphi(x_i) \rangle = 0 for all i.

The above orthogonal decomposition and the reproducing property together show that applying f to any training point x_j produces


 f(x_j) = \left \langle \sum_{i = 1}^n \alpha_i \varphi(x_i) + v, \varphi(x_j) \right \rangle = \sum_{i = 1}^n \alpha_i \langle \varphi(x_i), \varphi(x_j) \rangle,

which we observe is independent of v. Consequently, the value of the empirical risk E in (*) is likewise independent of v. For the second term (the regularization term), since v is orthogonal to \sum_{i = 1}^n \alpha_i \varphi(x_i) and g is strictly monotonic, we have


\begin{align}
 g\left( \lVert f \rVert \right) &= g \left(  \lVert \sum_{i = 1}^n \alpha_i \varphi(x_i) + v \rVert \right) \\
&= g \left( \sqrt{  \lVert \sum_{i = 1}^n \alpha_i \varphi(x_i)  \rVert^2 + \lVert v \rVert^2} \right) \\
&\ge g \left(  \lVert \sum_{i = 1}^n \alpha_i \varphi(x_i) \rVert \right).
\end{align}

Therefore setting v = 0 does not affect the first term of (*), while it strictly decreases the second term whenever v \ne 0. Consequently, any minimizer f^{*} in (*) must have v = 0, i.e., it must be of the form


 f^{*}(\cdot) = \sum_{i = 1}^n \alpha_i \varphi(x_i) = \sum_{i = 1}^n \alpha_i k(\cdot, x_i),

which is the desired result.
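
The two steps of the proof can be checked numerically in a setting where the feature map is explicit. The following is a minimal sketch, assuming the linear kernel k(x, x') = x \cdot x' on \R^d with d > n, so that \varphi(x) = x and every f \in H_k has the form f(x) = w \cdot x with \lVert f \rVert = \lVert w \rVert. The names `split_span_orthogonal`, `X`, and `w` are illustrative and do not come from the cited paper.

```python
import numpy as np

def split_span_orthogonal(X, w):
    """Split w into its projection onto span{x_1, ..., x_n} (rows of X) plus a remainder v."""
    # Least squares finds alpha such that X.T @ alpha is the orthogonal
    # projection of w onto the span of the training points.
    alpha, *_ = np.linalg.lstsq(X.T, w, rcond=None)
    w_span = X.T @ alpha
    v = w - w_span
    return alpha, w_span, v

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer points than dimensions, so v is generally nonzero
X = rng.normal(size=(n, d))       # training inputs x_1, ..., x_n as rows
w = rng.normal(size=d)            # an arbitrary f in H_k, f(x) = w . x

alpha, w_span, v = split_span_orthogonal(X, w)
K = X @ X.T                       # Gram matrix K[i, j] = k(x_i, x_j)

values_f = X @ w                  # f(x_j) for the original function
values_span = K @ alpha           # sum_i alpha_i k(x_i, x_j), i.e. f with v removed
norm_f = np.linalg.norm(w)
norm_span = np.linalg.norm(w_span)
```

In this sketch `values_f` and `values_span` agree up to rounding, so any empirical risk E is unchanged by dropping `v`, while `norm_span` does not exceed `norm_f`.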

Generalizations

The Theorem stated above is a particular example of a family of results that are collectively referred to as "Representer Theorems"; here we describe several such.

The first statement of a Representer Theorem was due to Kimeldorf and Wahba for the special case in which


\begin{align}
E\left( (x_1, y_1, f(x_1)), ...,  (x_n, y_n, f(x_n)) \right) &= \frac{1}{n} \sum_{i = 1}^n (f(x_i) - y_i)^2, \\
g(\lVert f \rVert) &= \lambda \lVert f \rVert^2
\end{align}

for \lambda > 0. Schölkopf, Herbrich, and Smola generalized this result by relaxing the assumption of the squared-loss cost and allowing the regularizer to be any strictly monotonically increasing function g(\cdot) of the Hilbert space norm.
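
In the Kimeldorf and Wahba setting the coefficients can be written in closed form. Substituting f = \sum_i \alpha_i k(\cdot, x_i) into the objective gives \frac{1}{n} \lVert K\alpha - y \rVert^2 + \lambda \alpha^T K \alpha, where K_{ij} = k(x_i, x_j), and the gradient vanishes at \alpha = (K + n\lambda I)^{-1} y. The sketch below illustrates this under assumptions not taken from the source: a Gaussian kernel, synthetic data, and the helper names `gaussian_kernel`, `fit_kernel_ridge`, and `objective`.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def fit_kernel_ridge(K, y, lam):
    """Solve (K + n*lam*I) alpha = y, which zeroes the gradient of the objective."""
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def objective(K, y, alpha, lam):
    """(1/n) * ||K alpha - y||^2 + lam * alpha^T K alpha."""
    residual = K @ alpha - y
    return residual @ residual / len(y) + lam * alpha @ K @ alpha

# Toy data (assumed for illustration): noisy samples of a sine curve.
rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 30)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
lam = 1e-2

K = gaussian_kernel(X, X)
alpha = fit_kernel_ridge(K, y, lam)

# Evaluate f*(x) = sum_i alpha_i k(x, x_i) at new points.
X_new = np.array([[-1.0], [0.0], [1.0]])
f_new = gaussian_kernel(X_new, X) @ alpha
```

Only the n coefficients in `alpha` and the training inputs are needed to evaluate the fitted function anywhere in \mathcal{X}.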

It is possible to generalize further by augmenting the regularized empirical risk function through the addition of unpenalized offset terms. For example, Schölkopf, Herbrich, and Smola also consider the minimization


 \tilde{f}^{*} = \operatorname{arg min} \left\lbrace E\left( (x_1, y_1, \tilde{f}(x_1)),  ...,  (x_n, y_n, \tilde{f}(x_n)) \right) + g\left( \lVert f \rVert \right) \mid \tilde{f} = f  + h \in H_k \oplus  \operatorname{span} \lbrace \psi_p \mid 1 \le p \le M \rbrace  \right \rbrace, \quad (\dagger)

i.e., we consider functions of the form \tilde{f} = f + h, where f \in H_k and h is an unpenalized function lying in the span of a finite set of real-valued functions \lbrace \psi_p \colon \mathcal{X} \to \R \mid 1 \le p \le M \rbrace. Under the assumption that the n \times M matrix \left( \psi_p(x_i) \right)_{ip} has rank M, they show that the minimizer \tilde{f}^{*} in (\dagger) admits a representation of the form


 \tilde{f}^{*}(\cdot) = \sum_{i = 1}^n \alpha_i k(\cdot, x_i) + \sum_{p = 1}^M \beta_p \psi_p(\cdot)

where \alpha_i, \beta_p \in \R and the \beta_p are all uniquely determined.
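
As an illustration of the semiparametric form, consider the squared loss with g(\lVert f \rVert) = \lambda \lVert f \rVert^2 and a single unpenalized constant offset \psi_1 \equiv 1. Setting the gradients with respect to \alpha and \beta to zero shows that it suffices to solve the linear system (K + n\lambda I)\alpha + \Psi\beta = y, \Psi^T \alpha = 0, where \Psi_{ip} = \psi_p(x_i). The sketch below makes these assumptions explicit; the kernel, the data, and the function name `fit_semiparametric_ridge` are hypothetical choices, not part of the cited result.

```python
import numpy as np

def fit_semiparametric_ridge(K, Psi, y, lam):
    """Solve the block system [[K + n*lam*I, Psi], [Psi^T, 0]] [alpha; beta] = [y; 0]."""
    n, M = Psi.shape
    top = np.hstack([K + n * lam * np.eye(n), Psi])
    bottom = np.hstack([Psi.T, np.zeros((M, M))])
    rhs = np.concatenate([y, np.zeros(M)])
    solution = np.linalg.solve(np.vstack([top, bottom]), rhs)
    return solution[:n], solution[n:]

# Toy data (assumed): a sine curve shifted by a large constant offset.
x = np.linspace(0.0, 1.0, 20)
y = 5.0 + np.sin(2.0 * np.pi * x)
lam = 1e-2

K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * 0.2**2))  # Gaussian kernel
Psi = np.ones((len(x), 1))        # psi_1(x) = 1; this n x M matrix has rank M = 1

alpha, beta = fit_semiparametric_ridge(K, Psi, y, lam)
fitted = K @ alpha + Psi @ beta   # values of f~* at the training points
```

Because the offset is not penalized, `beta` can absorb the constant shift in the targets while the norm penalty acts only on the kernel part.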

The conditions under which a Representer Theorem exists were investigated by Argyriou, Micchelli, and Pontil, who proved the following:

Theorem: Let \mathcal{X} be a nonempty set, k a positive-definite real-valued kernel on \mathcal{X} \times \mathcal{X} with corresponding reproducing kernel Hilbert space H_k, and let R \colon H_k \to \R be a differentiable regularization function. Then given a training sample (x_1, y_1), ..., (x_n, y_n) \in \mathcal{X} \times \R and an arbitrary empirical risk function E \colon (\mathcal{X} \times \R^2)^n \to \R \cup \lbrace \infty \rbrace, a minimizer


f^{*} =  \operatorname{arg min}_{f \in H_k} \left\lbrace E\left( (x_1, y_1, f(x_1)), ...,  (x_n, y_n, f(x_n)) \right) + R(f) \right \rbrace \quad (\ddagger)

of the regularized empirical risk minimization problem admits a representation of the form


 f^{*}(\cdot) = \sum_{i = 1}^n \alpha_i k(\cdot, x_i),

where \alpha_i \in \R for all 1 \le i \le n, if and only if there exists a nondecreasing function h \colon [0, \infty) \to \R for which


R(f) = h(\lVert f \rVert).

Effectively, this result provides a necessary and sufficient condition on a differentiable regularizer R(\cdot) under which the corresponding regularized empirical risk minimization (\ddagger) will have a Representer Theorem. In particular, this shows that a broad class of regularized risk minimizations (much broader than those originally considered by Kimeldorf and Wahba) have Representer Theorems.
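
A small instance makes the role of the condition R(f) = h(\lVert f \rVert) concrete. The sketch below assumes the linear kernel on \R^2 (so f(x) = w \cdot x and \lVert f \rVert = \lVert w \rVert), a single training point, the squared loss, and two differentiable quadratic regularizers: \lambda \lVert w \rVert^2, which depends on w only through its norm, and \lambda w^T A w with A = \operatorname{diag}(1, 4), which does not. It is an illustrative example only, not the construction used by Argyriou, Micchelli, and Pontil, and the names `quadratic_minimizer` and `cross2` are hypothetical.

```python
import numpy as np

def quadratic_minimizer(x, y, lam, P):
    """Minimizer of (w . x - y)^2 + lam * w^T P w for positive-definite P."""
    # Setting the gradient to zero gives (x x^T + lam * P) w = y * x.
    return np.linalg.solve(np.outer(x, x) + lam * P, y * x)

def cross2(a, b):
    """2-D cross product; it is zero exactly when a and b are parallel."""
    return a[0] * b[1] - a[1] * b[0]

x = np.array([1.0, 1.0])          # the single training input
y = 1.0
lam = 0.1

w_norm = quadratic_minimizer(x, y, lam, np.eye(2))             # R depends only on ||w||
w_aniso = quadratic_minimizer(x, y, lam, np.diag([1.0, 4.0]))  # R is not a function of ||w||

in_span_norm = cross2(w_norm, x)     # near zero: minimizer is a multiple of x = k(., x_1)
in_span_aniso = cross2(w_aniso, x)   # clearly nonzero: minimizer leaves the span
```

With the norm-based regularizer the minimizer is a multiple of the training point, as the theorem predicts, whereas the anisotropic regularizer pulls it out of the span.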

Applications

Representer theorems are useful from a practical standpoint because they dramatically simplify the regularized empirical risk minimization problem (\ddagger). In most interesting applications, the search domain H_k for the minimization will be an infinite-dimensional subspace of L^2(\mathcal{X}), and therefore the search (as written) does not admit implementation on finite-memory and finite-precision computers. In contrast, the representation of f^{*}(\cdot) afforded by a representer theorem reduces the original (infinite-dimensional) minimization problem to a search for the optimal n-dimensional vector of coefficients \alpha = (\alpha_1, ..., \alpha_n) \in \R^n; \alpha can then be obtained by applying any standard function minimization algorithm. Consequently, representer theorems provide the theoretical basis for the reduction of the general machine learning problem to algorithms that can actually be implemented on computers in practice.
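
The reduction to a finite-dimensional search can be sketched for a loss with no closed-form solution. The example below assumes the logistic loss with regularizer \lambda \lVert f \rVert^2, a Gaussian kernel, synthetic labels in \lbrace -1, +1 \rbrace, and plain gradient descent as the "standard function minimization algorithm"; all names (`rbf_kernel`, `objective`, `gradient`, `fit`) and the step-size rule are choices made for this illustration.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def objective(alpha, K, y, lam):
    """Mean logistic loss of f = sum_i alpha_i k(., x_i) plus lam * ||f||^2."""
    margins = y * (K @ alpha)
    return np.mean(np.logaddexp(0.0, -margins)) + lam * alpha @ K @ alpha

def gradient(alpha, K, y, lam):
    """Gradient of the objective with respect to the coefficient vector alpha."""
    margins = y * (K @ alpha)
    # Derivative of log(1 + exp(-m_i)) with respect to f(x_i), computed stably.
    weights = -y * np.exp(-np.logaddexp(0.0, margins))
    return K @ weights / len(y) + 2.0 * lam * (K @ alpha)

def fit(K, y, lam, steps=500):
    """Gradient descent over alpha in R^n with a step size from a smoothness bound."""
    norm_K = np.linalg.norm(K, 2)
    step = 1.0 / (norm_K**2 / (4.0 * len(y)) + 2.0 * lam * norm_K)
    alpha = np.zeros(len(y))
    for _ in range(steps):
        alpha -= step * gradient(alpha, K, y, lam)
    return alpha

# Toy data (assumed): labels follow an XOR-like pattern that is not linearly separable.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
lam = 1e-2

K = rbf_kernel(X, X)
alpha = fit(K, y, lam)
start_value = objective(np.zeros(len(y)), K, y, lam)
final_value = objective(alpha, K, y, lam)
```

The optimizer never handles an element of H_k directly; it only manipulates the n-dimensional vector `alpha` and the n \times n Gram matrix `K`.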

    This article is issued from Wikipedia - version of the Sunday, March 13, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.