Hinge loss

Plot of hinge loss (blue) vs. zero-one loss (misclassification, green: y < 0) for t = 1 and variable y. Note that the hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1] For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

\ell(y) = \max(0, 1-t \cdot y)

Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = \mathbf{w} \cdot \mathbf{x} + b, where (\mathbf{w},b) are the parameters of the hyperplane and \mathbf{x} is the point to classify.

It can be seen that when t and y have the same sign (meaning y predicts the right class) and |y| \ge 1, the hinge loss \ell(y) = 0. When they have opposite sign, \ell(y) increases linearly with |y| (a one-sided error); and even when the signs agree, a score with |y| < 1 is still penalized, since the prediction falls inside the margin.
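The scalar definition above translates directly into code. A minimal sketch (the function name hinge_loss is ours):

```python
def hinge_loss(y, t):
    """Hinge loss max(0, 1 - t*y) for a raw classifier score y and target t in {-1, +1}."""
    return max(0.0, 1.0 - t * y)
```

For example, a confident correct prediction (t = 1, y = 2) incurs no loss, a correct prediction inside the margin (y = 0.5) incurs loss 0.5, and a wrong prediction (y = -1) incurs loss 2.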

Extensions

While SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] there exists a "true" multiclass version of the hinge loss due to Crammer and Singer,[3] defined for a linear classifier with one weight vector \mathbf{w}_y per class and correct class t as[4]

\ell(y) = \max(0, 1 + \max_{y \ne t} \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x})
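With t the index of the correct class and scores[y] the per-class score \mathbf{w}_y \cdot \mathbf{x}, the Crammer–Singer loss can be sketched as (the function name is ours):

```python
def multiclass_hinge_loss(scores, t):
    """Crammer-Singer multiclass hinge loss.

    scores -- per-class scores w_y . x (a list of floats)
    t      -- index of the correct class
    """
    # Score of the best-scoring incorrect class.
    worst_rival = max(s for y, s in enumerate(scores) if y != t)
    return max(0.0, 1.0 + worst_rival - scores[t])
```

The loss is zero only when the correct class outscores every rival by at least the unit margin.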

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where \mathbf{w} denotes the SVM's parameters, \mathbf{y} an output from the structured output space \mathcal{Y}, \mathbf{t} the correct output, φ the joint feature function, and Δ the Hamming loss:

\begin{align}
\ell(\mathbf{y}) & = \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\
                 & = \max(0, \max_{y \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle)
\end{align}
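For a small, explicitly enumerable output space, the margin-rescaled structured hinge can be sketched directly (the toy phi, delta, and function names below are illustrative assumptions, not part of any standard API):

```python
def structured_hinge(w, phi, x, t, output_space, delta):
    """Structured hinge loss with margin rescaling.

    w            -- weight vector
    phi(x, y)    -- joint feature map, returns a vector
    delta(y, t)  -- task loss, e.g. Hamming loss
    output_space -- iterable of candidate outputs y
    t            -- the correct output
    """
    def score(y):
        return sum(wi * fi for wi, fi in zip(w, phi(x, y)))
    # Loss-augmented decoding: maximize delta + score over all outputs.
    augmented = max(delta(y, t) + score(y) for y in output_space)
    return max(0.0, augmented - score(t))
```

In practice the output space is exponentially large, and the inner maximization (loss-augmented decoding) is solved with a combinatorial algorithm rather than by enumeration.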

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable everywhere, but it has a subgradient with respect to the model parameters \mathbf{w} of a linear SVM with score function y = \mathbf{w} \cdot \mathbf{x}, given by

\frac{\partial\ell}{\partial w_i} = \begin{cases}
 -t \cdot x_i & \text{if } t \cdot y < 1 \\
 0            & \text{otherwise}
\end{cases}
Plot of three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piece-wise smooth version by Rennie and Srebro (red).
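The subgradient above is what (sub)gradient-descent SVM training plugs into its update rule. A minimal sketch for a linear model without bias (function names are ours):

```python
def hinge_subgradient(w, x, t):
    """A subgradient of the hinge loss w.r.t. w, for score y = w . x."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    if t * y < 1:
        return [-t * xi for xi in x]   # margin violated: push the score toward t
    return [0.0] * len(w)              # margin satisfied: zero subgradient

def sgd_step(w, x, t, lr=0.5):
    """One stochastic subgradient-descent step on a single example."""
    g = hinge_subgradient(w, x, t)
    return [wi - lr * gi for wi, gi in zip(w, g)]
```

Only examples on the wrong side of the margin contribute to the update, which is why SVM solutions depend only on the support vectors.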

However, since the hinge loss is not differentiable at ty = 1, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[5]

\ell(y) = \begin{cases}
\frac{1}{2} - ty       & \text{if} ~~ ty \le 0, \\
\frac{1}{2} (1 - ty)^2 & \text{if} ~~ 0 < ty \le 1, \\
0                      & \text{if} ~~ 1 \le ty
\end{cases}

or the quadratically smoothed

\ell(y) = \frac{1}{2\gamma} \max(0, 1 - ty)^2

suggested by Zhang.[6] The modified Huber loss is a special case of this loss function with \gamma = 2.[6]
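Both smoothed variants are easy to state in code as functions of z = ty (a sketch; the function names are ours):

```python
def smooth_hinge(z):
    """Rennie & Srebro's piecewise-smooth hinge loss, as a function of z = t*y."""
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1.0 - z) ** 2
    return 0.0

def quad_smoothed_hinge(z, gamma=2.0):
    """Zhang's quadratically smoothed hinge; gamma = 2 recovers the modified Huber loss."""
    return max(0.0, 1.0 - z) ** 2 / (2.0 * gamma)
```

Unlike the plain hinge, both variants are differentiable at z = 1, which makes ordinary gradient-based optimization applicable.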

References

  1. Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
  2. Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study". Multiple Classifier Systems (PDF). LNCS 3541. pp. 278–285. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
  3. Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines" (PDF). J. Machine Learning Research 2: 265–292.
  4. Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models" (PDF). Proc. Symp. on Machine Learning in Speech and Language Processing.
  5. Rennie, Jason D. M.; Srebro, Nathan (2005). Loss Functions for Preference Levels: Regression with Discrete Ordered Labels (PDF). Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling.
  6. Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms". ICML.
This article is issued from Wikipedia (version of Tuesday, November 24, 2015). The text is available under the Creative Commons Attribution/Share Alike license; additional terms may apply for the media files.