Huber loss

In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in data than the squared error loss. A variant for classification is also sometimes used.

Definition

Figure: Huber loss (green, \delta = 1) and squared error loss (blue) as a function of y - f(x).

The Huber loss function describes the penalty incurred by an estimation procedure f. Huber (1964) defines the loss function piecewise by[1]


L_\delta (a) = \begin{cases}
 \frac{1}{2}{a^2}                   & \text{for } |a| \le \delta, \\
 \delta (|a| - \frac{1}{2}\delta), & \text{otherwise.}
\end{cases}

This function is quadratic for small values of a and linear for large values, with equal values and slopes of the two sections at the two points where |a| = \delta. The variable a often refers to the residual, that is, the difference between the observed and predicted values, a = y - f(x), so the former expression can be expanded to[2]


L_\delta(y, f(x)) = \begin{cases}
 \frac{1}{2}(y - f(x))^2                   & \text{for } |y - f(x)| \le \delta, \\
 \delta\, |y - f(x)| - \frac{1}{2}\delta^2 & \text{otherwise.}
\end{cases}
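
For illustration, the definition above translates directly into a few lines of NumPy (a minimal sketch; the name huber_loss is illustrative, not taken from any particular library):

    import numpy as np

    def huber_loss(a, delta=1.0):
        """Huber loss of a residual a = y - f(x), applied elementwise."""
        a = np.asarray(a, dtype=float)
        quadratic = 0.5 * a**2                      # branch for |a| <= delta
        linear = delta * (np.abs(a) - 0.5 * delta)  # branch for |a| > delta
        return np.where(np.abs(a) <= delta, quadratic, linear)

    # At the boundary |a| = delta both branches agree, giving delta**2 / 2:
    print(huber_loss([0.5, 1.0, 3.0], delta=1.0))   # [0.125 0.5   2.5  ]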

Motivation

Two very commonly used loss functions are the squared loss, L(a) = a^2, and the absolute loss, L(a) = |a|. The squared loss function results in an arithmetic mean-unbiased estimator, and the absolute-value loss function results in a median-unbiased estimator (in the one-dimensional case, and a geometric-median-unbiased estimator in the multi-dimensional case). The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of a's (as in {\textstyle \sum_{i=1}^n L(a_i) }), the sample mean is influenced too much by a few particularly large a-values when the distribution is heavy-tailed. In terms of estimation theory, the asymptotic relative efficiency of the mean is poor for heavy-tailed distributions.

As defined above, the Huber loss function is convex in a uniform neighborhood of its minimum a = 0; at the boundary of this neighborhood, the points a = -\delta and a = \delta, it has a differentiable extension to an affine function. These properties allow it to combine much of the sensitivity of the mean-unbiased, minimum-variance estimator of the mean (using the quadratic loss function) with the robustness of the median-unbiased estimator (using the absolute-value loss function).
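
This behavior can be checked numerically. The sketch below (our own illustration, not taken from the sources cited here) estimates a location parameter by minimizing the summed Huber loss via iteratively reweighted least squares, one standard approach; the mean is dragged far toward a single outlier while the Huber estimate stays near the bulk of the data:

    import numpy as np

    def huber_location(x, delta=1.0, iters=100):
        # Minimize sum_i L_delta(x_i - mu) over mu by iteratively
        # reweighted least squares with the Huber weights:
        # w_i = 1 where |r_i| <= delta, else delta / |r_i|.
        x = np.asarray(x, dtype=float)
        mu = np.median(x)  # robust starting point
        for _ in range(iters):
            r = np.abs(x - mu)
            w = np.where(r <= delta, 1.0, delta / np.maximum(r, delta))
            mu = np.sum(w * x) / np.sum(w)
        return mu

    x = np.array([0.8, 1.0, 1.1, 0.9, 1.2, 50.0])  # one gross outlier
    print(np.mean(x))            # ~9.17: the mean chases the outlier
    print(np.median(x))          # 1.05
    print(huber_location(x))     # ~1.2: close to the bulk of the data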

Pseudo-Huber loss function

The Pseudo-Huber loss function can be used as a smooth approximation of the Huber loss function, and it ensures that derivatives of all orders are continuous. It is defined as[3][4]

L_\delta (a) = \delta^2(\sqrt{1+(a/\delta)^2}-1).

As such, this function approximates a^2/2 for small values of a, and approximates a straight line with slope \delta for large values of a.
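
A short numerical check of this limiting behavior (again a sketch, with an illustrative function name):

    import numpy as np

    def pseudo_huber(a, delta=1.0):
        # Smooth everywhere; ~ a**2 / 2 near zero, slope delta far out.
        a = np.asarray(a, dtype=float)
        return delta**2 * (np.sqrt(1.0 + (a / delta)**2) - 1.0)

    print(pseudo_huber(0.01), 0.5 * 0.01**2)  # nearly equal near zero
    print(pseudo_huber(100.0) / 100.0)        # ~1.0, i.e. slope delta = 1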

While the above is the most common form, other smooth approximations of the Huber loss function also exist.[5]

Variant for classification

For classification purposes, a variant of the Huber loss called modified Huber is sometimes used. Given a prediction f(x) (a real-valued classifier score) and a true binary class label y \in \{+1, -1\}, the modified Huber loss is defined as[6]


L(y, f(x)) = \begin{cases}
 \max(0, 1 - y \, f(x))^2 & \text{for } y \, f(x) \ge -1, \\
 -4 y \, f(x)             & \text{otherwise.}
\end{cases}

The term \max(0, 1 - y \, f(x)) is the hinge loss used by support vector machines; the quadratically smoothed hinge loss is a generalization of L.[6]
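
In code, the piecewise definition above becomes (a minimal sketch; modified_huber is an illustrative name, not a library function):

    import numpy as np

    def modified_huber(y, score):
        # y is a label in {+1, -1}; score is the real-valued f(x).
        y = np.asarray(y, dtype=float)
        score = np.asarray(score, dtype=float)
        m = y * score                          # the margin y * f(x)
        squared = np.maximum(0.0, 1.0 - m)**2  # branch for m >= -1
        linear = -4.0 * m                      # branch for m < -1
        return np.where(m >= -1.0, squared, linear)

    # The two branches meet continuously at m = -1, where both equal 4:
    print(modified_huber([+1, +1], [-1.0, -2.0]))  # [4. 8.]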

Applications

The Huber loss function is used in robust statistics, M-estimation and additive modelling.[7]

References

  1. Huber, Peter J. (1964). "Robust Estimation of a Location Parameter". Annals of Mathematical Statistics 35 (1): 73–101. doi:10.1214/aoms/1177703732. JSTOR 2238020.
  2. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). The Elements of Statistical Learning. p. 349. Compared to Hastie et al., the loss is scaled by a factor of ½, to be consistent with Huber's original definition given earlier.
  3. Charbonnier, P.; Blanc-Feraud, L.; Aubert, G.; Barlaud, M. (1997). "Deterministic edge-preserving regularization in computed imaging". IEEE Trans. Image Processing 6 (2): 298–311. doi:10.1109/83.551699.
  4. Hartley, R.; Zisserman, A. (2003). Multiple View Geometry in Computer Vision (2nd ed.). Cambridge University Press. p. 619. ISBN 0-521-54051-8.
  5. Lange, K. (1990). "Convergence of Image Reconstruction Algorithms with Gibbs Smoothing". IEEE Trans. Medical Imaging 9 (4): 439–446. doi:10.1109/42.61759.
  6. Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms". ICML.
  7. Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine". Annals of Statistics 29 (5): 1189–1232. doi:10.1214/aos/1013203451. JSTOR 2699986.