Kneser–Ney smoothing

Kneser–Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories.[1] It is widely considered the most effective method of smoothing due to its use of absolute discounting by subtracting a fixed value from the probability's lower order terms to omit n-grams with lower frequencies. This approach has been considered equally effective for both higher and lower order n-grams.

A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying on only the unigram frequency to predict the frequencies of n-grams leads to skewed results;[2] however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to possible words preceding it.

Method

Let c(w,w') be the number of occurrences of the word w followed by the word w' in the corpus.

The equation for bigram probabilities is as follows:


p_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1},w_{i}) - \delta,0)}{\sum_{w'} c(w_{i-1},w')} + \lambda_{w_{i-1}} p_{KN}(w_i) 
[3]

Where the unigram probability p_{KN}(w_i) depends on how likely it is to see the word w_i in an unfamiliar context, which is estimated as the number of times it appears after any other word divided by the number of distinct pairs of consecutive words in the corpus:


p_{KN}(w_i) = \frac { |\{ w' : 0 < c(w',w_i)     \} | } 
                    { |\{ (w',w'') : 0 < c(w',w'') \} | }

Please note that p_{KN} is a proper distribution, as the values defined in above way are non-negative and sum to one.

The parameter \delta is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1.

The normalizing constant \lambda_{w_{i-1}} has value chosen carefully to make the sum of conditional probabilities p_{KN}(w_i|w_{i-1}) over all w_i equal to one. Observe that (provided \delta < 1 ) for each w_i which occurs at least once in the context of w_{i-1} in the corpus we discount the probability by exactly the same constant amount {\delta} / \left(\sum_{w'} c(w_{i-1},w')\right), so the total discount depends linearly on the number of unique words w_i that can occur after w_{i-1}. This total discount is a budget we can spread over all p_{KN}(w_i|w_{i-1}) proportionally to p_{KN}(w_i). As the values of p_{KN}(w_i) sum to one, we can simply define \lambda_{w_{i-1}} to be equal to this total discount:


\lambda_{w_{i-1}} = \frac {\delta} {\sum_{w'} c(w_{i-1},w')} |\{ w' : 0 < c(w_{i-1},w') \} |

This equation can be extended to n-grams:


p_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1},w_{i}) - \delta,0)}{\sum_{w'} c(w_{i-n+1}^{i-1},w')} + \frac{\delta}{\sum_{w_i} c(w_{i-n+1}^{i})} | \{ w' : 0 < c(w_{i-n+1}^{i-1},w') \} | p_{KN}(w_i|w_{i-n+2}^{i-1})
[4]


This model uses the concept of absolute-discounting interpolation which incorporates information from higher and lower order language models. The addition of the term for lower order n-grams adds more weight to the overall probability when the count for the higher order n-grams is zero.[5] Similarly, the weight of the lower order model decreases when the count of the n-gram is non zero.

References


This article is issued from Wikipedia - version of the Monday, April 18, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.