URL normalization

Not to be confused with URL canonicalization.

URL normalization is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized URL so it is possible to determine if two syntactically different URLs may be equivalent.

Search engines employ URL normalization in order to assign importance to web pages and to reduce indexing of duplicate pages. Web crawlers perform URL normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached.

Normalization process

There are several types of normalization that may be performed. Some of them always preserve semantics, and some may not.

Normalizations that preserve semantics

The following normalizations are described in RFC 3986 [1] as resulting in equivalent URLs:

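Converting the scheme and host to lowercase: HTTP://www.Example.com/ → http://www.example.com/
Capitalizing the letters in percent-encoded triplets: http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b
Decoding percent-encoded triplets of unreserved characters: http://www.example.com/%7Eusername/ → http://www.example.com/~username/
Removing the default port: http://www.example.com:80/bar.html → http://www.example.com/bar.html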
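
The four normalizations above can be sketched in Python using only the standard library. This is a minimal illustration rather than a complete RFC 3986 implementation: the names normalize_safe, UNRESERVED and DEFAULT_PORTS are invented for this example, and only the default ports of http and https are assumed to be known.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters as listed in RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Assumption: only the http and https default ports are handled here.
DEFAULT_PORTS = {"http": "80", "https": "443"}
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escapes(text):
    """Decode escapes of unreserved characters; uppercase all other escapes."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return _ESCAPE.sub(repl, text)


def normalize_safe(url):
    """Hypothetical helper applying the four normalizations listed above."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Leave any userinfo untouched; only the host part is lowercased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    hostport = hostport.lower()
    default = DEFAULT_PORTS.get(scheme)
    if default and hostport.endswith(":" + default):
        hostport = hostport[: -(len(default) + 1)]
    return urlunsplit((
        scheme,
        userinfo + at + hostport,
        _fix_escapes(parts.path),
        _fix_escapes(parts.query),
        _fix_escapes(parts.fragment),
    ))
```

For example, normalize_safe("HTTP://www.Example.com:80/%7Eusername/") yields "http://www.example.com/~username/".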

Normalizations that usually preserve semantics

For http and https URLs, the following normalizations listed in RFC 3986 may result in equivalent URLs, but the standards do not guarantee that they do:

Adding a trailing slash: http://www.example.com/alice → http://www.example.com/alice/
However, there is no way to know whether a URL path component represents a directory. RFC 3986 notes that if the former URL redirects to the latter URL, that is an indication that they are equivalent.
Removing dot-segments: http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html
However, if a removed ".." component, e.g. "b/..", is a symlink to a directory with a different parent, eliding "b/.." will result in a different path and URL.[3] In rare cases, depending on the web server, this may even be true for the root directory (e.g. "//www.example.com/.." may not be equivalent to "//www.example.com/").
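
Dot-segment removal can be sketched as follows. The function follows the steps of the "Remove Dot Segments" procedure in RFC 3986, section 5.2.4; the name remove_dot_segments is chosen for this illustration. It works purely on the path string, so it cannot detect the symlink case described above.

```python
def remove_dot_segments(path):
    """Remove "." and ".." segments from a URL path string."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            # Step up one level: drop the last segment already emitted, if any.
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)
```

Applied to the path of the example above, remove_dot_segments("/../a/b/../c/./d.html") returns "/a/c/d.html".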

Normalizations that change semantics

Applying the following normalizations results in a semantically different URL, although it may refer to the same resource:

Removing the directory index: http://www.example.com/default.asp → http://www.example.com/
http://www.example.com/a/index.html → http://www.example.com/a/
Removing the fragment: http://www.example.com/bar.html#section1 → http://www.example.com/bar.html
However, AJAX applications frequently use the value in the fragment.
Replacing the IP address with the domain name: http://208.77.188.166/ → http://www.example.com/
The reverse replacement is rarely safe due to virtual web servers.
Limiting protocols: https://www.example.com/ → http://www.example.com/
Removing duplicate slashes: http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html
Removing or adding "www" as the first domain label: http://www.example.com/ → http://example.com/
Sorting the query parameters: http://www.example.com/display?lang=en&article=fred → http://www.example.com/display?article=fred&lang=en
However, the order of parameters in a URL may be significant (this is not defined by the standard) and a web server may allow the same variable to appear multiple times.[4]
Removing unused query variables: http://www.example.com/display?id=123&fakefoo=fakebar → http://www.example.com/display?id=123
Note that a parameter without a value is not necessarily an unused parameter.
Removing default query parameters: http://www.example.com/display?id=&sort=ascending → http://www.example.com/display
Removing the "?" when the query is empty: http://www.example.com/display? → http://www.example.com/display
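
Three of these normalizations (removing the fragment, sorting the query parameters, and dropping an empty "?") are mechanical enough to sketch in Python. The function name normalize_lossy and its two flags are hypothetical and exist only for this illustration; the other rules in the list need site-specific knowledge, such as which parameters are unused, and are not attempted. Because these rewrites can change meaning, each one is left as an opt-out.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_lossy(url, sort_query=True, drop_fragment=True):
    """Hypothetical helper; these rewrites may change what the URL means."""
    parts = urlsplit(url)
    # Split on "&" without decoding, so existing escapes are left untouched.
    params = [p for p in parts.query.split("&") if p]
    if sort_query:
        # Stable sort by name only, so repeated names keep their relative order.
        params.sort(key=lambda p: p.partition("=")[0])
    query = "&".join(params)
    fragment = "" if drop_fragment else parts.fragment
    # urlunsplit omits the "?" and "#" delimiters for empty components.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))
```

Sorting by name alone, rather than by the whole "name=value" pair, is a deliberate choice here: a server that accepts the same variable several times may care about the order of its values.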

Normalization based on URL lists

Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL

http://example.com/story?id=xyz

appears in a crawl log several times along with

http://example.com/story_xyz

we may assume that the two URLs are equivalent and can be normalized to one of the URL forms.
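
Once such a rule has been accepted, applying it to a URL list is straightforward. The sketch below hard-codes one hypothetical rewrite rule for the example above, written as a regular expression; it does not show how a rule would be discovered, and the names RULES, apply_rules and deduplicate are invented for this illustration.

```python
import re

# Hypothetical site-specific rule: /story?id=X is treated as the same page as /story_X.
RULES = [(re.compile(r"/story\?id=([^&#]+)$"), r"/story_\1")]


def apply_rules(url, rules=RULES):
    """Rewrite a URL into the form chosen as the normalized one."""
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url


def deduplicate(urls, rules=RULES):
    """Return the normalized URLs in first-seen order, without repeats."""
    seen = set()
    unique = []
    for url in urls:
        key = apply_rules(url, rules)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

With this rule, a list containing both http://example.com/story?id=xyz and http://example.com/story_xyz is reduced to the single entry http://example.com/story_xyz.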

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URLs in a URL list.


This article is issued from Wikipedia - version of the Monday, April 25, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.