Comparison of HTML parsers
HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:
- HTML traversal: offering an interface that lets programmers easily access and modify the HTML source (the "HTML string code"). Canonical example: DOM parsers.
- HTML cleaning: fixing invalid HTML and improving the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
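As a rough illustration of the traversal use case, the sketch below uses html.parser from the Python standard library (listed in the table below) to collect link targets from a markup string. Note that html.parser is event-driven rather than a DOM parser, so this shows only the "access" half of traversal. The names `LinkCollector`, `collect_links`, and the sample markup are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value may be None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Return the href values found in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical sample input, including an upper-case tag.
sample = '<p>See <a href="https://example.org/a">A</a> and <A HREF="/b">B</A>.</p>'
print(collect_links(sample))  # ['https://example.org/a', '/b']
```

A tree-building parser such as those marked "Yes" under HTML parsing would typically expose the same information through element queries instead of callbacks.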
Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | Clean HTML** | Update HTML*** |
---|---|---|---|---|---|---|
html.parser | Python Software Foundation License | Python | 2015-02-25[2] | Yes | No | No |
Html Agility Pack | Microsoft Public License | C# | 2014-09-16[3] | Yes | No | ? |
Beautiful Soup (based on lxml and html5lib)[4] | Python Software Foundation License | Python | 2015-07-03 | Yes | Yes | Yes |
Gumbo | Apache License 2.0 | C | 2013-08-13 | Yes | ? | ? |
html5lib | MIT License | Python (and PHP, six years ago) | 2013-12-23[5] | Yes | Yes | No |
HTML::Parser | Perl license | Perl | 2013-03-28 | Yes[6] | ? | ? |
htmlPurifier | GNU Lesser GPL | PHP | 2009-03-25[7] | No | Yes | Yes |
HTML Tidy | W3C license | ANSI C | 2015-05-24[8] | No[9] | Yes[10] | Yes[11] |
HtmlUnit | Apache License 2.0 | Java | 2014-06-02 (version 2.15) | Yes | No | No |
HtmlCleaner | BSD License[12] | Java | 2015-08-24 | No | Yes | ? |
Hubbub | MIT License | C | 2013-04-19 | Yes | ? | ? |
Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | Yes | No |
Jericho HTML Parser | Eclipse Public License | Java | 2012-10-30[13] | No?? | ? | ? |
jsdom | MIT license | JavaScript | 2013-07-21 | No | ? | ? |
jsoup | MIT license | Java | 2016-04-16[14] | Yes | Yes | Yes |
JTidy | JTidy License | Java | 2012-10-09[15] | No | Yes | ? |
libxml2 HTMLparser | MIT License | C | 2012-09-11[16] | Yes | ? | ? |
NekoHTML | Apache License 2.0 | Java | 2014-06-02[17] | No | ? | ? |
TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? |
Validator.nu HTML Parser | MIT License | Java | 2012-06-05 | Yes | ? | ? |
PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | No | No |
The PHP DOMDocument-class | PHP License | PHP | 2014-10-04 | Yes | No | No |
Nokogiri | MIT License | Ruby | 2015-01-23[18] | Yes | No | No |
AVHTML | AGPL | C++ | 2015-07-17 | Yes | No | Yes |
- * Date of the latest release that contained significant changes.
- ** Sanitizes HTML code (generates a standards-compatible web page, reduces spam, etc.) and cleans it (strips out surplus presentational tags, removes XSS code, etc.).
- *** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
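The "Update HTML" conversion described in the last footnote can be sketched with Python's standard html.parser module. This is a minimal illustration under stated assumptions, not how any parser in the table implements it: the names `CenterRewriter` and `update_center` are hypothetical, only the CENTER tag is handled, and any attributes on CENTER are dropped.

```python
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Re-emit markup, replacing <center> with a styled <div> (illustrative)."""

    OPEN_DIV = '<div style="text-align:center;">'

    def __init__(self):
        # convert_charrefs=False keeps entity and character references
        # as separate events so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption of this sketch: attributes on <center> are dropped.
            self.out.append(self.OPEN_DIV)
        else:
            # Reuse the original text of the start tag.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through as written.
        if tag == "center":
            self.out.append(self.OPEN_DIV + "</div>")
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(markup):
    """Return markup with every CENTER element rewritten as a styled DIV."""
    rewriter = CenterRewriter()
    rewriter.feed(markup)
    rewriter.close()
    return "".join(rewriter.out)


print(update_center("<CENTER>Hi &amp; bye</CENTER>"))
# <div style="text-align:center;">Hi &amp; bye</div>
```

Because the sketch only rewrites tags and does not repair malformed input, it covers the "Update HTML" column but not the "Clean HTML" one.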
References
- [1] 12.2 Parsing HTML documents, HTML Standard
- [2] Python 3.4.3
- [3] Nuget Html AgilityPack
- [4] http://www.crummy.com/software/BeautifulSoup/
- [5] Releases · html5lib/html5lib-python
- [6] Bug #53300 for HTML-Parser: HTML 5
- [7] HTML Tidy for Windows
- [8] HTML Tidy release 4.9.30
- [9] What is Tidy?
- [10] What is Tidy?
- [11] What is Tidy?
- [12] HtmlCleaner is distributed under BSD License
- [13] Jericho HTML Parser - Browse /jericho-html/3.3 at SourceForge.net
- [14] jsoup release 1.9.1
- [15] JTidy - Browse /JTidy at SourceForge.net
- [16] libxml2 Releases
- [17] NekoHTML | Change History
- [18] Nokogiri release 1.6.6.2
This article is issued from Wikipedia, version of Tuesday, April 19, 2016. The text is available under the Creative Commons Attribution/Share Alike license, but additional terms may apply to the media files.