Comparison of HTML parsers

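HTML parsers are software for automated parsing of Hypertext Markup Language (HTML). They have two main purposes:

- HTML traversal: offering programmers an interface to access and modify the HTML code. Canonical example: DOM parsers.
- HTML cleaning: fixing invalid HTML and improving the layout and indentation of the resulting markup. Canonical example: HTML Tidy.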
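
As an illustration of the traversal use case, the following is a minimal sketch using html.parser from the Python standard library (one of the parsers compared below). The class name LinkCollector, the helper extract_links, and the sample markup are assumptions made for this example only; they are not part of any library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(markup):
    """Feed the markup to the parser and return the collected links."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


sample = '<p>See <a href="https://example.org/a">A</a> and <A HREF="/b">B</a>.</p>'
print(extract_links(sample))
```

html.parser is event-driven and does not build a document tree; parsers that expose a DOM-like tree make the same task a matter of querying nodes rather than handling callbacks.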

| Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|
| html.parser | Python Software Foundation License | Python | 2015-02-25[2] | Yes | No | No |
| Html Agility Pack | Microsoft Public License | C# | 2014-09-16[3] | Yes | No | ? |
| Beautiful Soup (based on lxml and html5lib)[4] | Python Software Foundation License | Python | 2015-07-03 | Yes | Yes | Yes |
| Gumbo | Apache License 2.0 | C | 2013-08-13 | Yes | ? | ? |
| html5lib | MIT License | Python (and PHP, six years ago) | 2013-12-23[5] | Yes | Yes | No |
| HTML::Parser | Perl license | Perl | 2013-03-28 | Yes[6] | ? | ? |
| htmlPurifier | GNU Lesser GPL | PHP | 2009-03-25[7] | No | Yes | Yes |
| HTML Tidy | W3C license | ANSI C | 2015-05-24[8] | No[9] | Yes[10] | Yes[11] |
| HtmlUnit | Apache License 2.0 | Java | 2014-06-02 (version 2.15) | Yes | No | No |
| HtmlCleaner | BSD License[12] | Java | 2015-08-24 | No | Yes | ? |
| Hubbub | MIT License | C | 2013-04-19 | Yes | ? | ? |
| Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | Yes | No |
| Jericho HTML Parser | Eclipse Public License | Java | 2012-10-30[13] | No?? | ? | ? |
| jsdom | MIT License | JavaScript | 2013-07-21 | No | ? | ? |
| jsoup | MIT License | Java | 2016-04-16[14] | Yes | Yes | Yes |
| JTidy | JTidy License | Java | 2012-10-09[15] | No | Yes | ? |
| libxml2 HTMLparser | MIT License | C | 2012-09-11[16] | Yes | ? | ? |
| NekoHTML | Apache License 2.0 | Java | 2014-06-02[17] | No | ? | ? |
| TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? |
| Validator.nu HTML Parser | MIT License | Java | 2012-06-05 | Yes | ? | ? |
| PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | No | No |
| The PHP DOMDocument class | PHP License | PHP | 2014-10-04 | Yes | No | No |
| Nokogiri | MIT License | Ruby | 2015-01-23[18] | Yes | No | No |
| AVHTML | AGPL | C++ | 2015-07-17 | Yes | No | Yes |

* Date of the latest release with significant changes.
** Sanitizes HTML code (generates a standards-compliant web page, reduces spam, etc.) and cleans it (strips out surplus presentational tags, removes XSS code, etc.).
*** Updates HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
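
The "Update HTML" conversion described in the last note can be sketched with the standard-library html.parser. This is only a rough illustration under stated assumptions: the class CenterUpdater and the function update_center are hypothetical names, only the CENTER tag is handled, any attributes on CENTER are dropped, and other markup is re-emitted as closely as the parser events allow. It is not how any of the tools in the table implement the feature.

```python
from html.parser import HTMLParser


class CenterUpdater(HTMLParser):
    """Hypothetical sketch: rewrites <center> as <div style="text-align:center;">."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Attributes on the deprecated tag are dropped in this sketch.
            self.out.append('<div style="text-align:center;">')
        else:
            # Re-emit every other start tag exactly as it appeared.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else "</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def update_center(markup):
    """Return the markup with CENTER elements replaced by styled DIVs."""
    updater = CenterUpdater()
    updater.feed(markup)
    updater.close()
    return "".join(updater.out)


print(update_center("<CENTER>Hi &amp; <b>bye</b></CENTER>"))
```

A production-grade updater would also need to preserve attributes, merge with existing style values, and cover the other deprecated elements, which is why dedicated tools are listed separately in the "Update HTML" column.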

References

This article is issued from Wikipedia, version of Tuesday, April 19, 2016. The text is available under the Creative Commons Attribution/Share Alike license, but additional terms may apply to the media files.