Parsing HTML is never easy, since the markup parsers in commercial browsers are so forgiving of HTML that is simply wrong. Microsoft’s HTML DOM parser worked great but confined the user to Windows. Finding a cross-platform, open solution proved to be difficult.
Owl uses three different solutions to accomplish HTML parsing, cleaning and data extraction.
QSgml – The appealing thing about this class was its set of features for traversing the DOM in search of elements matching a specific query. Your code can easily do something like:
QList<QSgmlTag*> tags;   // receives the matching elements
QSgml doc;
doc.parse(html);
doc.getElementsByName("a", &tags);
This code collects all the <a> elements in the document into the QList tags. Even better, QSgml::getElementsByName provides overloads that let you select elements with specific attributes, and attributes with specific values.
However, QSgml often breaks when the parsed HTML has errors such as a mismatched closing tag:
<html>
<body>
Happy <b><i>birthday</b>!!! Click <a href="…">here</a>!!</i>
</body>
</html>
With the above HTML, any use of QSgml::getElementsByName always returns an empty list. Even worse, there is no way to tell that QSgml doesn’t like the HTML, since QSgml::parse still returns true.
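Since the parser gives no signal, one workaround is to check the markup yourself before trusting an empty result. The following is only a minimal sketch of a stack-based check for mismatched closing tags. The function tagsAreBalanced is a hypothetical helper, not part of QSgml or Owl, and it is deliberately naive: it ignores the contents of comments and scripts, assumes attribute values contain no '>' characters, and treats only a handful of void elements specially. Valid HTML that omits optional closing tags will also be reported as unbalanced.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of QSgml): returns true only when every
// closing tag matches the most recently opened tag and nothing is left open.
bool tagsAreBalanced(const std::string& html)
{
    // Assumption: a small, incomplete list of elements that never take a closing tag.
    static const std::vector<std::string> voidTags = {
        "br", "hr", "img", "input", "meta", "link"};

    std::vector<std::string> open;  // stack of currently open tag names
    std::size_t pos = 0;

    while ((pos = html.find('<', pos)) != std::string::npos) {
        const std::size_t end = html.find('>', pos);
        if (end == std::string::npos) {
            return false;  // unterminated tag
        }
        const std::string body = html.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        // Skip doctype, comments and processing instructions.
        if (body.empty() || body[0] == '!' || body[0] == '?') {
            continue;
        }

        const bool closing = body[0] == '/';
        const bool selfClosing = body.back() == '/';

        // Extract the lowercased tag name, stopping at whitespace or '/'.
        std::string name;
        for (std::size_t i = closing ? 1 : 0; i < body.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(body[i]);
            if (std::isspace(c) || c == '/') {
                break;
            }
            name += static_cast<char>(std::tolower(c));
        }
        if (name.empty()) {
            return false;
        }

        if (closing) {
            if (open.empty() || open.back() != name) {
                return false;  // mismatched closing tag
            }
            open.pop_back();
        } else if (!selfClosing &&
                   std::find(voidTags.begin(), voidTags.end(), name) == voidTags.end()) {
            open.push_back(name);
        }
    }
    return open.empty();
}
```

For the birthday example above, the check fails at </b>, because <i> is still the innermost open element at that point. A document that fails this check is a candidate for a repair pass before it is handed to a strict parser.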
Parsing with QSgml also proved to be quite slow by comparison, with a 100k document taking nearly 600 milliseconds to parse.
HTML Tidy – This library turns bad HTML into “good” HTML and can even convert HTML into valid XHTML. The library is relatively easy to use and is easily configurable. The only downside was the size of the library.
Whereas QSgml is four files in total (two .h and two .cpp files), HTML Tidy consists of more than 20 files. Luckily, importing the source into Owl’s existing CMake projects proved to be relatively straightforward, though somewhat tedious.
The results, however, were great! Converting a 100k HTML document into XHTML averaged about 40-50 milliseconds. This is even more impressive when considering that HTML Tidy also builds a DOM tree while it’s doing its “clean and repair”. QSgml, on the other hand, took nearly 600 milliseconds just to parse the same document. Unfortunately, the query features available in HTML Tidy’s DOM were not as strong as those in QSgml.
tinyxml2 – Because HTML Tidy can emit valid XHTML, its output can be handed to an XML parser for querying. The initial tests with tinyxml2 showed that parsing a ~100k XML document took roughly 150 milliseconds. Adding the time HTML Tidy needs to “clean and repair”, the combined process averaged about 200-250 milliseconds for a 100k document.
Owl currently uses all three libraries, though only the latter two are truly necessary. Future versions will remove QSgml entirely and use a combination of HTML Tidy and tinyxml2.
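For anyone wanting to reproduce this kind of comparison, here is a rough sketch of how such averages could be measured with std::chrono. The helper averageMillis is a hypothetical name and is not how Owl's numbers were necessarily gathered. It simply runs a callable several times and reports the mean wall-clock time.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` `runs` times and returns the mean
// wall-clock duration of one run, in milliseconds.
template <typename Work>
double averageMillis(Work&& work, int runs = 10)
{
    using Clock = std::chrono::steady_clock;  // monotonic, suited to intervals

    if (runs <= 0) {
        return 0.0;  // nothing to measure
    }

    const auto start = Clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;

    return elapsed.count() / runs;
}
```

The callable would wrap whichever step is being measured, for example a lambda that runs the clean-and-repair pass and the XML parse on the same input buffer. Using the same document and the same number of runs for each library keeps the comparison fair.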
For any projects that do not necessarily need to maintain the byte-for-byte structure of HTML, using a combination of HTML Tidy and tinyxml2 might be the best route.