Ian Bicking: the old part of his blog

Re: lxml Transformations

html5lib is probably a better bet than lxml, simply because it's pure Python and hence easier to install. I'm not sure what your problems with HTMLParser are, but I imagine html5lib fixes them.

http://code.google.com/p/html5lib/

Comment on lxml Transformations
by Simon Willison

Comments:

On Linux systems lxml is generally very easy to install. They also distribute compiled eggs for a variety of platforms.

I don't know about the particulars of how the libxml2 parser works, but it seems to do quite well -- I've had very few problems. html5lib is probably more consistent and reliable, though I would be surprised if libxml2 didn't also pursue all the same techniques. We have had some threading problems with lxml, which have been quite difficult, though I think they only arose when we started sharing the documents between threads.

In terms of speed, lxml is going to beat html5lib easily, since there are many high-level operations written in C. It also gives you a document, not just a parser. Passing this document around will be both simpler and faster than parsing/rewriting/serializing a page. I also find the document more convenient than ElementTree, mostly because nodes know about their parents.
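To illustrate the point about parents: in the standard library's ElementTree an element holds no reference to its parent, so you have to build that mapping yourself before you can do something like remove a node. The following is only a minimal sketch. The markup and the `build_parent_map` helper are made up for illustration. With lxml the equivalent lookup would be `link.getparent()`, with no separate map to maintain.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Hypothetical helper: map every element to its parent.

    The root element has no parent, so it never appears as a key.
    """
    return {child: parent for parent in root.iter() for child in parent}


# Illustrative markup only; any well-formed document would do.
doc = ET.fromstring(
    "<html><body><div id='nav'><a href='/old'>link</a></div></body></html>"
)

parents = build_parent_map(doc)

link = doc.find(".//a")
container = parents[link]   # the enclosing <div>, found via the map
container.remove(link)      # removing a node has to go through its parent
```

The map goes stale as soon as the tree is modified, which is part of why a document whose nodes track their own parents is more pleasant to rewrite.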

If I were writing a spider, web client, or something else that didn't parse content during a web request, html5lib would be more attractive. For this use case -- parsing and rewriting on every request -- I'd be worried about the performance.

# Ian Bicking