Ian Bicking: the old part of his blog

XML Processing

So, I was trying to improve Commentary with respect to its HTML processing. I was parsing the incoming page with tidy, reading it with ElementTree and then writing it back out. But tidy was leaving some entities in there that Expat didn't understand, and ElementTree was outputting XML that doesn't look like HTML. I think I have it fixed (maybe), but it was all much harder than it should have been.

I first started with minidom (or PyXML?) because I wanted to implement the same algorithms in Javascript, so I preferred a DOM interface. This worked for a little while, but from what I can tell minidom is just really really broken. getElementById didn't work, and after looking at the code it seemed like it couldn't work given the way I was creating the DOM, which was the only way I saw to create the DOM -- there's basically no useful documentation for that module, which is a problem when it doesn't work as claimed. Then later I was getting a problem with inserting nodes, because you can't insert a document-type node (only elements and text nodes and whatnot)... except from what I could tell of the source it was just utterly and completely wrong, and was testing the node type against ELEMENT_TYPE. These are such glaringly obvious errors that I didn't know what to make of them; did I completely not understand what I was doing? Is this code just completely abandoned and unloved?

Anyway, I felt okay about the algorithm by that time, worked around the problems, and then reimplemented using ElementTree. This introduced some problems, because ElementTree doesn't use a model much like the DOM. In the DOM every node knows about its siblings, parent, etc. Elements in ElementTree don't know about any of that (which is conventional in Python and most languages: an object doesn't know about its container). But that was inconvenient, so I had to make a wrapper to give me access to that information.
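One common way to get at parent and sibling information is a parent map built by walking the tree once. This is only a minimal sketch of that idea, not the wrapper Commentary actually uses; build_parent_map and next_sibling are hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element below root to its parent (root itself has no entry)."""
    return {child: parent for parent in root.iter() for child in parent}


def next_sibling(element, parent_map):
    """Return the element following `element` under the same parent, or None."""
    parent = parent_map.get(element)
    if parent is None:
        return None  # the root (or an element not in the map) has no siblings
    siblings = list(parent)
    index = siblings.index(element)
    return siblings[index + 1] if index + 1 < len(siblings) else None


root = ET.fromstring("<div><p>one</p><p>two</p></div>")
parents = build_parent_map(root)
first, second = root  # the two <p> children
```

The map is a snapshot, so it goes stale as soon as nodes are inserted or removed and has to be rebuilt (or updated by hand) after the tree changes.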

Then there's the issue that there's no code I know of that knows how to parse HTML (HTMLParser does, of course, but not in a useful way -- it doesn't create a tree). So everyone uses Tidy to normalize their code to XHTML, which works but feels really sloppy. HTML is parseable; in this case, I really only wanted to parse well-formed HTML anyway. Then, finally, there's no builtin way to serialize ElementTree to HTML from what I can find. There are some hints, but they still leave you with empty elements like <a name="foo" />, which browsers do not like. I had to clone a write method in ElementTree and make edits to it.
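For comparison, later versions of ElementTree (the xml.etree.ElementTree module in the standard library) accept method="html" when serializing, which was not an option when this was written. A small sketch of the difference, assuming such a version:

```python
import xml.etree.ElementTree as ET

# An empty anchor and a void element, written XML-style
root = ET.fromstring('<p><a name="foo" /><br />text</p>')

# Default XML serialization keeps the self-closing form for both elements
as_xml = ET.tostring(root, encoding="unicode")

# HTML serialization writes explicit close tags for non-void elements
# and leaves void elements such as <br> unclosed
as_html = ET.tostring(root, encoding="unicode", method="html")

print(as_xml)
print(as_html)
```

This covers the empty-element problem only; it does nothing about namespaces or about parsing HTML in the first place.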

I have to say, Javascript and the DOM in the browsers are much easier to use for HTML processing, even taking into account the fact that it's Javascript.

Created 14 Dec '05

Comments:

When you run tidy on the way in, you need to use "-n" (numeric entities) and "-asxml". ElementTree's XML serializer isn't well suited for a tag soup parser (HTML needs special treatment of many tags), so you need to grab an HTML serializer for ET. There's a nice one in Kid.

Alternatively, you can use tidy on the way out too; feeding the XML through "tidy -xml" should work.

# Fredrik

I looked at Kid's, and was a little confused by the iterating over events/tokens that it used. I wasn't really clear what that internal data structure was. I ended up creating an ElementTree subclass HTMLTree in dumbpath. It might leave out things that Kid does, but it mostly makes sure that empty elements don't get /> and that all other elements use both opening and closing tags. And it strips namespaces.

An HTML serializer would be a nice addition to elementtidy, since reading and writing HTML are operations that often go together.

# Ian Bicking

maybe BeautifulSoup is what you need: http://www.crummy.com/software/BeautifulSoup/

# Lawrence Oluyede

AFAIK, BeautifulSoup structures aren't writable, which is what I'm doing -- parsing HTML, modifying the parsed form (adding comments), then writing it out again.

# Ian Bicking

You can change BeautifulSoup structures. For example, you can insert raw html fragments:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><p> text 1 <p> text 2 </html>')
>>> print soup
<html><body><p> text 1 </p><p> text 2 </p></body></html>
>>> par2 = soup('p')[1]
>>> par2.name = 'div'
>>> par2.contents = ['<p>'] + par2.contents + ['</p>']
>>> print soup
<html><body><p> text 1 </p><div><p> text 2 </p></div></body></html>

# Alexander Kozlovsky

And yes, if the HTML isn't too obnoxious, the HTMLTreeBuilder module might help (and improvements to that module are welcome).

# Fredrik

It looks like Meld3 includes an HTML parser now (and it uses ET internally); probably a good candidate for extraction at some point: http://www.plope.com/software/meld3/

# Ian Bicking

lxml's implementation of ElementTree provides an extension to the ElementTree API that allows you to get the parent. (at least in svn).

One wishlist item is to expose libxml2's HTML parser to lxml somehow. Volunteers are welcome. :) We already have a patch lying around implementing serialization support. I should find the time to review/integrate it all and prepare another release...

# Martijn Faassen

Take a look at this document for some coverage of HTML processing using different XML parsers (including PyXML):

http://www.boddie.org.uk/python/HTML.html

# Paul Boddie

getElementById is tricky:

from xml.dom import minidom

# No DTD: the parser cannot know that "id" is an ID attribute
s = '<?xml version="1.0"?><foo><bar id="1" /></foo>'
doc = minidom.parseString(s)
assert doc.getElementById('1') is None

# The internal subset declares "id" as an ID attribute on bar
s = '<?xml version="1.0"?><!DOCTYPE quote [ <!ATTLIST bar id ID #IMPLIED> ]> <foo><bar id="1" /></foo>'
doc = minidom.parseString(s)
assert doc.getElementById('1') is not None

You've basically got to load in the HTML DTD if you expect getElementById to work.
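If loading a DTD isn't practical, one workaround is to skip getElementById and search on the attribute directly. A minimal sketch of that; find_by_id_attribute is a hypothetical helper, not part of minidom:

```python
from xml.dom import minidom


def find_by_id_attribute(node, value, attr="id"):
    """Depth-first search for the first element whose `attr` equals `value`."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            if child.getAttribute(attr) == value:
                return child
            found = find_by_id_attribute(child, value, attr)
            if found is not None:
                return found
    return None


# Same DTD-less document as above, where getElementById finds nothing
doc = minidom.parseString('<?xml version="1.0"?><foo><bar id="1" /></foo>')
bar = find_by_id_attribute(doc, "1")
```

This walks the whole tree on every call, so it is slower than a real ID lookup, but it behaves the same with or without a DOCTYPE.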

# Stephen Thorne

That, um, sucks. Geez... the only reason the DOM seems useful to me is that it is implemented in browsers. I'm sure it's implemented and widely used elsewhere (I guess, I don't actually hear people talking about it), but the primary implementation in my mind has always been browsers. Or raise an exception when getElementById can't return a meaningful value. At least it should indicate in the documentation how you make it act like the browser's implementation. But eh... ET is much more predictable and seems to have relatively few intricacies. And it's going to be in the standard library (w00t!), so I'll probably just choose to forget that xml.dom.minidom even exists.

# Ian Bicking

You probably already ran across it, but Fredrik Lundh also has TidyHTMLTreeBuilder, which conveniently wraps up calling tidy on some HTML and returns an ElementTree tree.

# Ed Summers

"""I have to say, Javascript and the DOM in the browsers are much easier to use for HTML processing, even taking into account the fact that it's Javascript."""

It'll be even easier once all browsers support E4X :) http://en.wikipedia.org/wiki/E4X

# Sylvain