Ian Bicking: the old part of his blog

Re: XML Processing

When you run tidy on the way in, you need to use "-n" (numeric entities) and "-asxml". ElementTree's XML serializer isn't well suited to producing output for a tag soup parser (HTML needs special treatment of many tags), so you need to grab an HTML serializer for ET. There's a nice one in Kid.

Alternatively, you can use tidy on the way out too; feeding the XML through "tidy -xml" should work.
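
For illustration, here is a rough sketch of how the two tidy invocations might be driven from Python. It assumes a "tidy" executable is on the PATH; the run_tidy helper and the TIDY_IN/TIDY_OUT names are made up for this example, and the "-q" and "-utf8" flags are additions of the sketch, not part of the advice above.

```python
import subprocess

# Flags for cleaning HTML on the way in: numeric entities, XHTML output.
# "-q" (quiet) and "-utf8" are assumptions added for this sketch.
TIDY_IN = ["tidy", "-q", "-utf8", "-n", "-asxml"]

# Flags for passing serialized XML through tidy on the way out.
TIDY_OUT = ["tidy", "-q", "-utf8", "-xml"]


def run_tidy(markup, args):
    """Pipe a string through tidy and return the cleaned-up output.

    Hypothetical helper: tidy exits with 0 when clean, 1 when it only
    issued warnings, and 2 when it hit errors, so only 2 and above is
    treated as a failure here.
    """
    result = subprocess.run(
        args,
        input=markup.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode > 1:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

Something like run_tidy(page_source, TIDY_IN) would then give a string that can be handed to an XML parser, and run_tidy(serialized_xml, TIDY_OUT) would cover the way out.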

Comment on XML Processing
by Fredrik


I looked at Kid's, and was a little confused by the way it iterates over events/tokens; I wasn't really clear on what that internal data structure was. I ended up creating an ElementTree subclass, HTMLTree, in dumbpath. It might leave out things that Kid does, but it mostly makes sure that empty elements don't get /> and that all other elements use both opening and closing tags. It also strips namespaces.
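
To make those three rules concrete, here is a minimal sketch of that kind of serializer. It is not the HTMLTree code from dumbpath (it is a plain function, not an ElementTree subclass), and the names serialize_html, strip_ns and EMPTY_ELEMENTS are invented for the example. The set of empty elements is assumed from HTML 4, and script/style content gets no special treatment here.

```python
import xml.etree.ElementTree as ET
from html import escape

# Elements that HTML 4 defines as empty (assumed list for this sketch).
EMPTY_ELEMENTS = {
    "area", "base", "br", "col", "hr", "img",
    "input", "link", "meta", "param",
}


def strip_ns(name):
    """Drop an ElementTree-style '{namespace}' prefix, if there is one."""
    return name.rsplit("}", 1)[-1]


def serialize_html(elem):
    """Serialize an element tree as HTML rather than XML.

    Empty elements are written as '<br>' with no '/>' and no end tag;
    every other element gets an explicit start and end tag, even when
    it has no content. Namespaces are stripped from tags and attributes.
    """
    tag = strip_ns(elem.tag)
    attrs = "".join(
        ' %s="%s"' % (strip_ns(key), escape(value, quote=True))
        for key, value in elem.attrib.items()
    )
    parts = ["<%s%s>" % (tag, attrs)]
    if tag not in EMPTY_ELEMENTS:
        if elem.text:
            parts.append(escape(elem.text, quote=False))
        for child in elem:
            parts.append(serialize_html(child))
            # Text following a child belongs to the child's tail.
            if child.tail:
                parts.append(escape(child.tail, quote=False))
        parts.append("</%s>" % tag)
    return "".join(parts)


doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>a &amp; b<br/>c</p><div/></body></html>"
)
print(serialize_html(doc))
```

Note how the self-closed div in the input comes out with a separate start and end tag, while br loses its slash.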

An HTML serializer would be a nice addition to elementtidy, since reading and writing HTML are operations that often go together.

# Ian Bicking