Ian Bicking: the old part of his blog

Re: My first bit of ElementTree

"htmlfill would be a lot simpler and more reliable if it didn't use HTMLParser"

Guess it's probably obvious to those who have already realised it, but it only recently occurred to me that parsing HTML could be a lot easier (particularly when it comes to preserving things like whitespace) with a specific parser that picks out only what you need and regards the rest of the document as "just text", vs. a generic HTML parser that is aware of the complete vocabulary.

In PHP there's an excellent lexing tool in SimpleTest which uses "parallel regular expressions" - http://cvs.sourceforge.net/viewcvs.py/simpletest/simpletest/parser.php?view=markup - which would suit the job. I don't know so well what's available for lexing in Python, but it seems like SimpleParse might do the job.

Comment on My first bit of ElementTree
by Harry Fuecks


That's kind of what htmlfill does -- it lets HTMLParser parse the tags, but it just echoes out all the parts in between the elements it cares about. There's a problem with it eating newlines, but otherwise it seems to work fine. BeautifulSoup is another HTML parser that works on a fairly low level.
# Ian Bicking

Sorry - explained myself badly.

Was referring to the process of lexing the raw text in the first place. Rather than using characters like > and < to find tokens, as is common in most HTML parsers (HTMLParser and sgmllib seem to do this), look for specific tags by name while treating all else as uninteresting plain text, even though it may contain HTML tags we're not interested in. In this case it might amount to some fairly simple regular expressions.

# Harry Fuecks
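
The idea in the last comment, looking for specific tags by name and treating everything else as plain text, might be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names `make_tag_lexer` and `lex` are made up for this example and are not part of htmlfill, SimpleTest, or SimpleParse, and the pattern is deliberately naive (it would be fooled by a ">" inside an attribute value, and it does not skip comments or script blocks).

```python
import re


def make_tag_lexer(tag_names):
    """Compile a regex matching only start tags with the given names.

    Hypothetical helper for illustration. The lookahead requires whitespace,
    "/" or ">" right after the name, so "input" does not match "<inputs>".
    """
    names = "|".join(re.escape(name) for name in tag_names)
    return re.compile(
        r"<(?P<name>%s)(?=[\s/>])(?P<attrs>[^<>]*)>" % names,
        re.IGNORECASE,
    )


def lex(text, tag_names):
    """Yield ("text", s) and ("tag", s) tokens.

    Joining the token strings in order reproduces the input exactly,
    so whitespace and unrelated markup are never altered.
    """
    pattern = make_tag_lexer(tag_names)
    pos = 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            # Everything between interesting tags is opaque text,
            # even if it contains other HTML tags.
            yield ("text", text[pos:match.start()])
        yield ("tag", match.group(0))
        pos = match.end()
    if pos < len(text):
        yield ("text", text[pos:])


source = '<p>\n  Name: <INPUT type="text" name="who">\n</p>\n'
tokens = list(lex(source, ["input"]))
```

Because the `<p>` tags and the newlines are never tokenized, they pass through untouched; only the `<INPUT ...>` tag comes back as a separate token that a form filler could rewrite.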