Ian Bicking: the old part of his blog

lxml Transformations

As a follow-up to my previous post on form generators, I thought I'd note some implementation details of the pipeline approach I advocate there.

So, one thing I proposed is that we have some notion of requiring some Javascript or CSS from within an HTML element. Let's say it looks like: <input type="date" js-require="DateSelect">. There's a function find_library_url(name, type), called in this case like find_library_url('DateSelect', 'js'), and it returns something like "http://localhost:8080/static/js/DateSelect.js" (we won't worry about how it is implemented).

Here's how you could do this transformation using lxml:

from lxml import etree
from urlparse import urljoin

def resolve_js_require(doc, doc_url, find_library_url):
    if isinstance(doc, basestring):
        doc = etree.HTML(doc)
    script_hrefs = set()
    for el in doc.xpath('//*[@js-require]'):
        name = el.attrib['js-require']
        del el.attrib['js-require']
        url = find_library_url(name, 'js')
        script_hrefs.add(url)
    # Check that we aren't duplicating any explicit <script> tags:
    for el in doc.xpath('//script[@src]'):
        url = urljoin(doc_url, el.attrib['src'])
        if url in script_hrefs:
            script_hrefs.remove(url)
    try:
        head = doc.xpath('//head')[0]
    except IndexError:
        # No <head>
        head = etree.Element('head')
        doc.insert(0, head)
    # Add in the <script> tags:
    for url in script_hrefs:
        el = etree.Element('script')
        el.attrib['type'] = 'text/javascript'
        el.attrib['src'] = url
        head.append(el)
    return doc
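To show the shape of the transformation end to end, here's a self-contained sketch of the same idea using the stdlib's xml.etree.ElementTree as a stand-in for lxml (so it runs without lxml installed; ElementTree lacks full XPath, so it iterates the tree directly). The find_library_url stub is hypothetical, just mapping a name onto a /static/ URL:

```python
import xml.etree.ElementTree as etree
from urllib.parse import urljoin

def find_library_url(name, type):
    # Hypothetical stub: map a library name onto a /static/ URL.
    return 'http://localhost:8080/static/%s/%s.%s' % (type, name, type)

def resolve_js_require(doc, doc_url):
    script_hrefs = set()
    # Collect and strip js-require attributes anywhere in the tree.
    for el in doc.iter():
        if 'js-require' in el.attrib:
            name = el.attrib.pop('js-require')
            script_hrefs.add(find_library_url(name, 'js'))
    # Don't duplicate scripts the page already links explicitly.
    for el in doc.iter('script'):
        if 'src' in el.attrib:
            script_hrefs.discard(urljoin(doc_url, el.attrib['src']))
    head = doc.find('head')
    if head is None:
        # No <head>; create one at the top of the document.
        head = etree.Element('head')
        doc.insert(0, head)
    # Add in the <script> tags (sorted, so output is deterministic).
    for url in sorted(script_hrefs):
        etree.SubElement(head, 'script', type='text/javascript', src=url)
    return doc

doc = etree.fromstring(
    '<html><head/><body>'
    '<input type="date" js-require="DateSelect" />'
    '</body></html>')
resolve_js_require(doc, 'http://localhost:8080/')
print(etree.tostring(doc, encoding='unicode'))
```

The js-require attribute disappears from the input element and a single script tag for DateSelect.js lands in the head; feeding it a page that already links the script explicitly leaves just the one reference.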

Extending this for CSS is hopefully obvious. You can see the code with an explanatory doctest in my recipe repository. Hopefully this example seems easy enough that people will see the benefit of the technique. lxml is a great library, and would be great for cleaning up the implementation of things like the HTMLParser-based monstrosity of htmlfill. (Note: do not use HTMLParser, it's not worth the effort.)

Created 15 Apr

Comments:

This approach looks very interesting, I like it. I gather that this could be implemented in a layer of middleware whose sole responsibility would be to handle static resources' dependencies, insert their links into <head> and, possibly, serve them too, right?

I think it would be useful to agree on a protocol for passing the element stream through the stack, so it plays nicely with other XML-transforming middleware and the markup only needs to be parsed and serialized once, with any number of transformations in between (Deliverance, etc.). I remember some discussion along these lines on TG's trunk mailing list some time ago.

ToscaWidgets could piggyback on this to handle its static resource needs. Which brings up the idea of some sort of central registry mapping js-require (and css-require) attribute values to files on the filesystem, so any library used by a downstream app could register its resources to have them served and their URLs generated. Maybe something like this?:

from pkg_resources import resource_filename

class MochiKitIter(object):
    type = "javascript"
    depends_on = ["MK-Base", ...]
    location = resource_filename(__name__, "static/Iter.js")

with a foobar.resource_provider entrypoint that mapped js-require="MK-Iter" to this object.
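Fleshed out as a minimal in-process sketch (provider classes, names, and the URL scheme here are all hypothetical; a real version would hang off an entry point group and dedupe shared dependencies):

```python
# A plain-dict stand-in for the central registry: provider classes
# describe a resource and its dependencies, and find_library_urls
# walks them, dependencies first.
_registry = {}

class MKBase(object):
    type = 'js'
    depends_on = []
    location = '/path/to/MochiKit/Base.js'  # resource_filename(...) in practice

class MKIter(object):
    type = 'js'
    depends_on = ['MK-Base']
    location = '/path/to/MochiKit/Iter.js'

_registry['MK-Base'] = MKBase
_registry['MK-Iter'] = MKIter

def find_library_urls(name, type):
    """Return the URLs for `name`, with its dependencies first."""
    provider = _registry[name]
    urls = []
    for dep in provider.depends_on:
        urls.extend(find_library_urls(dep, type))
    urls.append('http://localhost:8080/static/%s/%s.%s' % (type, name, type))
    return urls

print(find_library_urls('MK-Iter', 'js'))
```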

It would also be a natural place to implement the "js/css packaging" ideas discussed here to minimize hits on the server.

Alberto

# Alberto

This approach looks very interesting, I like it. I gather that this could be implemented in a layer of middleware whose sole responsibility would be to handle static resources' dependencies, insert their links into <head> and, possibly, serve them too, right?

Yes, that's possible. I would be reluctant to put too much in middleware, but this particular case requires very little context. Combining a server with find_library_url would also make sense. Or you could have a configured system that copies resources into a different location, e.g., where they are served by Apache or some faster server, and then make find_library_url know about that. That there's multiple useful implementations is part of what makes me like the design.

I think it would be useful to agree on a protocol for passing the element stream through the stack, so it plays nicely with other XML-transforming middleware and the markup only needs to be parsed and serialized once, with any number of transformations in between (Deliverance, etc.). I remember some discussion along these lines on TG's trunk mailing list some time ago.

I have been exploring this some in HTTPEncode, but it's been somewhat difficult. There has been some progress since then (not a lot, but some): after some thought I wrote up a strategy for it using WSGI, but no one on the Web-SIG list seemed very interested. I'll probably just move the implementation back into HTTPEncode when I have time.

ToscaWidgets could piggyback on this to handle its static resource needs. Which brings up the idea of some sort of central registry mapping js-require (and css-require) attribute values to files on the filesystem, so any library used by a downstream app could register its resources to have them served and their URLs generated.

Clearly the nature of that naming becomes pretty important; you want to avoid clashes, but seek out correct overlap. I'm more inclined to use something like OpenJSAN, as it represents a kind of central repository with unique names. Another option is to use some kind of URL/URI, which generally speaking is probably better.

It often annoys me that entry points can't be applied to non-Python objects, like a directory or resource in the package. It might be nice to use a URI, but actually tell the system that the content is provided locally. Then you get better chance of overlap (if two packages require MochiKit, for instance), but you don't make the installation process more complicated (by actually requiring the install to fetch the libraries, for instance).

If going down this track, it would be nice to provide some setuptools extensions for it. E.g., python setup.py refresh_javascript, which would read [refresh_javascript] in setup.cfg and update the package, and maybe (not sure) put another file in .egg-info.
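For concreteness, such a setup.cfg section might look something like this (the section and option names are purely hypothetical; nothing like this exists yet):

```ini
[refresh_javascript]
; libraries to fetch or refresh, and where to put them in the package
libraries = MochiKit DateSelect
target = static/js
```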

# Ian Bicking

(Note: do not use HTMLParser, it's not worth the effort.)

Um, why?

# Christopher Lenz

It's finicky and difficult to program against. The streaming model doesn't make sense for this set of problems either. And with HTML as it's frequently found in the wild, it fails in lots of ways. You may not encounter these problems in testing, but they can surface later as you expand the sources of content you are dealing with.

# Ian Bicking

html5lib is probably a better bet than lxml, simply because it's pure Python and hence easier to install. I'm not sure what your problems with HTMLParser are, but I imagine html5lib fixes them.

http://code.google.com/p/html5lib/

# Simon Willison

On Linux systems lxml is generally very easy to install. They also distribute compiled eggs for a variety of platforms.

I don't know about the particulars of how the libxml2 parser works, but it seems to do quite well -- I've had very few problems. html5lib is probably more consistent and reliable, though I would be surprised if libxml2 didn't also pursue all the same techniques. We have had some threading problems with lxml, which have been quite difficult, though I think they only arose when we started sharing the documents between threads.

In terms of speed lxml is going to beat html5lib easily, since many of its high-level operations are written in C. It also gives you a document, not just a parser; passing that document around will be both simpler and faster than parsing, rewriting, and serializing a page at each step. I also find its document model more convenient than ElementTree's, mostly because nodes know about their parents.

If I were writing a spider, a web client, or something else that didn't parse content during a web request, html5lib would be more attractive. For this use case -- parsing and rewriting on every request -- I'd be worried about the performance.

# Ian Bicking

I wonder if you've bumped into Twiddler at all?

http://www.simplistix.co.uk/software/python/twiddler

I've been thinking about getting lxml into the mix with it for some time. The important bits of Twiddler for me are:

I'd be very interested in your thoughts on it, and how it fits with your pipeline generation model...

I hope this blog mails me if you reply, I'd hate to miss an interesting discussion...

# Chris Withers