Ian Bicking: a blog

About

2007-07-31T15:53:00-05:00

Hi, I’m Ian Bicking. I work at The Open Planning Project. I do lots of programming in Python. I keep a list of projects I participate in, though I don’t always update it. Lots of stuff in svn.colorstudy.com and svn.pythonpaste.org is written by me …

New Blog Software

2007-08-01T12:02:00-05:00

I’ve switched my software over to WordPress. This was long overdue, as anyone who ever wanted to read anything at all on this site probably knows. Sometime I should really write an article reflecting on the failures of my previous blog software. Let’s just say that flat files aren’t so hot either.

Now that my software doesn’t suck, I have lots of posts I have been embarrassed to write because every new post potentially introduced new people to my crappy site.

Hopefully everything is set up correctly: redirects, archives, and the new feed.

My one worry is WordPress comments, which suck a bit. They shouldn’t collect the horrible quantity of spam that the old site has, so that’s good, but I hate disconnected streams of comments. I’ve tried to modify the theme on this site to be more roomy, with less of the excessive whitespace that has become the norm. I hope this whitespace kick goes the way of Creating Killer Websites Using Table Based Layout. I.e., it’ll soon look dated and everyone will move on. So I hope you’ll have more than two inches of width to comment in. Honestly I wonder if I should just ditch WordPress comments and use something else entirely, like some kind of forum software and rig in some way of including the comments in the theme. I wanted to install threaded comments, but the installation process is rather invasive.

For editing I turned TinyMCE off (ugh), and installed a restructured text plugin. It took a while to figure out, since I have to include .. -*- mode: rst -*- in the header of each post. Oh well, a minor inconvenience. I used Text Control to enable Markdown in comments, but I had to replace the actual markdown.php it used, which was broken.

Old Archives

2007-08-01T21:46:00-05:00

Atom Models

2007-08-02T10:30:00-05:00

I’ve been doing a bit more with Atom lately.

First, I started writing a library to manipulate Atom feeds and entries. For the moment this is located in atom.py. It uses lxml, as does everything markup related I do these days.

I came upon a revelation of sorts when I was writing the library. I first started with a library that looked like this:

class Feed(object):
    def __init__(self, title, ...):
        self.title = title
        ...
    @classmethod
    def parse(cls, xml):
        if isinstance(xml, basestring):
            xml = etree.XML(xml)
        title = xml.find('{%s}title' % atom_ns).text
        ...
        return cls(title, ...)
    def serialize(self):
        el = etree.Element('{%s}feed' % atom_ns)
        title = etree.Element('{%s}title' % atom_ns)
        title.text = self.title
        el.append(title)
        ...
        return el

Obviously there’s ways to improve this and make it less verbose, and I went down that path for a while. But then I decided the whole path was wrong. Atom is XML. It’s not the representation of some object I’m creating. If I have something that can’t be represented in XML, it isn’t Atom, and it doesn’t belong in my Atom-related objects.

So instead I started making lxml more convenient when using Atom. I don’t keep any information except what is in the markup, I just make it more convenient to access that information.

I used lots of descriptors to do this, as the same patterns happened over and over. For instance, the Feed object is fairly simple:

class Feed(AtomElement):
    entries = _findall_property('entry')
    author = _element_property('author')

Which basically means that feed.entries returns all <entry> elements, and feed.author returns the single author element.

There’s also accessors for text elements (like <id>), for date-containing elements (like <updated>), and for exposing XML attributes as Python attributes.

There’s a number of advantages:

No hidden state.
No deferred errors, since everything is always represented in the XML infoset.
All XML extensions work, even though my classes don’t know anything in particular about them. There’s a full API for manipulating the XML that you can use, you don’t have to use my APIs.
Even more obscure kinds of extensions work fine, like a custom attribute on an element. There’s absolutely zero normalization that happens.
I only have to write the parts where the normal XML (lxml) APIs are inconvenient, so the implementation stays simple.
There’s no confusion over which object I might be talking about in my code. There’s no distinction between the XML object and the domain object.

Since then I’ve been working on a Javascript library for handling Atom. It’s not as elegant. I am trying to keep to this same principle, but of course I can’t actually extend the DOM and so I can’t add convenience methods. So instead I’m making a class that lightly wraps the DOM objects, with explicit getters and setters that simply read and modify those DOM objects.

One thing that I have found very useful in my development on the Javascript side is doctest-style testing. You can see the test, but to run it you have to check it out (it uses some svn:externals which you don’t get through the direct svn access). After using that testing some more and being pleased with the result, I decided to package the Javascript doctest runner a bit better. I removed the framework dependencies, did a bit of renaming (now it is doctestjs or doctest.js instead of jsdoctest), wrote up fairly comprehensive docs, and uploaded it to JSAN (though at the moment the trunk from svn is probably better to use). I think it’s an excellent way of doing unit testing in Javascript, much better than any of the alternatives I’ve seen. It even has some notable advantages over Python’s doctest, like if you are using Firebug (which you must if you do Javascript development) you get a console session that runs in the same namespace as your tests, so you can easily do inspection of the objects if there’s a failure.

I’m not sure about JSAN. It’s nice to have an index. But I think they copy stuff from CPAN a bit too much. Why should you have a text README file? That’s just silly; of course Javascript documentation should be HTML. They do batch processing. Processing one package a day on the fly shouldn’t be overwhelming. They want a MANIFEST file. The standard metadata file is YAML, not JSON. This should all be a little more Javascripty in my opinion. But they also accept any kind of upload, so there’s nothing stopping you from ignoring what you don’t care about. I’ll probably improve the packaging of doctestjs a bit in the future, and still ignore the parts I think are silly.

Environmental Theater

2007-08-02T19:38:00-05:00

If you read Bruce Schneier, as any good geek should, you probably are familiar with the term “security theater”: measures that provide the feeling of security while doing little or nothing to actually provide security.

OK, digression. We had this recycling program in Chicago where we put our recyclables in blue bags into the trash, and they pick the blue bags out of the trash. One imagines fancy computerized systems. In reality I think there’s just some people who watch trash go by on a conveyor belt.

This all seemed fishy, but I hate waste on principle so I dutifully recycled my trash, washed out containers, all that stuff. You’d sometimes hear an environmentalist criticize the program because there was little perceived benefit, and so people didn’t actually recycle much. The system seemed a little improbable to me too, but then I also realized that recycling is a balance and it’s easy to put more effort into recycling programs than is saved through the recycling itself. So maybe this was efficient, all things considered.

Then I learned that actually only 8% of recycling in blue bags is recovered. 92% of the time when I clean things out and put them carefully in their own container, I might as well have just thrown them away. This really pissed me off, because it made it obvious that there never was an honest attempt to reduce waste through recycling. Blue bags were just what they would give people to make them stop complaining about recycling.

The irony is that the environmentalists didn’t complain about the recovery rates (which always were estimated at a low amount). They complained about how many people were recycling. Of course with a recovery rate that low it didn’t matter how many people were recycling. The entire program was a total farce. Now that the program is going away there doesn’t seem to be much anger about how deceptive the program was, and I don’t know if anyone is paying attention to the actual environmental impact of the new program.

Even if they recover the recycling it might still just be a game. Recycling is filled with farce. Metal recycling is great. That’s why there’s trucks that roam the alleys around Chicago looking for scrap metal. There’s a market and someone is willing to pay for the results. There’s not much of a market for anything else; maybe some glass, maybe a little plastic.

People actually get angry when recycling programs restrict the plastics they will take. It doesn’t occur to them that some plastics are simply garbage. They are worthless, and moving them around in special recycling containers just wastes everyone’s time. They are angry because they want to pretend they aren’t being wasteful. They aren’t getting enough environmental theater.

A more concerning kind of environmental theater is ethanol. With an EROI (energy returned on energy invested) that hovers just above one, it’s not helping the environment. Biofuels on the whole seem quite questionable. Brazil has more efficient ethanol, but it’s paired with deforestation. A similar thing happens when trees for palm oil replace natural forests. And of course in all these cases, if plants weren’t grown for fuel then plants would be grown for some other purpose. So I can’t really see any advantage in terms of CO2 emissions, and when you consider the relative inefficiency compared to attaining fossil fuels, the net effect of biofuels is probably worse.

Now that environmental concern is mainstream I think we need to be on the watch for environmental theater. Many of the people who play their parts in this theater are well meaning, which can make it awkward. These are people who believe that The Important Thing Is To Raise Awareness. But awareness has been raised, so the time for that kind of bullshit is past. Lying about solutions, exaggerating specific problems, being fuzzy about facts — that’s always been bullshit, and I’ve never found it acceptable. But it’s unfortunately become the norm among advocates of all sorts in these times. The irony is that the advocacy has been done, the case has been made, enough people are convinced, but it may be hard to move beyond the theater to meaningful action. Especially as the well-meaning people are replaced with cynics out to make money.

Pronouncing “Django”

2007-08-02T19:53:00-05:00

I’m not saying this to anyone in particular, but I’ve heard people pronounce Django incorrectly way too often. The “dj” in Django is a hard J, like in the word “jury” or “jolly”. You don’t pronounce the D.

Update: Alex Limi tells me I’m wrong too, and it’s a soft J, like… damn, I can’t think of a word that uses a soft J in English.

I’m not sure I can use that pronunciation, I’m afraid I’ll sound all Frenchy and weird. I’ll give it a go. Zhango zhango zhango… hmm…

Another update: confirming my original pronunciation, Adrian says it is a hard J. Alex is just too European for his own good. Does the debate rage on? Hopefully not.

Fast CGI that isn’t FastCGI

2007-08-03T15:33:00-05:00

There’s a bunch of techniques for doing deployments of long-running processes (Zope, Python server, Rails, etc). A pretty good technique is to do HTTP proxying. There’s some details and conventions I’d like to see for HTTP, but that’s not my concern here.

HTTP proxying isn’t great for commodity hosting. Mostly you need to set up a new long-running process, and commodity hosts don’t make that easy or reliable. FastCGI offers one solution to that, essentially putting the process management into Apache or whatever web server you are using.

The problem with FastCGI is that it is finicky. There’s lots of configuration parameters, lots of parts don’t work right, and there seems to be a golden path where things actually work but it’s hard to know exactly what that is.

Another technique that has been used in the past instead of FastCGI is a very small CGI script. One example in SCGI is called cgi2scgi. This small script is fast to run (it compiles to 12kb), and all it does is take the CGI request and turn it into a SCGI request to a long-running server.

This is a nice start, and easy to deploy, except it doesn’t handle long-running processes. A great feature to add to something like this would be simple process management. I imagine something where if the socket (named or a port) that the cgi2scgi script connects to isn’t up or working, it runs a script that will start the server. If another request comes in while the server is starting up, it shouldn’t try to start the server twice. If the server is randomly killed (as is common on commodity hosters) then the next request will try to bring the server up.

Unlike FastCGI, this won’t try to handle different process models or anything fancy. It’s up to the startup script to set everything up properly, start multiple worker processes if necessary, etc. There’s probably some tricky details I haven’t thought of, and it’s slightly annoying to write all this in C (but necessary, since it’s part of the CGI script, which must be small). But I think it can be done better than existing in-the-wild FastCGI implementations.
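The control flow described above can be sketched as follows. The post wants this logic in C inside the CGI binary; the Python here only illustrates the steps, and ensure_server, can_connect, the lock file and the probe parameter are all hypothetical names. It assumes a Unix host, since it uses fcntl.flock so that two simultaneous requests don't both start the server.

```python
import fcntl
import socket
import subprocess
import time


def can_connect(address, timeout=1.0):
    """Return True if something accepts connections on address."""
    try:
        sock = socket.create_connection(address, timeout=timeout)
    except OSError:
        return False
    sock.close()
    return True


def ensure_server(address, start_command, lock_path, wait=10.0,
                  probe=can_connect):
    """Make sure the long-running server is up, starting it if needed.

    Returns True once the server answers, False if it never came up.
    """
    if probe(address):
        return True
    with open(lock_path, 'w') as lock_file:
        # Blocks while another request is already starting the server.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        # Re-check: the request that held the lock may have started it.
        if not probe(address):
            subprocess.Popen(start_command)
        deadline = time.time() + wait
        while time.time() < deadline:
            if probe(address):
                return True
            time.sleep(0.1)
    # Closing the file released the lock; the next request will retry.
    return False
```

The lock is held until the server answers, so a second request arriving during startup simply waits and then finds the server running. If the server is later killed, the first probe fails again and the next request restarts it.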

And when we’re done, I think we could have something that would be a really good basis for commodity hosting of a whole bunch of non-PHP frameworks. You can distribute the Linux binaries, as all the Commodity Hosts That Matter can run those (even the BSD ones should be fine). Easy application installation practically falls right out of that.

Zonbu & S3

2007-08-04T13:42:00-05:00

I read Edd Dumbill’s post on the Zonbu computer with interest. The Zonbu is a small and inexpensive computer, reminiscent of the Mac Mini but running Linux. The disk is fairly small (4Gb flash) and is intended to serve more as a cache for your network storage than as your primary store.

The network store is a frontend on Amazon S3. This is interesting but confusing, because Zonbu is selling the computer at a price of $99 if you agree to a two year contract for storage at $12.95 a month (about $300 over two years).

The underlying S3 storage is pretty cheap: $0.15 per Gb-month, and $0.10/$0.18 per Gb-upload/download (discounts for higher quantities, which probably Zonbu can get but an individual user couldn’t). So if you are storing, say, 10Gb of data, and retrieving about 10Gb per month (including all the syncing, cache misses, etc), that comes to about $3 per month. Zonbu costs between $0.50 and $0.20 per Gb-month, depending on the plan, and you pay for capacity, not what you actually use (S3 only charges for what you really use). I assume there are bandwidth limits but they aren’t published.
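The arithmetic behind those figures, as a small sketch. The prices are the ones quoted in this post; the function name and the choice to count only storage plus downloads for the example month are my own assumptions.

```python
# Prices quoted in the post (USD); S3 bills only for actual usage.
S3_STORAGE_PER_GB_MONTH = 0.15
S3_UPLOAD_PER_GB = 0.10
S3_DOWNLOAD_PER_GB = 0.18

ZONBU_MONTHLY_FEE = 12.95
ZONBU_CONTRACT_MONTHS = 24


def s3_monthly_cost(stored_gb, downloaded_gb, uploaded_gb=0.0):
    """Monthly S3 bill for the given usage at the quoted prices."""
    return (stored_gb * S3_STORAGE_PER_GB_MONTH
            + downloaded_gb * S3_DOWNLOAD_PER_GB
            + uploaded_gb * S3_UPLOAD_PER_GB)


s3_example = s3_monthly_cost(stored_gb=10, downloaded_gb=10)
zonbu_contract_total = ZONBU_MONTHLY_FEE * ZONBU_CONTRACT_MONTHS

print('S3 example: $%.2f per month' % s3_example)
print('Zonbu contract: $%.2f over two years' % zonbu_contract_total)
```

That works out to $3.30 per month for the S3 example and $310.80 for the two-year Zonbu contract, which is where the "about $3" and "about $300" come from.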

As an aside, I was looking for backup systems for my dad a few months ago, and looked at some of the backup systems that included network storage. They were often in the range of $10-20 per month, and weren’t very high capacity. I came upon S3 Backup, which is a fairly simple Windows program to upload to S3. The price of S3 is way better than any of the other commercial solutions. The billing and account setup isn’t as simple as other systems (since it’s not intended to be), but this seems like something that should be fixed. There should be a consumer version of S3. It could make it easier for software developers to make services for people without actually having to maintain infrastructure. Or maybe more accurately, it would make this possible for open source developers, since we have no interest in being the intermediary for anything as that’s all liability with no payoff. (Or maybe it’s the opposite — only by being an intermediary can you get payoff? The economics of open source get confusing.)

Zonbu, as a device and company, appeals to me. But I can’t help but feel frustrated about the network storage pricing, even though those prices are completely reasonable (and it seems without draconian cancellation fees like mobile phones). Still there’s something about the equation that I just hate — loss leaders, unnecessarily intermediated transactions, hidden costs, and a price structure that depends on people not fully utilizing what they pay for. And I really like the S3 pricing — you pay for what you use and the pricing is completely transparent. What I like about it is that at no point is Amazon expecting you to act irrationally, and for Amazon to profit from your irrational choices. They aren’t expecting you to reserve more than you need. They aren’t going to punish you if you don’t reserve enough.

Another part of why I like S3’s structure is that Amazon (well, Amazon Web Services) owns this particular space in terms of services, and it’s not because of advertising or because they cornered the market or used proprietary anything to restrict choices or made secret business deals with anyone. They simply are providing a service with enough quality and efficiency that no one else can compete (at least at the moment). When quality and efficiency drive market choices it makes me feel all fuzzy and capitalist. This happens infrequently enough that perhaps I get a little overly excitable about resellers with different price structures.

Atompub & OpenID

2007-08-06T11:38:00-05:00

One of the things I would like to do is to interact with Atompub (aka Atom Publishing Protocol) stores in Javascript through the browser. Since this is effectively the browser itself interacting with the Atompub server, browser-like authentication methods would be nice. But services like Atompub don’t work nicely with the kinds of authentication methods that normal websites use. One of these is OpenID, which is particularly browser-focused.

From the perspective of a client, OpenID basically works like this:

You need to login. You tell the original server what your OpenID URL is, somehow.
The original server does some redirects, maybe some popups, etc.
Your OpenID server (attached to your OpenID URL) authenticates you in some fashion, and then tells the original server.
The original server probably sets a signed cookie so that in subsequent requests you stay logged in. You cannot do this little redirection dance for every request, since it’s actually quite intrusive.

So what happens when I have an XMLHttpRequest that needs to be authenticated? Neither the XMLHttpRequest nor Javascript generally can do the authentication. Only the browser can, with the user’s interaction.

One thought I have is a 401 Unauthorized response, with a header like:

WWW-Authenticate: Cookie location="http://original.server/login.html"

Which means I need to open up http://original.server/login.html and have the user log in, and the final result is that a cookie will be set. XMLHttpRequest sends cookies automatically I believe, so once the browser has the cookie then all the Javascript requests get the same cookie and hence authentication.
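On the server side, the idea might look something like this WSGI middleware. It is only a sketch: the Cookie scheme is the proposal from this post and not a registered authentication scheme, require_cookie and the cookie name are hypothetical, and a real version would verify a signed cookie where this one only checks that the cookie is present.

```python
from http.cookies import SimpleCookie

LOGIN_URL = 'http://original.server/login.html'  # URL from the header above
SESSION_COOKIE = 'session'  # hypothetical cookie name


def require_cookie(app, login_url=LOGIN_URL, cookie_name=SESSION_COOKIE):
    """Wrap a WSGI app so requests without the cookie get a 401."""

    def middleware(environ, start_response):
        cookies = SimpleCookie(environ.get('HTTP_COOKIE', ''))
        if cookie_name in cookies:
            # Assumption: presence is enough here; real code would
            # check the cookie's signature before trusting it.
            return app(environ, start_response)
        body = b'Login required\n'
        start_response('401 Unauthorized', [
            ('WWW-Authenticate', 'Cookie location="%s"' % login_url),
            ('Content-Type', 'text/plain'),
            ('Content-Length', str(len(body))),
        ])
        return [body]

    return middleware
```

A Javascript client that gets this 401 would read the location parameter, open that page for the user, and retry the request once the cookie has been set.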

One problem, though, is that you have to wait around for a while for the login to succeed, then continue on your way. A typical situation is that you have to return to the original page you were requesting, and people often do something like /login?redirect_to=original_url. In this case we might want something like /login?opener_call=reattempt_request, where when the login process is over we call window.opener.reattempt_request() in Javascript.

Maybe it would make sense for that location variable to be a URI Template, with some predefined variables, like opener, back, etc.

For general backward compatibility, would it be reasonable to send 307 Temporary Redirect plus WWW-Authenticate, and let XMLHttpRequests or other service clients sort it out, while normal browser requests do the normal login redirect?

Update: Another question/thought: is it okay to send multiple WWW-Authenticate headers, to give the client options for how it wants to do authentication? It seems vaguely okay, according to RFC 2616 14.47.

Tempita

2007-08-06T16:50:00-05:00

I mentioned a templating language I put into Paste a while ago, but since then I extracted it into a separate package called Tempita. I think the documentation is fairly complete (it’s a small language), but I’ll describe it briefly here.

I wanted a text-substitution language, because I wanted something to be used to generate Python files, config files, etc. I also didn’t want a complex API, with search paths and components or something that interacts with import machinery, or any of that. string.Template is almost good enough, but not quite.
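To make the "almost good enough" concrete, here is roughly what generating a config file with string.Template looks like. The template text and variable names are made up for illustration. Plain substitution is easy, but anything conditional has to be worked out in Python first and passed in as one more string.

```python
from string import Template

# Plain variable substitution works well.
config_template = Template('[server]\nhost = $host\nport = $port\n')
rendered = config_template.substitute(host='localhost', port=8080)

# There is no "if" or "for" in the template itself, so optional lines
# have to be built by the caller before substituting.
debug = True
debug_line = 'debug = true\n' if debug else ''
rendered_with_logic = Template('$base$debug_line').substitute(
    base=rendered, debug_line=debug_line)

print(rendered_with_logic)
```

Once a few of these pile up, the logic lives in the calling code and the template stops describing the output, which is the gap a small language with conditionals is meant to fill.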

I started with the idea of something vaguely like Django Templates, though I didn’t care about more advanced templating features like blocks, since they didn’t apply to my use cases. You do variable substitution with {{var|filter}}, and there’s no escape character, and that’s about where the similarity ends.

I realized there was no real reason to use anything but {{...}}, so it’s just {{if expr}}, {{endif}}, etc. There’s an escape for arbitrary Python, similar to how Kid does it: you can have blocks of Python code, but the Python code can only prepare variables and functions, it can’t write anything. I think this gives a nice escape for complex logic (for times when you can’t put the logic in a .py file), without the jumbled mish-mash of languages like PHP where you can truly mix functions and output.

Because it allows Python expressions everywhere, special tags don’t seem so necessary. Instead you can just provide functions to do whatever you need. I wrote a couple little ones as a start. There’s a few things that are awkward still, because there’s no way to define a block of template as a function, or pass the output of a block to a function. I haven’t actually needed these yet, but I can imagine needing this (e.g., when creating nested structures).

I wouldn’t suggest using this templating language in a web application, but I think it can be quite helpful for all the cases where you have to generate text and you aren’t writing a web application (e.g., a Framework Component). In my experience the web templating languages tend to be complex to invoke and understand in these contexts (and Buffet unfortunately doesn’t help in my mind, as its loading system is so vague).

XO B4

2007-08-07T22:19:00-05:00

I recently received a Beta-4 XO laptop. I won’t describe the hardware on the whole, but probably a number of readers here have seen the B2 laptops so I thought I’d write up a quick description of the changes I’ve noticed. If you haven’t seen the XO in person, then the minutiae of this post may be boring.

First and most substantially, the CPU, memory, and disk have all been upgraded. It now has 256MB RAM, 1GB of flash disk, and a 433MHz Geode processor. This makes a very significant impact on the speed.

It features a big colored XO on the back. Laptops will get different random combinations of X and O colors, so you can tell one laptop from another. I’m a little disappointed to have coincidentally received an X with the same color as the laptop’s green.

The screen now tilts back a bit further than it used to. It’s now comfortable to have it on a table or my lap, where before I liked to have it higher up. Putting the B2 and B4 side-by-side the change in tilt doesn’t seem significant, but using them it’s quite noticeable.

The antennae (“ears”) are now rubber. This is intended to increase the laptop’s durability when dropped (apparently it can sustain a 1.5 meter drop onto its antennae). Unfortunately along the way the latching mechanism became stiffer, so I don’t let people puzzle out how to open it anymore; it requires too much forcing to guess.

The handle is now textured. I never had any problem keeping a grip on it before, but the dots look nice. A cute detail is that around the edge the dots turn into X’s, making little XO figures.

The keyboard has had a few changes. Instead of a slider for the backlight and another slider for the volume, they have been combined into one key with four sensors. The slider that had been used for the backlight is now free to be used by applications. The chat button changed appearances a bit, and it looks like the camera/voice button has been turned into a zoom button. The mouse buttons now have an X on the left button and an O on the right button, to make it easier to refer to them in instructions. The keyboard also is generally more responsive; the spacebar doesn’t seem to have any dead spots anymore, and the keys are more reliable when tapped. It’s still a very small keyboard if you try to touch type, but it’s not impossible (at some point I seem to have lost the ability to hunt and peck, but I can get by).

There are now small white LEDs under the plastic for both the microphone and camera. Whenever these are in use, the light turns on. This is done in hardware as a security measure, so malicious software can’t surreptitiously record things. The plastic around the screen is also now a light gray instead of white; from what I understand this is to make the screen seem higher contrast, I suppose because the white of the plastic could otherwise overpower the white of the screen.

The laptop also came with an LiFePO4 battery, which is lighter and higher capacity than the NiMH batteries used before. The total difference in weight isn’t very noticeable. (Li-Ion batteries haven’t been an option in the XO because of safety concerns.)

The software has had more changes, but that’s an entirely different topic.

Opening Python Classes

2007-08-08T14:03:00-05:00

So, I was reading through comments to despam my old posts before archiving them, and came upon this old reply to this old post of mine which was a reply to this much older post.

I won’t reply to that post much, because it’s mostly… well, not useful to respond to. But people often talk about the wonders of Open Classes in Ruby. For Python people who aren’t familiar with what that means, you can do:

# Somehow acquire SomeClassThatAlreadyExists
class SomeClassThatAlreadyExists
    def some_method(blahblahblah)
        stuff
    end
end

And SomeClassThatAlreadyExists has a some_method added to it (or if that method already exists, then the method is replaced with the new implementation).

In Python when you do this, you’ve defined an entirely new class that just happens to have the name SomeClassThatAlreadyExists. It doesn’t actually affect the original class, and probably will leave you confused because of the two very different classes with the same name. In Ruby when you define a class that already exists, you are extending the class in-place.

You can change Python classes in-place, but there’s no special syntax for it, so people either think you can’t do it, or don’t realize that you are doing the same thing as in Ruby but without the syntactic help. I guess this will be easier with class decorators, but some time ago I also wrote a recipe using normal decorators that looks like this:

@magic_set(SomeClassThatAlreadyExists)
def some_method(self, blahblahblah):
    stuff

The only thing that is even slightly magic about the setting is that I look at the first argument of the function to determine if you are adding an instance, class, or static method to an object, and let you add it to classes or instances. It’s really not that magic, even if it is called magicset.
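A simplified sketch of that idea, to show how little is going on. This is not the recipe itself: it is a reconstruction under the assumption that the first argument's name (self, cls, or anything else) decides the kind of method, and the Greeter class is invented for the example.

```python
import inspect
import types


def magic_set(obj):
    """Attach the decorated function to obj (a class or an instance)."""

    def decorator(func):
        params = list(inspect.signature(func).parameters)
        first = params[0] if params else None
        if inspect.isclass(obj):
            if first == 'self':
                value = func                  # ordinary instance method
            elif first == 'cls':
                value = classmethod(func)
            else:
                value = staticmethod(func)
        else:
            if first == 'self':
                value = types.MethodType(func, obj)        # bind to instance
            elif first == 'cls':
                value = types.MethodType(func, type(obj))  # bind to its class
            else:
                value = func                  # plain function attribute
        setattr(obj, func.__name__, value)
        return func

    return decorator


class Greeter(object):
    def __init__(self, name):
        self.name = name


@magic_set(Greeter)
def greet(self):
    return 'hello ' + self.name


@magic_set(Greeter)
def make(cls, name):
    return cls(name)


@magic_set(Greeter)
def shout(text):
    return text.upper()
```

After this runs, Greeter.make('ian').greet() works like any method defined in the class body, and passing an instance to magic_set adds the method to that one object only.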

I think with class decorators you could do this:

@extend(SomeClassThatAlreadyExists)
class SomeClassThatAlreadyExists:
    def some_method(self, blahblahblah):
        stuff

Implemented like this:

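def extend(class_to_extend):
    def decorator(extending_class):
        # A class __dict__ is a read-only proxy for new-style classes,
        # so copy the attributes across with setattr instead of update().
        for name, value in vars(extending_class).items():
            if name not in ('__dict__', '__weakref__', '__module__', '__doc__'):
                setattr(class_to_extend, name, value)
        return class_to_extend
    return decorator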

Defaults & Inheritance

2007-08-10T17:45:00-05:00

I thought I’d note a way I try to make classes reasonably customizable without creating lots of classes, but letting other people create classes if they want.

Here’s a common technique; I’m going to use a class from WSGIProxy as an example, because that’s where I was about to use this technique when I thought it might make an okay post.

In this example there’s a WSGI application that forwards requests to another HTTP server. There’s different ways to forward requests, depending on what kind of data you want to give the remote server about the original request. One example is Zope’s VirtualHostMonster, which takes requests like /VirtualHostBase/http/example.org:80/rootdir/VirtualHostRoot/path. The idea is that the server can then realize that the original request was for http://example.org/path (and should ignore any Host headers), and that Zope is supposed to serve that from the internal path /rootdir/path.

There’s a problem with this particular pattern, because there’s no way to mount, say, /blog onto some Zope /sitename/blog-application path, because there’s no concept like in WSGI or CGI of SCRIPT_NAME — the base path of the request. It only handles the base host. So I didn’t just want to settle on that.

I’m kind of inclined to prefer headers, like X-Script-Name: /blog, X-Forwarded-Server: example.org, etc. But I want to support both forms.

The common way to do this is:

class WSGIProxyApp(object):

    def __init__(self, host): ...

    def __call__(self, environ, start_response):
        # actual application interface...
        # Constructs the base request:
        request = self.construct_request(environ)
        # Uses one of these conventions:
        self.update_headers(environ, request)
        ... do stuff with request ...

    def update_headers(self, orig_environ, request):
        raise NotImplementedError

class VirtualHostMonsterApp(WSGIProxyApp):

    def update_headers(self, orig_environ, request):
        request.environ['SCRIPT_NAME'] = (
            '/VirtualHostBase/%(wsgi.scheme)s/%(HTTP_HOST)s/VirtualHostRoot/'
            % orig_environ)

class HeaderSetterApp(WSGIProxyApp):

    def update_headers(self, orig_environ, request):
        request.environ['HTTP_X_SCRIPT_NAME'] = orig_environ['SCRIPT_NAME']
        # and so on...

Then you use one of the subclasses depending on your needs. Personally I think this really sucks. For one thing, you may have to determine which class to use based on some configuration parameter, which can get awkward. And you might want to subclass the class to change the functionality some yourself, but you have to subclass both of them. There’s patterns to handle this, with policies and factories and other crap; but it’s not a hard problem, and those patterns are hard solutions to a problem that shouldn’t be hard.

Also, it’s harder to inform people about the options available to them, and somewhat harder to use these classes. So I tend to do something like:

class WSGIProxyApp(object):
    default_forwarding_style = 'headers'

    def __init__(self, host, forwarding_style=None):
        ...
        if forwarding_style is None:
            forwarding_style = self.default_forwarding_style
        self.forwarding_style = forwarding_style

    def __call__(self, environ, start_response):
        ...
        method = self.forwarding_style
        if isinstance(method, str):
            method = getattr(self, 'forward_'+self.forwarding_style)
        method(environ, request)
        ...

    def forward_headers(self, orig_environ, request): ...
    def forward_virtual_host_monster(self, orig_environ, request): ...

This way, changing the style is just a simple parameter. You can pass in your own function, or use one of the named methods already available. The default_forwarding_style class variable lets you change the default in subclasses. If the default were in the function signature it would be much more awkward to change it, because you’d have to override the method and its signature with just that one change, then delegate back to the superclass method.
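To make the three ways of choosing a style concrete, here is a minimal runnable sketch. ProxySketch, VHMByDefault, forward_custom, and the forward() method are hypothetical names for illustration only; the real class would do the dispatch inside __call__, and the request here is reduced to a plain dict.

```python
class ProxySketch(object):
    """Hypothetical stand-in for WSGIProxyApp; only the dispatch logic is shown."""

    default_forwarding_style = 'headers'

    def __init__(self, host, forwarding_style=None):
        self.host = host
        if forwarding_style is None:
            forwarding_style = self.default_forwarding_style
        self.forwarding_style = forwarding_style

    def forward(self, orig_environ, request_environ):
        # A string names one of the forward_* methods; anything else is
        # assumed to be a callable taking (orig_environ, request_environ).
        method = self.forwarding_style
        if isinstance(method, str):
            method = getattr(self, 'forward_' + method)
        method(orig_environ, request_environ)
        return request_environ

    def forward_headers(self, orig_environ, request_environ):
        request_environ['HTTP_X_SCRIPT_NAME'] = orig_environ['SCRIPT_NAME']

    def forward_virtual_host_monster(self, orig_environ, request_environ):
        request_environ['SCRIPT_NAME'] = (
            '/VirtualHostRoot/%(wsgi.scheme)s/%(HTTP_HOST)s/VirtualHostRoot/'
            % orig_environ)


class VHMByDefault(ProxySketch):
    # Changing the default takes one line; no method is overridden.
    default_forwarding_style = 'virtual_host_monster'


def forward_custom(orig_environ, request_environ):
    # A caller-supplied style, passed in as a plain function.
    request_environ['HTTP_X_FORWARDED_HOST'] = orig_environ['HTTP_HOST']


orig = {'SCRIPT_NAME': '/app', 'HTTP_HOST': 'example.com', 'wsgi.scheme': 'http'}
by_name = ProxySketch('backend').forward(orig, {})
by_subclass = VHMByDefault('backend').forward(orig, {})
by_callable = ProxySketch('backend', forwarding_style=forward_custom).forward(orig, {})
```

The same instance-level parameter covers configuration-driven choice (a string from a config file), subclass defaults, and one-off custom behavior.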

Atom Publishing Protocol: Atompub

2007-08-12T13:50:00-05:00

Doing stuff with the Atom Publishing Protocol, I’ve noticed that it goes by two (shortened) names: APP and Atompub. I’d become used to calling it APP, but I’ve decided to make a conscious effort to call it Atompub from now on, and I encourage you all to do the same. You cannot usefully search for “APP”, and its pronunciation is ambiguous. Atompub is a much better name.

And as long as we’re talking about names, I’ll note that the Cheese Shop is now called PyPI again. I think we are supposed to pronounce it pih-pee, distinct from PyPy which is pie-pie. (Blast, PyPI is down; the Zope guys have been making a static stripped-down mirror for use with Setuptools, over here)

Of Microformats and the Semantic Web

2007-08-14T11:52:00-05:00

I was talking a little with Daniel Krech (author of rdflib) about Semantic Web stuff and microformats and what they all mean. And he was saying that microformats were nice, because you could do something with them, but it would be nice to see that generalized.

By “generalized” I think he meant a general way of expressing arbitrary relationships. As an example, in hCard you can do:

<span class="tel">
  <span class="type">home</span>:
  <span class="value">773-555-3821</span>
</span>

The hCard specification (itself leaning heavily on vCard) defines tel, type, and there’s a general pattern of what value means. But if you want to describe some new kind of structure, there’s no way to do that really; there’s no marital status format, for instance (which would be useful for a singles search engine, as an example).
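As a rough illustration of what consuming that markup looks like, here is a sketch using only the standard library’s html.parser. TelParser is a hypothetical name, and it only handles the simple one-class-per-span case shown above; it is not a real hCard parser. Note that the code knows what to look for only because the names tel, type, and value were agreed on ahead of time.

```python
from html.parser import HTMLParser

MARKUP = '''
<span class="tel">
  <span class="type">home</span>:
  <span class="value">773-555-3821</span>
</span>
'''


class TelParser(HTMLParser):
    """Hypothetical, minimal extractor for hCard-style tel spans."""

    def __init__(self):
        super().__init__()
        self.stack = []  # class attribute of each currently open <span>
        self.tels = []   # one dict per <span class="tel">

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        cls = dict(attrs).get('class', '')
        self.stack.append(cls)
        if cls == 'tel':
            self.tels.append({})

    def handle_endtag(self, tag):
        if tag == 'span' and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only text directly inside a type/value span nested in a tel span counts.
        if (len(self.stack) >= 2 and self.stack[-2] == 'tel'
                and self.stack[-1] in ('type', 'value')):
            self.tels[-1][self.stack[-1]] = data.strip()


parser = TelParser()
parser.feed(MARKUP)
tels = parser.tels
```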

So I started thinking: can you really generalize it? And I started to think about Joe Gregorio’s attack on WADL:

Here is the very first example in the WADL specification.

That WADL file is a description of a search interface. But here is how you should really do it. That’s an OpenSearch document, that also describes a search interface.

Q: What’s the difference?

A: A mime-type.

Q: That doesn’t seem like much, does it make a difference?

A: Yes, it makes a big difference. When you get an OpenSearch document there is a whole data model and a set of interactions you know are possible because you read the OpenSearch specification. By reading that spec you know how to construct search queries. When I get a WADL document it might describe anything, from how to construct a search, to the APP, to JEP, to XML-RPC.

…

So when I say the difference is a ‘mime-type’, what I mean is that there is an entire spec somewhere which describes what that document means, and that meaning may include hypertext functionality, ala (X)HTML, XForms, and OpenSearch.

This made me think of shared understanding more than explicit descriptions. OpenSearch, APP, and Atom are very well described, but I think that’s only half of it: they are useful when they describe something that many people already understand.

Digressing slightly, one “semantic markup” ideal that still bugs me is <strong> and <em> vs. <b> and <i>. When I compose text I choose to make some words bold and some italic. I have no idea what “strong” and “emphasis” are even supposed to mean. When I’m composing text, I don’t actually know why I choose one or the other. If I sat down and thought about it I’m sure I could come up with a set of rules that describe when bold is appropriate and when italic is appropriate. But that is reflecting on my choice, it is not describing my choice. There is no intermediate semantic meaning between what I am saying and bold and italic. I think in bold and italic. Readers in turn find meaning in the text itself; they do not parse my writing into semantic markup in their brain.

I think there’s some connection between this and the shared understanding that microformats represent, and that a more generalized RDF model does not represent. I know what hCard means; not just in an intellectual way, but I can imagine a dozen functional uses of it without hardly trying, and of course I am entirely clear on what contact information means. Moreover, I know what it means without actually figuring out what it means; if you asked me to articulate what contact information means I’d have to think a little, and I’m sure many people would come up with bad answers or be stumped. And yet they all actually understand what it means.

Bringing this back to Joe’s post, if I write something that produces or consumes Atom, Atompub, or OpenSearch, I understand the why of my code. With both WADL and RDF my code is divorced from the why. This isn’t about my personal understanding either; explaining it to me doesn’t serve any purpose, because any exchange format has to make sense to many, many people to be useful. Even an education campaign won’t fix this: education by description is far inferior to education by doing, and there’s no “doing” to WADL and RDF right now.

That said, what is sufficiently obvious in the future may not be obvious now. Maybe we’ll all get smarter. Maybe someone will pioneer this stuff in a way that is really useful (Facebook?), and grow the public’s intuition about describing relationships in an abstract way. But until then I think microformats are going about this the right way, describing those things that are most easily describable.

Reflection and Description Of Meaning

2007-08-14T14:18:00-05:00

After writing my last post I thought I might follow up with a bit of cognitive speculation. Since the first comment was exactly about the issue I was thinking about writing on, I might as well follow up quickly.

Jeff Snell replied:

You parse semantic markup in rich text all the time. When formatting changes, you apply a reason. RFC’s don’t capitalize MUST and SHOULD because the author is thinking in upper-case versus lower-case. They’re putting a strong emphasis on those words. As a reader, you take special notice of those words being formatted that way and immediately recognize that they contain a special importance. So I think that readers do parse writing into semantic markup inside their brains.

Emphasis not added. Wait, bold isn’t emphasis, it’s strong! So sorry, STRONG not added.

I think the reasoning here is flawed, in that it supposes that reflection on how we think is an accurate way of describing how we think.

A few years ago I got interested in cognition for a while and particularly some of the new theories on consciousness. One of the parts that really stuck with me was the difference in how we think about thinking, and how thinking really works (as revealed with timing experiments). That is, our conscious thought (the thinking-about-thinking) happened after the actual thought; we make up reasons for our actions when we’re challenged, but if we aren’t challenged to explain our actions there’s no consciousness at all (of course, you can challenge yourself to explain your reasoning — but you usually won’t). And then we revise history so that our reasoning precedes our decision, but that’s not always very accurate. This gets around the infinite-loop problem, where either there’s always another level of meta-consciousness reasoning about the lower level of consciousness, or there’s a potentially infinite sequence of whys that have to be answered for every decision. And of course sometimes we really do make rational decisions and there are several levels of why answered before we commit. But this is not the most common case, and there’s always a limit to how much reflection we can do. There are always decisions made without conscious consideration — if only to free ourselves to focus on the important decisions.

And so as both a reader and a writer, I think in terms of italic and bold. As a reader and a writer there is of course translation from one form to another. There’s some idea inside of me that I want to get out in my writing, there’s some idea outside of me that I want to understand as a reader. But just because I can describe some intermediate form of semantic meaning, it doesn’t mean that that meaning is actually there. Instead I invent things like “strong” and “emphasis” when I’m asked to decide why I chose a particular text style. But the real decision is intuitive — I map directly from my ideas to words on the page, or vice versa for reading.

Obviously this is not true for all markup. But my intuition as both a reader and a writer about bold and italic is strong enough that I feel confident there’s no intermediary representation. This is not unlike the fact I don’t consider the phonetics of most words (though admittedly I did when trying to spell “phonetics”); common words are opaque tokens that I read in their entirety without consideration of their component letters. And a good reader reads text words without consideration of their vocal equivalents (though as a writer I read my own writing out loud… is that typical? I’m guessing it is). A good reader can of course vocalize if asked, but that doesn’t mean the vocalization is an accurate representation of their original reading experience.

Though it’s kind of an aside, I think the use of MUST and SHOULD in RFCs fits with this theory. By using all caps they emphasize the word over the prose, they make the reader see the words as tokens unique from “must” and “should”, with special meanings that are related to but also much more strict than their usual English meaning. The caps are a way of disturbing our natural way of determining meaning because they need a more exact language.

DictMixin

2007-08-17T00:02:00-05:00

Quite some time ago I gave a little presentation on DictMixin at ChiPy. If you haven’t used DictMixin before, it’s a class that implements all the derivative methods of dictionaries so you only have to implement the most minimal set: __getitem__, __setitem__, __delitem__, and keys. It’s a lot better than subclassing dict directly: with dict you have to implement a lot more, and dict implies a specific kind of storage. With DictMixin you can get the information from anywhere.
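Here is a small sketch of the idea. One loud caveat: DictMixin lived in Python 2’s UserDict module and is not available in Python 3, so this uses collections.abc.MutableMapping as a swapped-in modern equivalent. It asks for __iter__ and __len__ instead of keys, so the minimal set is five methods rather than four. PairListDict is a hypothetical example class.

```python
from collections.abc import MutableMapping


class PairListDict(MutableMapping):
    """Hypothetical mapping stored as a list of (key, value) pairs."""

    def __init__(self):
        self.pairs = []

    def __getitem__(self, key):
        for k, v in self.pairs:
            if k == key:
                return v
        raise KeyError(key)

    def __setitem__(self, key, value):
        # Drop any existing pair for this key, then append the new one.
        self.pairs = [(k, v) for k, v in self.pairs if k != key]
        self.pairs.append((key, value))

    def __delitem__(self, key):
        self[key]  # raises KeyError if the key is absent
        self.pairs = [(k, v) for k, v in self.pairs if k != key]

    def __iter__(self):
        return iter([k for k, v in self.pairs])

    def __len__(self):
        return len(self.pairs)


d = PairListDict()
# None of these methods are defined above; the mixin derives them.
d.update({'a': 1, 'b': 2})
popped = d.pop('a')
items = sorted(d.items())
default = d.setdefault('c', 3)
```

The storage here is deliberately silly; the point is that update, pop, items, setdefault, get, and in all come for free.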

I thought of a couple examples, and wrote some doctests for them; I thought satisfying the doctests would itself be the presentation. I’m not sure how it worked; it was a fairly experienced crowd, but the switch from code to test can be disorienting.

One of the examples I used was a filesystem access layer. Representing a filesystem as a dictionary is nothing new, but the simplicity of the representation worked well. Here’s how it works:

An FSDict represents one directory.
The keys are the filenames in the directory.
The values are the contents of the files (strings).
When there is a subdirectory, it is another FSDict instance.
When you assign a dictionary-like object to a key, it creates an FSDict from that object.

Dictionaries have lots of methods, like items(), update(), etc. But using DictMixin you just implement the four methods. First, the setup:

import os
# In Python 2, DictMixin lives in the UserDict module
from UserDict import DictMixin

class FSDict(DictMixin):
    def __init__(self, path):
        self.path = path

Creation of a dictionary is not part of the dictionary interface. This seems a little strange at first, but the dict class interface isn’t the same as the dictionary instance interface. So FSDict.__init__ doesn’t bear any particular relation to dict.__init__.

Now the other methods… in each case, strings and dictionaries (files and directories) are treated differently.

def __getitem__(self, item):
    fn = os.path.join(self.path, item)
    if not os.path.exists(fn):
        raise KeyError("File %s does not exist" % fn)
    if os.path.isdir(fn):
        return self.__class__(fn)
    f = open(fn, 'rb')
    c = f.read()
    f.close()
    return c

Note the use of self.__class__(fn) instead of FSDict(fn). This makes the class subclassable if you retain the FSDict.__init__ signature. This way subclasses will create new instances using the subclass. Note also that KeyError is part of the dictionary interface (an important part!), so we can’t raise IOError.

Now, assignment…

def __setitem__(self, item, value):
    if item in self:
        del self[item]
    fn = os.path.join(self.path, item)
    if isinstance(value, str):
        f = open(fn, 'wb')
        f.write(value)
        f.close()
    else:
        # Assume it is a dictionary
        os.mkdir(fn)
        f = self[item]
        f.update(value)

Note that with subdirectories (represented as nested dictionaries) we let DictMixin.update do all the hard work, and just create an empty directory to be filled.

Deletion…

def __delitem__(self, item):
    fn = os.path.join(self.path, item)
    if not os.path.exists(fn):
        raise KeyError("File %s does not exist" % fn)
    if os.path.isdir(fn):
        ## one way...
        self[item].clear()
        os.rmdir(fn)
        ## another way...
        #shutil.rmtree(fn)
    else:
        os.unlink(fn)

Enumeration…

def keys(self):
    return os.listdir(self.path)

So, to recursively copy '/foo/bar' to '/dest/path/bar' you do:

FSDict('/dest/path')['bar'] = FSDict('/foo')['bar']

It doesn’t really matter if '/foo/bar' is a directory or file. There are a number of other clever things that come out of this. I think it’s an example of the power of a closed set: dictionaries are expressible from these four operations, and all the other methods can be derived from there. If you find this interesting, you might want to read the source for DictMixin; it’s only about 95 lines.
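For anyone who wants to try this on a current Python, here is the whole thing assembled into one runnable sketch. It is an adaptation, not the code from the presentation: it assumes Python 3, swaps DictMixin for collections.abc.MutableMapping (so keys becomes __iter__ plus __len__), treats file contents as bytes rather than str, and uses shutil.rmtree for directory deletion. The temporary directory stands in for the '/foo' and '/dest/path' paths above.

```python
import os
import shutil
import tempfile
from collections.abc import MutableMapping


class FSDict(MutableMapping):
    """One directory as a mapping: filenames to bytes or nested FSDicts."""

    def __init__(self, path):
        self.path = path

    def __getitem__(self, item):
        fn = os.path.join(self.path, item)
        if not os.path.exists(fn):
            raise KeyError("File %s does not exist" % fn)
        if os.path.isdir(fn):
            return self.__class__(fn)
        with open(fn, 'rb') as f:
            return f.read()

    def __setitem__(self, item, value):
        if item in self:
            del self[item]
        fn = os.path.join(self.path, item)
        if isinstance(value, bytes):
            with open(fn, 'wb') as f:
                f.write(value)
        else:
            # Assume it is dictionary-like; the inherited update() fills it in.
            os.mkdir(fn)
            self[item].update(value)

    def __delitem__(self, item):
        fn = os.path.join(self.path, item)
        if not os.path.exists(fn):
            raise KeyError("File %s does not exist" % fn)
        if os.path.isdir(fn):
            shutil.rmtree(fn)
        else:
            os.unlink(fn)

    def __iter__(self):
        return iter(os.listdir(self.path))

    def __len__(self):
        return len(os.listdir(self.path))


root = tempfile.mkdtemp()
try:
    os.mkdir(os.path.join(root, 'foo'))
    os.mkdir(os.path.join(root, 'dest'))
    src = FSDict(os.path.join(root, 'foo'))
    # Assigning a plain dict creates a directory tree on disk.
    src['bar'] = {'a.txt': b'hello', 'sub': {'b.txt': b'world'}}
    # The recursive copy: file or directory, the assignment is the same.
    dest = FSDict(os.path.join(root, 'dest'))
    dest['bar'] = src['bar']
    copied = dest['bar']['sub']['b.txt']
    names = sorted(dest['bar'])
finally:
    shutil.rmtree(root)
```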

My article templating via dict wrappers has some other similar dict tricks.

WebOb

2007-08-18T19:37:00-05:00

I’ve had it in my head to extract/rewrite parts of Paste lately. Tempita was one example.

The request and response functions in Paste grew very organically. I wasn’t trying to create a framework, so I studiously avoided anything that might look like a request or response object. I felt that would be stepping on toes or something. Eventually, though, Ben Bangert really wanted a request object for Pylons, and it went in paste.wsgiwrappers. And at a certain point I decided that the class-based access was really just fine, and doing lots of function(environ, ...) was no better than Request(environ).function(...).

So I started WebOb. WebOb has Request, Response, and some exceptions, incorporating the functionality of Paste’s paste.request, paste.response, paste.wsgilib, paste.httpexceptions, and paste.httpheaders. And some extra stuff.

I’ve included a comparison with a few other framework request/response objects. What this doesn’t note, though, is that WebOb has much larger Request and Response objects. I’ve taken almost all the HTTP headers and mapped them to parsed attributes. So req.if_modified_since returns a datetime object, and req.if_none_match returns a somewhat set-like object, as a few examples. I created a lot of view-like objects for this, representing the canonical form of the information in several other forms (the WSGI request environment, and the status/headers/body of the response).
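To give a feel for what “headers mapped to parsed attributes” means, here is a standard-library-only sketch. SketchRequest is a hypothetical class and is not how WebOb implements it; it only shows the idea that the WSGI environ stays the canonical form and each attribute is a parsed view onto it. The ETag handling is deliberately naive (it ignores weak tags and "*").

```python
from email.utils import parsedate_to_datetime


class SketchRequest(object):
    """Hypothetical, much-reduced request wrapper; not WebOb's code."""

    def __init__(self, environ):
        # The WSGI environ remains the single source of truth.
        self.environ = environ

    @property
    def if_modified_since(self):
        value = self.environ.get('HTTP_IF_MODIFIED_SINCE')
        if value is None:
            return None
        return parsedate_to_datetime(value)

    @property
    def if_none_match(self):
        value = self.environ.get('HTTP_IF_NONE_MATCH', '')
        return frozenset(
            tag.strip().strip('"')
            for tag in value.split(',') if tag.strip())


req = SketchRequest({
    'HTTP_IF_MODIFIED_SINCE': 'Sat, 18 Aug 2007 19:37:00 GMT',
    'HTTP_IF_NONE_MATCH': '"abc", "def"',
})
since = req.if_modified_since
etags = req.if_none_match
missing = SketchRequest({}).if_modified_since
```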

It’s fairly well tested and includes almost everything I think it should include, but I reserve the right to change the API any way I want until 1.0; this means if you have any opinion on the API I have nothing to stop me from taking your opinions into account.

Oh, and it has docs, really. They may not be the best docs, but they mention most everything and are automatically tested for accuracy. If you just want a sense of the feel, maybe the file-serving example would be a good place to start (though really you’ll only read about the Response object there).

The Shrinking Python Web Framework World

2007-08-21T23:25:00-05:00

When I was writing the summary of differences between WebOb and other request objects, I went to the WebFrameworks page on the Python wiki to remind myself of web frameworks I might have forgotten.

Looking through that page I was reminded how many framework options there have been. And I was further reminded of how few relevant options there are now. From all this, there have emerged just a few options: Django, Pylons, TurboGears, Zope. No offense to anyone left out of that list; I know there are some other actively developed frameworks out there. But frankly they aren’t serious choices; they might be fine internal tools, or interesting experiments, but they are clearly on a different tier (and they all have questionable futures).

And now that TurboGears 2 will be based on Pylons the list looks smaller still.

For a long, long time (longer than most of those frameworks have existed) people have complained about the proliferation of web frameworks in Python. Those of us involved in developing web frameworks in Python haven’t been able to respond all that well. Complaining doesn’t magically lead to solutions, and you can’t just will there to be a single Python web framework. You can work towards that, but that’s what we’ve been doing… mostly people don’t seem to notice. It’s just not an easy thing to work towards; the problem space for a web framework isn’t well defined, its end goal is far more vague than most people immediately realize, and it involves consensus, which makes everything much harder. We said the market would decide, which is kind of a cop out (the market decides through the decisions of developers) but that’s the best answer we had.

But after all this time, it seems clear that we are getting much closer to that goal. If you squint really hard, you can almost imagine we are there. The total list of frameworks only gets longer over time — that’s how open source works — but the list of choices has become quite compact.

How we get to the next level is a little less clear. We’ve gotten this way largely through attrition, but that’s not going to get us any further. I’ll at least assure people that we are discussing this stuff — it’s slow going, but everyone is interested. And if anyone actually wants to do some leg work to move this forward, a lot of the work is actually technical, not political, so don’t be afraid to jump in.

Doctest for Ruby

2007-08-23T11:10:00-05:00

Finally, someone wrote a version of doctest for Ruby.

Recently I’ve been writing most of my tests using stand-alone doctest files. It’s a great way to do TDD — mostly because the cognitive load is so low. Also, I write my examples but don’t write my output, then copy the output after visually confirming it is correct. So the basic pattern is:

Figure out what I want to do
Figure out how I want to test it
Automate my conditions
Manually inspect whether the output is correct (i.e., implement and debug)
Copy the output so that in the future the manual process is automated (doctest-mode for Emacs makes this particularly easy)

The result is a really good balance of manual and automated testing, I think giving you the benefit of both processes — the ease of manual testing, and the robustness of automated testing.
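Here is a small sketch of where that workflow ends up. The slugify function and the document text are hypothetical; in practice the text would live in its own file and be run with doctest.testfile, but keeping it in a string makes the example self-contained. The expected output lines are the part I would have pasted in after checking them by eye.

```python
import doctest


def slugify(title):
    """Hypothetical function under test."""
    return '-'.join(title.lower().split())


# Stand-in for the contents of a stand-alone doctest file.
DOC = '''
Turning a title into a slug:

    >>> slugify('Doctest for Ruby')
    'doctest-for-ruby'
    >>> slugify('  The   Shrinking World ')
    'the-shrinking-world'
'''

parser = doctest.DocTestParser()
doc_test = parser.get_doctest(DOC, {'slugify': slugify}, 'slugify.txt', None, 0)
runner = doctest.DocTestRunner(verbose=False)
runner.run(doc_test)
results = runner.summarize(verbose=False)
```

Once the output has been copied in, the manual check is frozen: any later change in behavior shows up as a failure in results.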

Another good thing about doctest is it doesn’t let you hide any boilerplate and setup. If it’s easy to use doctest, it’s probably easy to use the library.

There’s nothing Python-specific about doctest (e.g., doctestjs), so it’s good to see it moving to other languages. Even if the language doesn’t have a REPL, IMHO it’s worth inventing it just for this.