Packaging

Python Application Package

I’ve been thinking some more about deployment of Python web applications, and deployment in general (in part leading up to the Web Summit). And I’ve got an idea.

I wrote about this about a year ago and recently revised some notes on a proposal but I’ve been thinking about something a bit more basic: a way to simply ship server applications, bundles of code. Web applications are just one use case for this.

For now let's call this a "Python application package". It has these features:

  1. There is an application description: this tells the environment about the application. (This is sometimes called "configuration" but that term is very confusing and overloaded; I think "description" is much clearer.)
  2. Given the description, you can create an execution environment to run code from the application and acquire objects from the application. So there would be a specific way to set up sys.path, and a way to indicate any libraries that are required but not bundled directly with the application.
  3. The environment can inject information into the application. (Also this sort of thing is sometimes called "configuration", but let’s not do that either.) This is where the environment could indicate, for instance, what database the application should connect to (host, username, etc).
  4. There would be a way to run commands and get objects from the application. The environment would look in the application description to get the names of commands or objects, and use them in some specific manner depending on the purpose of the application. For instance, WSGI web applications would point the environment to an application object. A Tornado application might simply have a command to start itself (with the environment indicating what port to use through its injection).

There’s a lot of things you can build from these pieces, and in a sophisticated application you might use a bunch of them at once. You might have some WSGI, maybe a separate non-WSGI server to handle Web Sockets, something for a Celery queue, a way to accept incoming email, etc. In pretty much all cases I think a basic application lifecycle is needed: commands to run when an application is first installed, to verify the environment is acceptable, to back up its data, and to uninstall it.

There’s also some things that all environments should set up the same way or inject into the application. E.g., $TMPDIR should point to a place where the application can keep its temporary files. Or, every application should have a directory (perhaps specified in another environmental variable) where it can write log files.
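
To make the injection side concrete, here’s a minimal sketch of how an application might pick up that injected information. $TMPDIR is a standard variable; the other names here are purely hypothetical placeholders, since no such convention exists yet:

import os

# $TMPDIR is standard; APP_LOG_DIR and the database variables are
# hypothetical examples of what an environment might inject.
tmp_dir = os.environ.get('TMPDIR', '/tmp')
log_dir = os.environ.get('APP_LOG_DIR', os.path.join(tmp_dir, 'log'))
db_host = os.environ.get('APP_DB_HOST', 'localhost')
db_name = os.environ.get('APP_DB_NAME', 'myapp')

def open_log(name):
    # Only write inside directories the environment handed us.
    if not os.path.exists(log_dir):
        os.makedirs(log_dir)
    return open(os.path.join(log_dir, name), 'a')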

Details?

To get more concrete, here’s what I can imagine from a small application description; probably YAML would be a good format:


platform: python, wsgi
require:
  os: posix
  python: <3
  rpm: m2crypto
  deb: python-m2crypto
  pip: requirements.txt
python:
  paths: vendor/
wsgi:
  app: myapp.wsgiapp:application
 

I imagine platform as kind of a series of mixins. This system doesn’t really need to be Python-specific; when creating something similar for Silver Lining I found PHP support relatively easy to add (handling languages that aren’t naturally portable, like Go, might be more of a stretch). So python is one of the features this application uses. You can imagine lots of modularization for other features, but it would be easy and unproductive to get distracted by that.

The application has certain requirements of its environment, like the version of Python and the general OS type. The application might also require libraries, ideally only libraries that are not portable (M2Crypto being an example). Modern package management works pretty nicely for this stuff, so I believe relying on system packages as a first try is best (I’d offer requirements.txt as a fallback, not as the primary way to handle dependencies).

I think it’s much more reliable if applications primarily rely on bundling their dependencies directly (i.e., using a vendor directory). The tool support for this is a bit spotty, but I believe this package format could clarify the problems and solutions. Here is an example of how you might set up a virtualenv environment for managing vendor libraries (you then do not need virtualenv to use those same libraries), and do so in a way where you can check the results into source control. It’s kind of complicated, but works (well, almost works – bin/ files need fixing up). It’s a start at least.
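
On the consuming side, here’s a minimal sketch (assuming the bundled libraries live in a vendor/ directory, as in the description above) of how the application itself could put those libraries on sys.path without needing virtualenv at runtime:

import os
import sys

# Assumes the bundled libraries sit in vendor/ next to this file,
# checked into source control alongside the application code.
HERE = os.path.dirname(os.path.abspath(__file__))
VENDOR = os.path.join(HERE, 'vendor')
if VENDOR not in sys.path:
    sys.path.insert(0, VENDOR)

# From here on, imports resolve against the vendored copies first.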

Support Library

On the environment side we need a good support library. pywebapp has some of the basic features, though it is quite incomplete. I imagine a library looking something like this:


from apppackage import AppPackage
app = AppPackage('/var/apps/app1.2012.02.11')
# Maybe a little Debian support directly:
subprocess.call(['apt-get', 'install'] +
                app.config['require']['deb'])
# Or fall back to virtualenv/pip
app.create_virtualenv('/var/app/venvs/app1.2012.02.11')
app.install_pip_requirements()
wsgi_app = app.load_object(app.config['wsgi']['app'])
 

You can imagine building hosting services on this sort of thing, or setting up continuous integration servers (app.run_command(app.config['unit_test'])), and so forth.
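
To give a flavor of that, here’s a sketch of a tiny continuous integration loop built on the same hypothetical apppackage API; none of these names exist yet, they just follow the snippet above, and I’m assuming run_command returns the command’s exit status:

import glob
import sys

from apppackage import AppPackage  # hypothetical, as above

failures = []
for path in glob.glob('/var/apps/*'):
    app = AppPackage(path)
    command = app.config.get('unit_test')
    if not command:
        continue
    # Assumed: run_command returns the exit status of the command it ran.
    if app.run_command(command) != 0:
        failures.append(path)

sys.exit(1 if failures else 0)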

Local Development

If designed properly, I think this format is as usable for local development as it is for deployment. It should be able to run directly from a checkout, with the "development environment" being an environment just like any other.

This rules out, or at least makes less exciting, the use of zip files or tarballs as a package format. The only justification I see for using such archives is that they are easy to move around; but we live in the FUTURE and there are many ways to move directories around and we don’t need to cater to silly old fashions. If that means a script that creates a tarball, FTPs it to another computer, and there it is unzipped, then fine – this format should not specify anything about how you actually deliver the files. But let’s not worry about copying WARs.
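
If you did want such a script, it could be as dull as this sketch; the paths and host here are made up, the point being that the archive is purely a transport detail, not part of the format:

import subprocess
import tarfile

APP_DIR = '/home/me/src/myapp'                # hypothetical local checkout
ARCHIVE = '/tmp/myapp.tar.gz'
REMOTE = 'me@example.com:/tmp/myapp.tar.gz'   # hypothetical destination

# Pack the application directory...
tar = tarfile.open(ARCHIVE, 'w:gz')
tar.add(APP_DIR, arcname='myapp')
tar.close()

# ...copy it to the other machine (scp here, but rsync or FTP work too)...
subprocess.check_call(['scp', ARCHIVE, REMOTE])

# ...and on the other side it just gets unpacked back into a directory.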

Packaging
Python
Silver Lining
Web


A Python Web Application Package and Format (we should make one)

At PyCon there was an open space about deployment, and the idea of drop-in applications (Java-WAR-style).

I generally get pessimistic about 80% solutions, and dropping in a WAR file feels like an 80% solution to me. I’ve used the Hudson/Jenkins installer (which I think is specifically a project that got WARs on people’s minds), and in a lot of ways that installer is nice, but it’s also kind of wonky, it makes configuration unclear, it’s not always clear when it installs or configures itself through the web, and when you have to do this at the system level, nor is it clear where it puts files and data, etc. So a great initial experience doesn’t feel like a great ongoing experience to me — and it doesn’t have to be that way. If those were necessary compromises, sure, but they aren’t. And because we don’t have WAR files, if we’re proposing to make something new, then we have every opportunity to make things better.

So the question then is what we’re trying to make. To me: we want applications that are easy to install, that are self-describing, self-configuring (or at least guide you through configuration), reliable with respect to their environment (not dependent on system tweaking), upgradable, and respectful of persistence (the data that outlives the application install). A lot of this can be done by the "container" (to use Java parlance; or "environment") — if you just have the app packaged in a nice way, the container (server environment, hosting service, etc) can handle all the system-specific things to make the application actually work.

At which point I am of course reminded of my Silver Lining project, which defines something very much like this. Silver Lining isn’t just an application format, and things aren’t fully extracted along these lines, but it’s pretty close and it addresses a lot of important issues in the lifecycle of an application. To be clear: Silver Lining is an application packaging format, a server configuration library, a cloud server management tool, a persistence management tool, and a tool to manage the application with respect to all these services over time. It is a bunch of things, maybe too many things, so it is not unreasonable to pick out a smaller subset to focus on. Maybe an easy place to start (and good for Silver Lining itself) would be to separate at least the application format (and tools to manage applications in that state, e.g., installing new libraries) from the tools that make use of such applications (deploy, etc).

Some opinions I have on this format, exemplified in Silver Lining:

  • It’s not zipped or a single file, unlike WARs. Uploading zip files is not a great API. Geez. I know there’s this desire to "just drop in a file"; but there’s no getting around the fact that "dropping a file" becomes a deployment protocol and it’s an incredibly impoverished protocol. The format is also not subtly git-based (ala Heroku) because git push is not a good deployment protocol.
  • But of course there isn’t really any deployment protocol inferred by a format anyway, so maybe I’m getting ahead of myself ;) I’m saying a tool that deploys should take as an argument a directory, not a single file. (If the tool then zips it up and uploads it, fine!)
  • Configuration "comes from the outside". That is, an application requests services, and the container tells the application where those services are. For Silver Lining I’ve used environmental variables. I think this one point is really important — the container tells the application. As a counter-example, an application that comes with a Puppet deployment recipe is essentially telling the server how to arrange itself to suit the application. This will never be reliable or simple!
  • The application indicates what "services" it wants; for instance, it may want to have access to a MySQL database. The container then provides this to the application. In practice this means installing the actual packages, but also creating a database and setting up permissions appropriately. The alternative is never having any dependencies, meaning you have to use SQLite databases or ad hoc structures, etc. But in fact installing databases really isn’t that hard these days.
  • All persistence has to use a service of some kind. If you want to be able to write to files, you need to use a file service. This means the container is fully aware of everything the application is leaving behind. All the various paths an application should use are given in different environmental variables (many of which don’t need to be invented anew, e.g., $TMPDIR).
  • It uses vendor libraries exclusively for Python libraries. That means the application bundles all the libraries it requires. Nothing ever gets installed at deploy-time. This is in contrast to using a requirements.txt list of packages at deployment time. If you want to use those tools for development that’s fine, just not for deployment.
  • There is also a way to indicate other libraries you might require; e.g., you might want lxml, or even something that isn’t quite a library, like git (if you are making a github clone). You can’t do those as vendor libraries (they include non-portable binaries). Currently in Silver Lining the application description can contain a list of Ubuntu package names to install. Of course that would have to be abstracted some.
  • You can ask for scripts or a request to be invoked for an application after an installation or deployment. It’s lame to test is-this-app-installed on every request, which is the frequent alternative. Also, it gives the application the chance to signal that the installation failed.
  • It has a very simple (possibly/probably too simple) sense of configuration. You don’t have to use this if you make your app self-configuring (i.e., build in a web-accessible settings screen), but in practice it felt like some simple sense of configuration would be helpful.
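
Here’s a sketch of what "the container tells the application" can look like in the simplest case. The CONFIG_PG_* names follow the convention mentioned elsewhere in these posts, but the exact set of variables here is illustrative, not a spec:

import os
import subprocess

# Illustrative only: the container decides where the services live and
# hands that to the application through environment variables.
env = dict(os.environ)
env.update({
    'CONFIG_PG_DBNAME': 'app1',
    'CONFIG_PG_USER': 'app1',
    'CONFIG_PG_PASSWORD': 'secret',
    'CONFIG_PG_HOST': 'localhost',
    'TMPDIR': '/var/lib/apps/app1/tmp',
})

# The application never chooses these values; it only reads them.
subprocess.check_call(['python', 'run_app.py'], env=env)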

Things that could be improved:

  • There are some places where you might be encouraged to use routines from the silversupport package. There are very few! But maybe an alternative could be provided for these cases.
  • A little convention-over-configuration is probably suitable for the bundled libraries; silver includes tools to manage things, but it gets a little twisty. When creating a new project I find myself creating several .pth files, special customizing modules, etc. Managing vendor libraries is also not obvious.
  • Services are IMHO quite important and useful, but also need to be carefully specified.
  • There’s a bunch of runtime expectations that aren’t part of the format, but in practice would be part of how the application is written. For instance, I make sure each app has its own temporary directory, and that it is cleared on update. If you keep session files in that location, and you expect the environment to clean up old sessions — well, either all environments should do that, or none should.
  • The process model is not entirely clear. I tried to simply define one process model (unthreaded, multiple processes), but I’m not sure that’s suitable — most notably, multiple processes have a significant memory impact compared to threads. An application should at least be able to indicate what process models it accepts and prefers.
  • Static files are all convention over configuration — you put static files under static/ and then they are available. So static/style.css would be at /style.css. I think this is generally good, but putting all static files under one URL path (e.g., /media/) can be good for other reasons as well. Maybe there should be conventions for both.
  • Cron jobs are important. Though maybe they could just be yet another kind of service? Many extra features could be new services.
  • Logging is also important; Silver Lining attempts to handle that somewhat, but it could be specified much better.
  • Silver Lining also supports PHP, which seemed to cause a bit of stress. But just ignore that. It’s really easy to ignore.

There is a description of the configuration file for apps. The environmental variables are also notably part of the application’s expectations. The file layout is explained (together with a bunch of Silver Lining-specific concepts) in Development Patterns. Besides all that there is admittedly some other stuff that is only really specified in code; but in Silver Lining’s defense, specified in code is better than unspecified ;) App Engine provides another example of an application format, and would be worth using as a point of discussion or contrast (I did that myself when writing Silver Lining).

Discussing WSGI stuff with Ben Bangert at PyCon he noted that he didn’t really feel like the WSGI pieces needed that much more work, or at least that’s not where the interesting work was — the interesting work is in the tooling. An application format could provide a great basis for building this tooling. And I honestly think that the tooling has been held back more by divergent patterns of development than by the difficulty of writing the tools themselves; and a good, general application format could fix that.

Packaging
Programming
Python
Web


Core Competencies, Silver Lining, Packaging

I’ve been leaning heavily on Ubuntu and Debian packages for Silver Lining. Lots of "configuration management" problems are easy when you rely on the system packages… not for any magical reason, but because the package maintainers have done the configuration management for me. This includes dependencies, but also things like starting up services in the right order, and of course upgrades and security fixes. For this reason I really want everything that isn’t part of Silver Lining’s core to be in the form of a package.

But then, why isn’t everything a package? Arguably some pieces really should be, I’m just too lazy and developing those pieces too quickly to do it. But more specifically, why aren’t applications packages? It’s not the complexity really — the tool could encapsulate all the complexity of packaging if I chose. I always knew intuitively it didn’t make sense, but it took me a while to decide quite why.

There are a lot of specific reasons, but the overriding reason is that I don’t want to outsource a core function. Silver Lining isn’t a tool to install database servers. That is something it does, but that’s not its purpose, and so it can install a server with apt-get install mysql-server-5.1. In doing so it’s saving a lot of work, it’s building on the knowledge of people more interested in MySQL than I, but it’s also deferring a lot of control. When it comes to actually deploying applications I’m not willing to have Silver Lining defer to another system, because deploying applications is the entire point of the tool.

There are many specific reasons. I want multiple instances of an application deployed simultaneously. I can optimize the actual code delivery (using rsync) instead of delivering an entire bundle of code. The setup is specific to the environment and configuration I’ve set up on servers, it’s not a generic package that makes sense on a generic Ubuntu system. I don’t want any central repository of packages or a single place where releases have to go through. I want to allow for the possibility of multiple versions of an application running simultaneously. I’m coordinating services even as I deploy the application, something which Debian packages try to do a little, but don’t do consistently or well. But while there’s a list of reasons, it doesn’t matter that much — there’s no particular size of list that scared me off, and if I’m misinformed about the way packages work or if there are techniques to avoid these problems it doesn’t really matter to me… the real reason is that I don’t want to defer control over the one thing that Silver Lining must do well.

Packaging
Silver Lining


toppcloud renamed to Silver Lining

After some pondering at PyCon, I decided on a new name for toppcloud: Silver Lining. I’ll credit a mysterious commenter "david" with the name idea. The command line is simply silver; silver update has a nice ring to it.

There’s a new site: cloudsilverlining.org; not notably different than the old site, just a new name. The product is self-hosting now, using a simple app that runs after every commit to regenerate the docs, and with a small extension to Silver Lining itself (to make it easier to host static files). Now that it has a real name I also gave it a real mailing list.

Silver Lining also has its first test. Not an impressive test, but a test. I’m hoping with a VM-based libcloud backend that a full integration test can run in a reasonable amount of time. Some unit tests would be possible, but so far most of the bugs have been interaction bugs so I think integration tests will have to pull most of the weight. (A continuous integration rig will be very useful; I am not sure if Silver Lining can self-host that, though it’d be nice/clever if it could.)

Packaging
Programming
Python
Silver Lining
Web


A new way to deploy web applications

Deployment is one of the things I like least about development, and yet without deployment the development doesn’t really matter.

I’ve tried a few things (e.g. fassembler), built a few things (virtualenv, pip), but deployment just sucked less as a result. Then I got excited about App Engine; everyone else was getting excited about "scaling", but really I was excited about an accessible deployment process. When it comes to deployment App Engine is the first thing that has really felt good to me.

But I can’t actually use App Engine. I was able to come to terms with the idea of writing an application to the platform, but there are limits… and with App Engine there were simply too many limits. Geo stuff on App Engine is at best a crippled hack, I miss lxml terribly, I never hated relational databases, and almost nothing large works without some degree of rewriting. Sometimes you can work around it, but you can never be sure you won’t hit some wall later. And frankly working around the platform is tiring and not very rewarding.


So… App Engine seemed neat, but I couldn’t use it, and deployment was still a problem.

What I like about App Engine: an application is just files. There’s no build process, no fancy copying of things in weird locations, nothing like that; you upload files, and uploading files just works. Also, you can check everything into version control. Not just your application code, but every library you use, the exact files that you installed. I really wanted a system like that.

At the same time, I started looking into "the cloud". It took me a while to get a handle on what "cloud computing" really means. What I learned: don’t overthink it. It’s not magic. It’s just virtual private servers that can be requisitioned automatically via an API, and are billed on a short time cycle. You can expand or change the definition a bit, but this definition is the one that matters to me. (I’ve also realized that I cannot get excited about complicated solutions; only once I realized how simple cloud computing is could I really get excited about the idea.)

Given the modest functionality of cloud computing, why does it matter? Because with a cloud computing system you can actually test the full deployment stack. You can create a brand-new server, identical to all servers you will create in the future; you can set this server up; you can deploy to it. You get it wrong, you throw away that virtual server and start over from the beginning, fixing things until you get it right. Billing is important here too; with hourly billing you pay cents for these tests, and you don’t need a pool of ready servers because the cloud service basically manages that pool of ready servers for you.

Without "cloud computing" we each too easily find ourselves in a situation where deployments are ad hoc, server installations develop over time, and servers and applications are inconsistent in their configuration. Cloud computing makes servers disposable, which means we can treat them in consistent ways, testing our work as we go. It makes it easy to treat operations with the same discipline as software.

Given the idea from App Engine, and the easy-to-use infrastructure of a cloud service, I started to script together something to manage the servers and start up applications. I didn’t know what exactly I wanted to do to start, and I’m not completely sure where I’m going with this. But on the whole this feels pretty right. So I present the provisionally-named: toppcloud (Update: this has since been renamed Silver Lining).


How it works: first you have a directory of files that defines your application. This probably includes a checkout of your "application" (let’s say in src/mynewapp/), and I find it also useful to use source control on the libraries (which are put in lib/python/). There’s a file, app.ini, that defines some details of the application (very similar to app.yaml).

While app.ini is a (very minimal) description of the application, there is no description of the environment. You do not specify database connection details, for instance. Instead your application requests access to a database service. For instance, one of these services is a PostgreSQL/PostGIS database (which you get if you put service.postgis in your app.ini file). If you ask for that then there will be environmental variables, CONFIG_PG_DBNAME etc., that will tell your application how to connect to the database. (For local development you can provide your own configuration, based on how you have PostgreSQL or some other service installed.)
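
On the application side that ends up looking roughly like this sketch. psycopg2 is just an example driver, and only CONFIG_PG_DBNAME is named above; I’m assuming the other variables follow the same pattern:

import os

import psycopg2  # any PostgreSQL driver would do; psycopg2 is an example

# The environment tells the application where its database is;
# the application never hard-codes connection details.
conn = psycopg2.connect(
    host=os.environ['CONFIG_PG_HOST'],
    dbname=os.environ['CONFIG_PG_DBNAME'],
    user=os.environ['CONFIG_PG_USER'],
    password=os.environ['CONFIG_PG_PASSWORD'],
)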

The standard setup is also a virtualenv environment. It is set up so every time you start that virtualenv environment you’ll get those configuration-related environmental variables. This means your application configuration is always present, your services always available. It’s available in tests just like it is during a request. Django accomplishes something similar with the (much maligned) $DJANGO_SETTINGS_MODULE but toppcloud builds it into the virtualenv environment instead of the shell environment.

And how is the server setup? Much like with App Engine that is merely an implementation detail. Unlike App Engine that’s an implementation detail you can actually look at and change (by changing toppcloud), but it’s not something you are supposed to concern yourself with during regular application development.

The basic lifecycle using toppcloud looks like:

toppcloud create-node
Create a new virtual server; you can create any kind of supported server, but only Ubuntu Jaunty or Karmic are supported (and Jaunty should probably be dropped). This step is where the "cloud" part actually ends. If you want to install a bare Ubuntu onto an existing physical machine that’s fine too — after toppcloud create-node the "cloud" part of the process is pretty much done. Just don’t go using some old Ubuntu install; this tool is for clean systems that are used only for toppcloud.
toppcloud setup-node
Take that bare Ubuntu server and set it up (or update it) for use with toppcloud. This installs all the basic standard stuff (things like Apache, mod_wsgi, Varnish) and some management scripts that toppcloud runs. This is written to be safe to run over and over, so upgrading and setting up a machine are the same. It needs to be a bare server.
toppcloud init path/to/app/
Setup a basic virtualenv environment with some toppcloud customizations.
toppcloud serve path/to/app
Serve up the application locally.
toppcloud update --host=test.example.com path/to/app/
This creates or updates an application at the given host. It edits /etc/hosts so that the domain is locally viewable.
toppcloud run test.example.com script.py
Run a script (from bin/) on a remote server. This allows you to run things like django-admin.py syncdb.

There’s a few other things — stuff to manage the servers and change around hostnames or the active version of applications. It’s growing to fit a variety of workflows, but I don’t think its growth is unbounded.


So… this is what toppcloud does. From the outside it doesn’t do a lot. From the inside it’s not actually that complicated either. I’ve included a lot of constraints in the tool but I think it offers an excellent balance. The constraints are workable for applications (insignificant for many applications), while still exposing a simple and consistent system that’s easier to reason about than a big-ball-of-server.

Some of the constraints:

  1. Binary packages are supported via Ubuntu packages; you only upload portable files. If you need a library like lxml, you need to request that package (python-lxml) to be installed in your app.ini. If you need a version of a binary library that is not yet packaged, I think creating a new deb is reasonable.
  2. There is no Linux distribution abstraction, but I don’t care.
  3. There is no option for the way your application is run — there’s one way applications are run, because I believe there is a best practice. I might have gotten the best practice wrong, but that should be resolved inside toppcloud, not inside applications. Is Varnish a terrible cache? Probably not, but if it is we should all be able to agree on that and replace it. If there are genuinely different needs then maybe additional application or deployment configuration will be called for — but we shouldn’t add configuration just because someone says there is a better practice (and a better practice that is not universally better); there must be justifications.
  4. By abstracting out services and persistence some additional code is required for each such service, and that code is centralized in toppcloud, but it means we can also start to add consistent tools usable across a wide set of applications and backends.
  5. All file paths have to be relative, because files get moved around. I know of some particularly problematic files (e.g., .pth files), and toppcloud fixes these automatically. Mostly this isn’t so hard to do.

These particular compromises are ones I have not seen in many systems (and I’ve started to look more). App Engine I think goes too far with its constraints. Heroku is close, but closed source.

This is different than a strict everything-must-be-a-package strategy. This deployment system is light and simple and takes into account reasonable web development workflows. The pieces of an application that move around a lot are all well-greased and agile. The parts of an application that are better to Do Right And Then Leave Alone (like Apache configuration) are static.

Unlike generalized systems like buildout this system avoids "building" entirely, making deployment a simpler and lower risk action, leaning on system packages for the things they do best. Other open source tools emphasize a greater degree of flexibility than I think is necessary, allowing people to encode exploratory service integration into what appears to be an encapsulated build (I’m looking at you buildout).

Unlike requirement sets and packaging and versioning libraries, this makes all the Python libraries (typically the most volatile libraries) explicit and controlled, and can better ensure that small updates really are small. It doesn’t invalidate installers and versioning, but it makes that process even more explicit and encourages greater thoughtfulness.

Unlike many production-oriented systems (what I’ve seen in a lot of "cloud" tools) this incorporates both the development environment and production environment; but unlike some developer-oriented systems this does not try to normalize everyone’s environment and instead relies on developers to set up their systems however is appropriate. And unlike platform-neutral systems this can ensure an amount of reliability and predictability through extremely hard requirements (it is deployed on Ubuntu Jaunty/Karmic only).

But it’s not all constraints. Toppcloud is solidly web framework neutral. It’s even slightly language neutral. Though it does require support code for each persistence technique, it is fairly easy to do, and there are no requirements for "scalable systems"; I think unscalable systems are a perfectly reasonable implementation choice for many problems. I believe a more scalable system could be built on this, but as a deployment and development option, not a starting requirement.

So far I’ve done some deployments using toppcloud; not a lot, but some. And I can say that it feels really good; lots of rough edges still, but the core concept feels really right. I’ve made a lot of sideways attacks on deployment, and a few direct attacks… sometimes I write things that I think are useful, and sometimes I write things that I think are right. Toppcloud is at the moment maybe more right than useful. But I genuinely believe this is (in theory) a universally appropriate deployment tool.


Alright, so now you think maybe you should look more at toppcloud…

Well, I can offer you a fair amount of documentation. A lot of that documentation refers to design, and a bit of it to examples. There’s also a couple of projects you can look at; they are all small, but:

  • Frank (which will be interactivesomerville.org), a Django/Pinax project (Pinax was a bit tricky). This is probably the largest project: a volunteer-written application for collecting community feedback on the Boston Greenline project. If that sounds interesting to you, you might want to chip in on the development (if so, check out the wiki).
  • Neighborly, with minimal functionality (we need to run more sprints) but an installation story.
  • bbdocs, a very simple Bitbucket document generator that builds the toppcloud site.
  • geodns, another simple no-framework PostGIS project.

Now, the letdown. One thing I cannot offer you is support. THERE IS NO SUPPORT. I cannot now, and I might never really be able to support this tool. This tool is appropriate for collaborators, for people who like the idea and are ready to build on it. If it grows well I hope that it can grow a community, I hope people can support each other. I’d like to help that happen. But I can’t do that by bootstrapping it through unending support, because I’m not good at it and I’m not consistent and it’s unrealistic and unsustainable. This is not an open source dead drop. But it’s also not My Future; I’m not going to build a company around it, and I’m not going to use all my free time supporting it. It’s a tool I want to exist. I very much want it to exist. But even very much wanting something is not the same as being an undying champion, and I am not an undying champion. If you want to tell me what my process should be, please do!


If you want to see me get philosophical about packaging and deployment and other stuff like that, see my upcoming talk at PyCon.

Packaging
Programming
Python
Silver Lining
Web


Using pip Requirements

Following onto a set of recent posts (from James, me, then James again), Martijn Faassen wrote a description of Grok’s version management. Our ideas are pretty close, but he’s using buildout, and I’ll describe how to do the same things with pip.

Here’s a kind of development workflow that I think works well:

  • A framework release is prepared. Ideally there’s a buildbot that has been running (as Pylons has, for example), so the integration has been running for a while.
  • People make sure there are released versions of all the important components. If there are known conflicts between pieces, libraries and the framework update their install_requires in their setup.py files to make sure people don’t use conflicting pieces together.
  • Once everything has been released, there is a known set of packages that work together. Using a buildbot maybe future versions will also work together, but they won’t necessarily work together with applications built on the framework. And breakage can also occur regardless of a buildbot.
  • Also, people may have versions of libraries already installed, but just because they’ve installed something doesn’t mean they really mean to stay with an old version. While known conflicts have been noted, there’s going to be lots of unknown conflicts and future conflicts.
  • When starting development with a framework, the developer would like to start with some known-good set, which is a set that can be developed by the framework developers, or potentially by any person. For instance, if you extend a public framework with an internal framework (or even a public sub-framework like Pinax) then the known-good set will be developed by a different set of people.
  • As an application is developed, the developer will add on other libraries, or use some of their own libraries. Development will probably occur at the trunk/tip of several libraries as they are developed together.
  • A developer might upgrade the entire framework, or just upgrade one piece (for instance, to get a bug fix they are interested in, or follow a branch that has functionality they care about). The developer doesn’t necessarily have the same notion of "stable" and "released" as the core framework developers have.
  • At the time of deployment the developer wants to make sure all the pieces are deployed together as they’ve tested them, and how they know them to work. At any time, another developer may want to clone the same set of libraries.
  • After initial deployment, the developer may want to upgrade a single component, if only to test that an upgrade works, or if it resolves a bug. They may test out combinations only to throw them away, and they don’t want to bump versions of libraries in order to deploy new combinations.

This is the kind of development pattern that requirement files are meant to assist with. They can provide a known-good set of packages. Or they can provide a starting point for an active line of development. Or they can provide a historical record of how something was put together.

The easy way to start a requirement file for pip is just to put the packages you know you want to work with. For instance, we’ll call this project-start.txt:


Pylons
-e svn+http://mycompany/svn/MyApp/trunk#egg=MyApp
-e svn+http://mycompany/svn/MyLibrary/trunk#egg=MyLibrary
 

You can plug away for a while, and maybe you decide you want to freeze the file. So you do:


$ pip freeze -r project-start.txt project-frozen.txt
 

By using -r project-start.txt you give pip freeze a template for it to start with. From that, you’ll get project-frozen.txt that will look like:


Pylons==0.9.7
-e svn+http://mycompany/svn/MyApp/trunk@1045#egg=MyApp
-e svn+http://mycompany/svn/MyLibrary/trunk@1058#egg=MyLibrary

## The following requirements were added by pip --freeze:
Beaker==0.2.1
WebHelpers==0.9.1
nose==1.4
# Installing as editable to satisfy requirement INITools==0.2.1dev-r3488:
-e svn+http://svn.colorstudy.com/INITools/trunk@3488#egg=INITools-0.2.1dev_r3488
 

At that point you might decide that you don’t care about the nose version, or you might have installed something from trunk when you could have used the last release. So you go and adjust some things.

Martijn also asks: how do you have framework developers maintain one file, and then also have developers maintain their own lists for their projects?

You could start with a file like this for the framework itself. Pylons for instance could ship with something like this. To install Pylons you could then do:


$ pip -E MyProject install \
>    -r http://pylonshq.com/0.9.7-requirements.txt
 

You can also download that file yourself, add some comments, rename the file and add your project to it, and use that. When you freeze, the order of the packages and any comments will be preserved, so you can keep track of what changed. Also it should be amenable to source control, and diffs would be sensible.

You could also use indirection, creating a file like this for your project:


-r http://pylonshq.com/0.9.7-requirements.txt
-e svn+http://mycompany/svn/MyApp/trunk#egg=MyApp
-e svn+http://mycompany/svn/MyLibrary/trunk#egg=MyLibrary
 

That is, requirements files can refer to each other. So if you want to maintain your own requirements file alongside the development of an upstream requirements file, you could do that.

Packaging
Python


A Few Corrections To “On Packaging”

James Bennett recently wrote an article on Python packaging and installation, and Setuptools. There’s a lot of issues, and writing up my thoughts could take a long time, but I thought at least I should correct some errors, specifically category errors. Figuring out where all the pieces in Setuptools (and pip and virtualenv) fit is difficult, so I don’t blame James for making some mistakes, but in the interest of clarifying the discussion…

I will start with a kind of glossary:

Distribution:
This is something-with-a-setup.py. A tarball, zip, a checkout, etc. Distributions have names; this is the name in setup(name="...") in the setup.py file. They have some other metadata too (description, version, etc), and Setuptools adds to that metadata some. Distutils doesn’t make it very easy to add to the metadata — it’ll whine a little about things it doesn’t know, but won’t do anything with that extra data. Fixing this problem in Distutils is an important aspect of Setuptools, and part of what makes Distutils itself unsuitable as a basis for good library management.
package/module:
This is something you import. It is not the same as a distribution, though usually a distribution will have the same name as a package. In my own libraries I try to name the distribution with mixed case (like Paste) and the package with lower case (like paste). Keeping the terminology straight here is very difficult; and usually it doesn’t matter, but sometimes it does.
Setuptools The Distribution:
This is what you install when you install Setuptools. It includes several pieces that Phillip Eby wrote, that work together but are not strictly a single thing.
setuptools The Package:
This is what you get when you do import setuptools. Setuptools largely works by monkeypatching distutils, so simply importing setuptools activates its functionality from then on. This package is entirely focused on installation and package management, it is not something you should use at runtime (unless you are installing packages as your runtime, of course).
pkg_resources The Module:
This is also included in Setuptools The Distribution, and is for use at runtime. This is a single module that provides the ability to query what distributions are installed, metadata about those distributions, and information about the location where they are installed. It also allows distributions to be "activated". A distribution can be available but not activated. Activating a distribution means adding its location to sys.path, and probably you’ve noticed how long sys.path is when you use easy_install. Almost everything that allows different libraries to be installed, or allows different versions of libraries, does it through some management of sys.path. pkg_resources also allows for generic access to "resources" (i.e., non-code files), and lets those resources be in zip files. pkg_resources is safe to use, it doesn’t do any of the funny stuff that people get annoyed with. (There’s a small usage sketch after this glossary.)
easy_install:
This is also in Setuptools The Distribution. The basic functionality it provides is that, given a name, it can search for a package with that distribution name, optionally also satisfying a version requirement. It then downloads the package and installs it (using setup.py install, but with the setuptools monkeypatches in place). After that, it checks the newly installed distribution to see if it requires any other libraries that aren’t yet installed, and if so it installs them.
Eggs the Distribution Format:
These are zip files that Setuptools creates when you run python setup.py bdist_egg. Unlike a tarball, these can be binary packages, containing compiled modules, and generally contain .pyc files (which are portable across platforms, but not Python versions). This format only includes files that will actually be installed; as a result it does not include doc files or setup.py itself. All the metadata from setup.py that is needed for installation is put in files in a directory EGG-INFO.
Eggs the Installation Format:
Eggs the Distribution Format are a subset of the Installation Format. That is, if you put an Egg zip file on the path, it is installed, no other process is necessary. But the Installation Format is more general. To have an egg installed, you either need something like DistroName-X.Y.egg/ on the path, and then an EGG-INFO/ directory under that with the metadata, or a path like DistroName.egg-info/ with the metadata directly in that directory. This metadata can exist anywhere, and doesn’t have to be directly alongside the actual Python code. Egg directories are required for pkg_resources to activate and deactivate distributions, but otherwise they aren’t necessary.
pip:
This is an alternative to easy_install. It works somewhat differently than easy_install, but not much. Mostly it is better than easy_install, in that it has some extra features and is easier to use. Unlike easy_install, it downloads all distributions up-front, and generates the metadata to read distribution and version requirements. It uses Setuptools to generate this metadata from a setup.py file, and uses pkg_resources to parse this metadata. It then installs packages with the setuptools monkeypatches applied. It just happens to run python setup.py install --single-version-externally-managed, which gets Setuptools to install packages in a flatter manner, with Distro.egg-info/ directories alongside the package. Pip installs eggs! I’ve heard the many complaints about easy_install (and I’ve had many myself), but ultimately I think pip does well by just fixing a few small issues. Pip is not a repudiation of Setuptools or the basic mechanisms that easy_install uses.
PoachEggs:
This is a defunct package that had some of the features of pip (particularly requirement files) but used easy_install for installation. Don’t bother with this, it was just a bridge to get to pip.
virtualenv:
This is a little hack that creates isolated Python environments. It’s based on virtual-python.py, which is something I wrote based on some documentation notes PJE wrote for Setuptools. Basically virtualenv just creates a bin/python interpreter that has its own value of sys.prefix, but uses the system Python and standard library. It also installs Setuptools to make it easier to bootstrap the environment (because bootstrapping Setuptools is itself a bit tedious). I’ll add pip to it too sometime. Using virtualenv you don’t have to worry about different library versions, because for any one environment you will probably only need one version of a library. On any one machine you probably need different versions, which is why installing packages system-wide is problematic for most libraries. (I’ve been meaning to write a post on why I think using system packaging for libraries is counter-productive, but that’ll wait for another time.)
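
Since pkg_resources is the piece meant for runtime use, here’s a small sketch of the sort of thing it’s for. These are real pkg_resources calls; the distribution name and the resource path are just examples:

import pkg_resources

# Query an installed distribution and its metadata.
dist = pkg_resources.get_distribution('Paste')
print('%s %s installed at %s' % (dist.project_name, dist.version, dist.location))

# Require a distribution, activating it (putting it on sys.path) if it
# is installed but not yet active.
pkg_resources.require('Paste>=1.0')

# Load a non-code resource bundled with a package, even if that package
# is installed as a zipped egg ('mypackage' and the path are hypothetical).
data = pkg_resources.resource_string('mypackage', 'data/defaults.ini')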

So… there’s the pieces involved, at least the ones I can remember now. And I haven’t really discussed .pth files, entry points, sys.path trickery, site.py, distutils.cfg… sadly this is a complex state of affairs, but it was also complex before Setuptools.

There are a few things that I think people really dislike about Setuptools.

First, zip files. Setuptools prefers zip files, for reasons that won’t mean much to you, and maybe are more historical than anything. When a distribution doesn’t indicate if it is zip-safe, Setuptools looks at the code and sees if it uses __file__, and if not it presumes that the code is probably zip-safe. The specific problem James cites is what appears to be a bug in Django, that Django looks for code and can’t traverse into zip files in the same way that Python itself can. Setuptools didn’t itself add anything to Python to make it import zip files; that functionality was added to Python some time before. The zipped eggs that Setuptools installs use existing (standard!) Python functionality.
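
The practical difference is roughly this: code that builds filesystem paths off __file__ assumes it is sitting in a real directory, while pkg_resources works either way ('mypackage' and 'data.txt' are hypothetical here):

import os
import pkg_resources

# Fragile inside a zipped egg: __file__ points into the zip archive,
# so there is no real directory to open files from.
here = os.path.dirname(__file__)
fragile = open(os.path.join(here, 'data.txt')).read()

# Works whether the package is zipped or not.
zip_safe = pkg_resources.resource_string('mypackage', 'data.txt')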

That said, I don’t think zipping libraries up is all that useful, and while it should work, it doesn’t always, and it makes code harder to inspect and understand. So since it’s not that useful, I’ve disabled it when pip installs packages. I also have had it disabled on my own system for years now, by creating a distutils.cfg file with [easy_install] zip_ok = False in it. Sadly App Engine is forcing me to use zip files again, because of its absurdly small file limits… but that’s a different topic. (There is an experimental pip zip command mostly intended for App Engine.)

Another pain point is version management with setup.py and Setuptools. Indeed it is easy to get things messed up, and it is easy to piss people off by overspecifying, and sometimes things can get in a weird state for no good reason (often because of easy_install’s rather naive leap-before-you-look installation order). Pip fixes that last point, but it also tries to suggest more constructive and less painful ways to manage other pieces.

Pip requirement files are an assertion of versions that work together. setup.py requirements (the Setuptools requirements) should contain two things: 1: all the libraries used by the distribution (without which there’s no way it’ll work) and 2: exclusions of the versions of those libraries that are known not to work. setup.py requirements should not be viewed as an assertion that by satisfying those requirements everything will work, just that it might work. Only the end developer, testing the system together, can figure out if it really works. Then pip gives you a way to record that working set (using pip freeze), separate from any single distribution or library.
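
As a sketch, a setup.py along those lines declares what the distribution needs and rules out versions known to be broken, without pretending to pin a known-good set; the names and version numbers here are made up:

from setuptools import setup, find_packages

setup(
    name='MyApp',
    version='0.1',
    packages=find_packages(),
    install_requires=[
        # 1: libraries the distribution cannot work without...
        'Pylons',
        'SQLAlchemy',
        # 2: ...excluding only versions known not to work. Pinning an
        # exact working set is the job of a pip requirements file
        # (pip freeze), not of setup.py.
        'WebOb!=0.9.5',
    ],
)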

There’s also a lot of conflicts between Setuptools and package maintainers. This is kind of a proxy war between developers and sysadmins, who have very different motivations. It deserves a post of its own, but the conflicts are about more than just how Setuptools is implemented.

I’d love if there was a language-neutral library installation and management tool that really worked. Linux system package managers are absolutely not that tool; frankly it is absurd to even consider them as an alternative. So for now we do our best in our respective language communities. If we’re going to move forward, we’ll have to acknowledge what’s come before, and the reasoning for it.

Packaging
Python
