
December 2008

Avoiding Silos: “link” as a first-class object

One of the constant annoyances to me in web applications is their self-proclaimed need to know about everything and do everything, with only spotty ad hoc techniques for including things from other applications.

An example might be blog navigation or search, where you can only include data from the application itself. Or "Recent Posts", which can only show locally-produced posts. What if I post something elsewhere? I have to create some shoddy placeholder post to refer to it. Bah! Underlying this, the data is usually structured in a specific way, with the HTML being a sort of artifact of the database, the markup transient and a slave to the database’s structure.

An example of this might be a recent post listing like:

  for post in recent_posts:
      <a href="/post/{{post.year}}/{{post.month}}/{{post.slug}}">

There’s clearly no room for exceptions in this code. I am thus proposing that any system like this should have the notion of a "link" as a first-class object. The code should look like this:

  for post in recent_posts:
      {{post.link()}}

Just like with changing IDs to links in service documents, the template doesn’t actually look any more complicated than it did before (simpler, even). But now we can use simple object-oriented techniques to create first-class links. The code might look like:

class Post(SomeORM):
    def url(self):
        if self.type == 'link':
            return self.body
        else:
            base = get_request().application_url
            return '%s/%s/%s/%s' % (
                base, self.year, self.month, self.slug)

    def link(self):
        return html('<a href="%s">%s</a>') % (
            self.url(), self.title)

The addition of the .url() method has the obvious effect of making these offsite links work. Using a .link() method has the added advantage of allowing things like HTML snippets to be inserted into the system (even though that is not implemented here). By allowing arbitrary HTML in certain places you make it possible for people to extend the site in little ways — possibly adding markup to a title, or allowing an item in the list that actually contains two URLs (e.g., <a href="url1">Some Item</a> (<a href="url2">via</a>)).

In the context of Python I recommend making these into methods, not properties, because it allows you to later add keyword arguments to specialize the markup (like post.link(abbreviated=True)).
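To illustrate, here’s a hypothetical sketch of what such a keyword argument might look like later on (the standalone Post class, the field values, and the 20-character cutoff are all invented for this example):

```python
class Post(object):
    def __init__(self, title, year, month, slug):
        self.title = title
        self.year = year
        self.month = month
        self.slug = slug

    def url(self):
        return '/post/%s/%s/%s' % (self.year, self.month, self.slug)

    def link(self, abbreviated=False):
        # abbreviated=True truncates long titles, e.g. for a sidebar listing
        title = self.title
        if abbreviated and len(title) > 20:
            title = title[:17] + '...'
        return '<a href="%s">%s</a>' % (self.url(), title)

post = Post('A very long post title about links', 2008, 12, 'links')
print(post.link())
print(post.link(abbreviated=True))
```

Because link() is a method, the abbreviated keyword can be added later without touching any template that already calls post.link().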

One negative aspect of this is that you cannot affect all the markup through the template alone; you may have to go into the Python code to change things. Anyone have ideas for handling this problem?


Comments (13)


Javascript Status Message Display

In a little wiki I’ve been playing with, I’ve been trying out small ideas that I’ve had but never had a place to implement. One is how notification messages work. I’m sure other people have done the same thing, but I thought I’d describe it anyway.

A common pattern is to accept a POST request and then redirect the user to some page, setting a status message. Typically the status message is either set in a cookie or in the session, then the standard template for the application has some code to check for a message and display it.

The problem is that this breaks all caching — at any time any page can have some message injected into it, basically for no reason at all. So I thought: why not do the whole thing in Javascript? The server will set a cookie, but only Javascript will read it.

The code goes like this; on the server (easily translated into any framework):

resp.set_cookie('flash_message', urllib.quote(msg))

I quote the message because it can contain characters unsafe for cookies, and URL quoting is a particularly easy quoting to apply.
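The round-trip is easy to see in a couple of lines (a sketch; the message text is made up):

```python
try:
    from urllib import quote, unquote        # Python 2, as used above
except ImportError:
    from urllib.parse import quote, unquote  # Python 3 moved these

msg = 'Saved "My Page"; changes are live!'
cookie_value = quote(msg)
print(cookie_value)           # spaces, quotes, and semicolons are all %-escaped
print(unquote(cookie_value))  # decodes back to the original message
```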

Then I have this Javascript (using jQuery):

$(function () {
    // Anything in $(function...) is run on page load
    var flashMsg = readCookie('flash_message');
    if (flashMsg) {
        flashMsg = unescape(flashMsg);
        var el = $('<div id="flash-message">'+
          '<div id="flash-message-close">'+
          '<a title="dismiss this message" '+
          'id="flash-message-button" href="#">X</a></div>'+
          flashMsg + '</div>');
        $('a#flash-message-button', el).bind(
          'click', function () {
              // Dismiss the message when the X is clicked
              $('#flash-message').remove();
              return false;
          });
        // Insert the message at the top of the page, then clear
        // the cookie so the message only shows once
        $('body').prepend(el);
        eraseCookie('flash_message');
    }
});

Note that I’ve decided to treat the flash message as HTML. I don’t see a strong risk of injection attack in this case, though I must admit I’m a little unclear about what the normal policies are for cross-domain cookie setting.

I use these cookie functions because oddly I can’t find cookie handling functions in jQuery. It’s always weird to me how primitive document.cookie is. Anyway, CSS looks like this:

#flash-message {
  margin: 0.5em;
  border: 2px solid #000;
  background-color: #9f9;
  -moz-border-radius: 4px;
  text-align: center;
}

#flash-message-close {
  float: right;
  font-size: 70%;
  margin: 2px;
}

a#flash-message-button {
  text-decoration: none;
  color: #000;
  border: 1px solid #9f9;
}

a#flash-message-button:hover {
  border: 1px solid #000;
  background-color: #009;
  color: #fff;
}
This doesn’t have non-Javascript fallback, but I think that’s okay. This isn’t something that a spider would ever see (since spiders shouldn’t be submitting forms that result in update messages). Accessible browsers generally implement Javascript so that’s also not particularly a problem, though there may be additional hints I could give in CSS or Javascript to help make this more readable (if there’s a message, it should probably be the first thing read on the page).

Another common component of pages that varies separately from the page itself is logged-in status, but that’s more heavily connected to your application. Get both into Javascript and you might be able to turn caching way up on a lot of your pages.


Comments (15)


Using pip Requirements

Following on a set of recent posts (from James, me, then James again), Martijn Faassen wrote a description of Grok’s version management. Our ideas are pretty close, but he’s using buildout, and I’ll describe how to do the same things with pip.

Here’s a kind of development workflow that I think works well:

  • A framework release is prepared. Ideally there’s a buildbot that has been running (as Pylons has, for example), so the integration has been running for a while.
  • People make sure there are released versions of all the important components. If there are known conflicts between pieces, libraries and the framework update their install_requires in their setup.py files to make sure people don’t use conflicting pieces together.
  • Once everything has been released, there is a known set of packages that work together. With a buildbot, future versions may also work together, but they won’t necessarily work together with applications built on the framework. And breakage can also occur regardless of a buildbot.
  • Also, people may have versions of libraries already installed, but just because they’ve installed something doesn’t mean they really mean to stay with an old version. While known conflicts have been noted, there’s going to be lots of unknown conflicts and future conflicts.
  • When starting development with a framework, the developer would like to start with some known-good set, which is a set that can be developed by the framework developers, or potentially by any person. For instance, if you extend a public framework with an internal framework (or even a public sub-framework like Pinax) then the known-good set will be developed by a different set of people.
  • As an application is developed, the developer will add on other libraries, or use some of their own libraries. Development will probably occur at the trunk/tip of several libraries as they are developed together.
  • A developer might upgrade the entire framework, or just upgrade one piece (for instance, to get a bug fix they are interested in, or follow a branch that has functionality they care about). The developer doesn’t necessarily have the same notion of "stable" and "released" as the core framework developers have.
  • At the time of deployment the developer wants to make sure all the pieces are deployed together as they’ve tested them, and how they know them to work. At any time, another developer may want to clone the same set of libraries.
  • After initial deployment, the developer may want to upgrade a single component, if only to test that an upgrade works, or if it resolves a bug. They may test out combinations only to throw them away, and they don’t want to bump versions of libraries in order to deploy new combinations.

This is the kind of development pattern that requirement files are meant to assist with. They can provide a known-good set of packages. Or they can provide a starting point for an active line of development. Or they can provide a historical record of how something was put together.

The easy way to start a requirement file for pip is just to put the packages you know you want to work with. For instance, we’ll call this project-start.txt:

-e svn+http://mycompany/svn/MyApp/trunk#egg=MyApp
-e svn+http://mycompany/svn/MyLibrary/trunk#egg=MyLibrary

You can plug away for a while, and maybe you decide you want to freeze the file. So you do:

$ pip freeze -r project-start.txt project-frozen.txt

By using -r project-start.txt you give pip freeze a template for it to start with. From that, you’ll get project-frozen.txt that will look like:

-e svn+http://mycompany/svn/MyApp/trunk@1045#egg=MyApp
-e svn+http://mycompany/svn/MyLibrary/trunk@1058#egg=MyLibrary

## The following requirements were added by pip --freeze:
# Installing as editable to satisfy requirement INITools==0.2.1dev-r3488:
-e svn+http://svn.colorstudy.com/INITools/trunk@3488#egg=INITools-0.2.1dev_r3488

At that point you might decide that you don’t care about the INITools version, or you might have installed something from trunk when you could have used the last release. So you go and adjust some things.

Martijn also asks: how do you have framework developers maintain one file, and then also have developers maintain their own lists for their projects?

You could start with a file like this for the framework itself. Pylons for instance could ship with something like this. To install Pylons you could then do:

$ pip -E MyProject install \
>    -r http://pylonshq.com/0.9.7-requirements.txt

You can also download that file yourself, add some comments, rename the file and add your project to it, and use that. When you freeze, the order of the packages and any comments will be preserved, so you can keep track of what changed. It should also be amenable to source control, and diffs will be sensible.

You could also use indirection, creating a file like this for your project:

-r http://pylonshq.com/0.9.7-requirements.txt
-e svn+http://mycompany/svn/MyApp/trunk#egg=MyApp
-e svn+http://mycompany/svn/MyLibrary/trunk#egg=MyLibrary

That is, requirements files can refer to each other. So if you want to maintain your own requirements file alongside the development of an upstream requirements file, you could do that.


Comments (3)


A Few Corrections To “On Packaging”

James Bennett recently wrote an article on Python packaging and installation, and Setuptools. There’s a lot of issues, and writing up my thoughts could take a long time, but I thought at least I should correct some errors, specifically category errors. Figuring out where all the pieces in Setuptools (and pip and virtualenv) fit is difficult, so I don’t blame James for making some mistakes, but in the interest of clarifying the discussion…

I will start with a kind of glossary:

Distribution:
This is something-with-a-setup.py. A tarball, zip, a checkout, etc. Distributions have names; this is the name in setup(name="...") in the setup.py file. They have some other metadata too (description, version, etc), and Setuptools adds to that metadata some. Distutils doesn’t make it very easy to add to the metadata — it’ll whine a little about things it doesn’t know, but won’t do anything with that extra data. Fixing this problem in Distutils is an important aspect of Setuptools, and part of what makes Distutils itself unsuitable as a basis for good library management.
Package:
This is something you import. It is not the same as a distribution, though usually a distribution will have the same name as a package. In my own libraries I try to name the distribution with mixed case (like Paste) and the package with lower case (like paste). Keeping the terminology straight here is very difficult; usually it doesn’t matter, but sometimes it does.
Setuptools The Distribution:
This is what you install when you install Setuptools. It includes several pieces that Phillip Eby wrote, that work together but are not strictly a single thing.
setuptools The Package:
This is what you get when you do import setuptools. Setuptools largely works by monkeypatching distutils, so simply importing setuptools activates its functionality from then on. This package is entirely focused on installation and package management; it is not something you should use at runtime (unless you are installing packages as your runtime, of course).
pkg_resources The Module:
This is also included in Setuptools The Distribution, and is for use at runtime. This is a single module that provides the ability to query what distributions are installed, metadata about those distributions, and information about the locations where they are installed. It also allows distributions to be "activated". A distribution can be available but not activated. Activating a distribution means adding its location to sys.path, and probably you’ve noticed how long sys.path is when you use easy_install. Almost everything that allows different libraries to be installed, or allows different versions of libraries, does it through some management of sys.path. pkg_resources also allows for generic access to "resources" (i.e., non-code files), and lets those resources be in zip files. pkg_resources is safe to use; it doesn’t do any of the funny stuff that people get annoyed with.
easy_install:
This is also in Setuptools The Distribution. The basic functionality it provides is that given a name, it can search for a package with that distribution name, optionally satisfying a version requirement. It then downloads the package and installs it (using setup.py install, but with the setuptools monkeypatches in place). After that, it checks the newly installed distribution to see if it requires any other libraries that aren’t yet installed, and if so it installs them.
Eggs the Distribution Format:
These are zip files that Setuptools creates when you run python setup.py bdist_egg. Unlike a tarball, these can be binary packages, containing compiled modules, and generally contain .pyc files (which are portable across platforms, but not Python versions). This format only includes files that will actually be installed; as a result it does not include doc files or setup.py itself. All the metadata from setup.py that is needed for installation is put in files in a directory EGG-INFO.
Eggs the Installation Format:
Eggs the Distribution Format are a subset of the Installation Format. That is, if you put an Egg zip file on the path, it is installed, no other process is necessary. But the Installation Format is more general. To have an egg installed, you either need something like DistroName-X.Y.egg/ on the path, and then an EGG-INFO/ directory under that with the metadata, or a path like DistroName.egg-info/ with the metadata directly in that directory. This metadata can exist anywhere, and doesn’t have to be directly alongside the actual Python code. Egg directories are required for pkg_resources to activate and deactivate distributions, but otherwise they aren’t necessary.
pip:
This is an alternative to easy_install. It works somewhat differently than easy_install, but not much. Mostly it is better than easy_install, in that it has some extra features and is easier to use. Unlike easy_install, it downloads all distributions up-front, and generates the metadata to read distribution and version requirements. It uses Setuptools to generate this metadata from a setup.py file, and uses pkg_resources to parse this metadata. It then installs packages with the setuptools monkeypatches applied. It happens to pass setup.py install the option --single-version-externally-managed, which gets Setuptools to install packages in a flatter manner, with Distro.egg-info/ directories alongside the package. Pip installs eggs! I’ve heard the many complaints about easy_install (and I’ve had many myself), but ultimately I think pip does well by just fixing a few small issues. Pip is not a repudiation of Setuptools or the basic mechanisms that easy_install uses.
poacheggs:
This is a defunct package that had some of the features of pip (particularly requirement files) but used easy_install for installation. Don’t bother with this, it was just a bridge to get to pip.
virtualenv:
This is a little hack that creates isolated Python environments. It’s based on virtual-python.py, which is something I wrote based on some documentation notes PJE wrote for Setuptools. Basically virtualenv just creates a bin/python interpreter that has its own value of sys.prefix, but uses the system Python and standard library. It also installs Setuptools to make it easier to bootstrap the environment (because bootstrapping Setuptools is itself a bit tedious). I’ll add pip to it too sometime. Using virtualenv you don’t have to worry about different library versions, because for any one environment you will probably only need one version of a library. On any one machine you probably need different versions, which is why installing packages system-wide is problematic for most libraries. (I’ve been meaning to write a post on why I think using system packaging for libraries is counter-productive, but that’ll wait for another time.)
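As a small sketch of the runtime querying pkg_resources offers (what it prints depends entirely on the machine; setuptools itself is assumed to be installed):

```python
import pkg_resources

# Metadata for one installed distribution
dist = pkg_resources.get_distribution('setuptools')
print(dist.project_name, dist.version)
print(dist.location)  # the directory the distribution was installed into

# Iterate over everything currently activated on sys.path
for d in list(pkg_resources.working_set)[:5]:
    print(d.project_name, d.version)
```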

So… there’s the pieces involved, at least the ones I can remember now. And I haven’t really discussed .pth files, entry points, sys.path trickery, site.py, distutils.cfg… sadly this is a complex state of affairs, but it was also complex before Setuptools.

There are a few things that I think people really dislike about Setuptools.

First, zip files. Setuptools prefers zip files, for reasons that won’t mean much to you, and maybe are more historical than anything. When a distribution doesn’t indicate whether it is zip-safe, Setuptools looks at the code to see if it uses __file__, and if not it presumes that the code is probably zip-safe. The specific problem James cites is what appears to be a bug in Django, that Django looks for code and can’t traverse into zip files in the same way that Python itself can. Setuptools didn’t itself add anything to Python to make it import zip files; that functionality was added to Python some time before. The zipped eggs that Setuptools installs use existing (standard!) Python functionality.

That said, I don’t think zipping libraries up is all that useful, and while it should work, it doesn’t always, and it makes code harder to inspect and understand. So since it’s not that useful, I’ve disabled it when pip installs packages. I also have had it disabled on my own system for years now, by creating a distutils.cfg file with [easy_install] zip_ok = False in it. Sadly App Engine is forcing me to use zip files again, because of its absurdly small file limits… but that’s a different topic. (There is an experimental pip zip command mostly intended for App Engine.)

Another pain point is version management with setup.py and Setuptools. Indeed it is easy to get things messed up, and it is easy to piss people off by overspecifying, and sometimes things can get in a weird state for no good reason (often because of easy_install’s rather naive leap-before-you-look installation order). Pip fixes that last point, but it also tries to suggest more constructive and less painful ways to manage other pieces.

Pip requirement files are an assertion of versions that work together. setup.py requirements (the Setuptools requirements) should contain two things: 1: all the libraries used by the distribution (without which there’s no way it’ll work) and 2: exclusions of the versions of those libraries that are known not to work. setup.py requirements should not be viewed as an assertion that by satisfying those requirements everything will work, just that it might work. Only the end developer, testing the system together, can figure out if it really works. Then pip gives you a way to record that working set (using pip freeze), separate from any single distribution or library.
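In other words, a setup.py following that advice might declare something like this (the names and the broken version are invented for illustration):

```python
from setuptools import setup

setup(
    name='MyApp',
    version='0.1',
    install_requires=[
        'SomeLibrary',           # required, but most versions should work
        'OtherLibrary!=1.2.1',   # exclude only the version known not to work
    ],
)
```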

There’s also a lot of conflicts between Setuptools and package maintainers. This is kind of a proxy war between developers and sysadmins, who have very different motivations. It deserves a post of its own, but the conflicts are about more than just how Setuptools is implemented.

I’d love if there was a language-neutral library installation and management tool that really worked. Linux system package managers are absolutely not that tool; frankly it is absurd to even consider them as an alternative. So for now we do our best in our respective language communities. If we’re going to move forward, we’ll have to acknowledge what’s come before, and the reasoning for it.


Comments (40)


lxml: an underappreciated web scraping library

When people think about web scraping in Python, they usually think BeautifulSoup. That’s okay, but I would encourage you to also consider lxml.

First, people think BeautifulSoup is better at parsing broken HTML. This is not correct. lxml parses broken HTML quite nicely. I haven’t done any thorough testing, but at least the BeautifulSoup broken HTML example is parsed better by lxml (which knows that <td> elements should go inside <table> elements).
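A tiny sketch of that repair (the broken markup is invented; the exact normalization can vary with your libxml2 version):

```python
from lxml.html import fromstring, tostring

broken = '<table><td>cell one<td>cell two</table>'
doc = fromstring(broken)
print(tostring(doc))  # lxml closes the stray cells and rebuilds a sane tree

# The cell contents survive, properly nested inside the table
print([td.text for td in doc.iter('td')])
```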

Second, people feel lxml is harder to install. This is correct. BUT, lxml 2.2alpha1 includes an option to compile static versions of the underlying C libraries, which should improve the installation experience, especially on Macs. To install this new way, try:

$ STATIC_DEPS=true easy_install 'lxml>=2.2alpha1'

Once you have lxml installed, you have a great parser (which happens to be super-fast, and that is not a tradeoff). You get a fairly familiar API based on ElementTree, which, though it feels a little strange at first, offers a compact and canonical representation of a document tree, compared to more traditional representations. But there’s more…

One of the features that should be appealing to many people doing screen scraping is that you get CSS selectors. You can use XPath as well, but usually that’s more complicated (for example). Here’s an example I found getting links from a menu in a page in BeautifulSoup:

from BeautifulSoup import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('http://java.sun.com').read())
menu = soup.findAll('div',attrs={'class':'pad'})
for subMenu in menu:
    links = subMenu.findAll('a')
    for link in links:
        print "%s : %s" % (link.string, link['href'])

Here’s the same example in lxml:

from lxml.html import parse
doc = parse('http://java.sun.com').getroot()
for link in doc.cssselect('div.pad a'):
    print '%s: %s' % (link.text_content(), link.get('href'))

lxml generally knows more about HTML than BeautifulSoup. Also I think it does well with the small details; for instance, the lxml example will match elements in <div class="pad menu"> (space-separated classes), which the BeautifulSoup example does not do (obviously there are other ways to search, but the obvious and documented technique doesn’t pay attention to HTML semantics).

One feature that I think is really useful is .make_links_absolute(). This takes the base URL of the page (doc.base) and uses it to make all the links absolute. This makes it possible to relocate snippets of HTML or whole sets of documents (as with this program). This isn’t just <a href> links, but stylesheets, inline CSS with @import statements, background attributes, etc. It doesn’t see quite all links (for instance, links in Javascript) but it sees most of them, and works well for most sites. So if you want to make a local copy of a site:

from lxml.html import parse, open_in_browser
doc = parse('http://wiki.python.org/moin/').getroot()
doc.make_links_absolute()
open_in_browser(doc)

open_in_browser serializes the document to a temporary file and then opens a web browser (using webbrowser).
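You can see the link rewriting on a small fragment too (the HTML and base URL here are made up):

```python
from lxml.html import fromstring, tostring

snippet = '<div><a href="page.html">a page</a> <img src="pics/logo.png"></div>'
doc = fromstring(snippet, base_url='http://example.com/dir/')
doc.make_links_absolute()  # uses the base_url given at parse time
print(tostring(doc))
```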

Here’s an example that compares two pages using lxml.html.diff:

from lxml.html.diff import htmldiff
from lxml.html import parse, tostring, open_in_browser, fromstring

def get_page(url):
    doc = parse(url).getroot()
    return tostring(doc)

def compare_pages(url1, url2, selector='body div'):
    basis = parse(url1).getroot()
    other = parse(url2).getroot()
    el1 = basis.cssselect(selector)[0]
    el2 = other.cssselect(selector)[0]
    diff_content = htmldiff(tostring(el1), tostring(el2))
    diff_el = fromstring(diff_content)
    el1.getparent().insert(el1.getparent().index(el1), diff_el)
    return basis

if __name__ == '__main__':
    import sys
    doc = compare_pages(*sys.argv[1:])
    open_in_browser(doc)

You can use it like:

$ python lxmldiff.py \
    'http://wiki.python.org/moin/BeginnersGuide?action=recall&rev=70' \
    'http://wiki.python.org/moin/BeginnersGuide?action=recall&rev=81'

Another feature lxml has is form handling. All the cool sexy new sites use minimal forms, but searching for "registration forms" I get this nice complex form. Let’s look at it:

>>> from lxml.html import parse, tostring
>>> doc = parse('http://www.actuaryjobs.com/cform.html').getroot()
>>> doc.forms
[<Element form at -48232164>]
>>> form = doc.forms[0]
>>> form.inputs.keys()
['thank_you_title', 'City', 'Zip', ... ]

Now we have a form object. There’s two ways to get to the fields: form.inputs, which gives us a dictionary of all the actual <input> elements (and textarea and select). There’s also form.fields, which is a dictionary-like object. The dictionary-like object is convenient, for instance:

>>> form.fields['cEmail'] = 'me@example.com'

This actually updates the input element itself:

>>> tostring(form.inputs['cEmail'])
'<input type="input" name="cEmail" size="30" value="me@example.com">'

I think it’s actually a nicer API than htmlfill and can serve the same purpose on the server side.

But then you can also use the same interface for scraping, by filling fields and getting the submission. That looks like:

>>> import urllib
>>> action = form.action
>>> data = urllib.urlencode(form.form_values())
>>> if form.method == 'GET':
...     if '?' in action:
...         action += '&' + data
...     else:
...         action += '?' + data
...     data = None
>>> resp = urllib.urlopen(action, data)
>>> resp_doc = parse(resp).getroot()

Lastly, there’s HTML cleaning. I think all these features work together well, do useful things, and it’s based on an actual understanding of HTML instead of just treating tags and attributes as arbitrary. (Also if you really like jQuery, you might want to look at pyquery, which is a jQuery-like API on top of lxml).


Comments (51)


The Magic Sentinel

In an effort to get back on the blogging saddle, here’s a little note on default values in Python.

In Python there are often default values. The most typical default value is None — None is an object of vague meaning that almost screams "I’m a default". But sometimes None is a valid value, and sometimes you want to detect the case of "no value given", and None can hardly be called no value.

Here’s an example:

def getuser(username, default=None):
    if not user_exists(username):
        return default
    # ... otherwise look up and return the user

In this case there is always a default, and so anytime you call getuser() you have to check for a None result. But maybe you have code where you’d really just like to get an exception if the user isn’t found. To get this you can use a sentinel. A sentinel is an object that has no particular meaning except to signal the end (like a NULL byte in a C string), or a special condition (like no default user).

Sometimes people do it like this:

_no_default = ()
def getuser(username, default=_no_default):
    if not user_exists(username):
        if default is _no_default:
            raise LookupError("No user with the username %r" % username)
        return default

This works because that zero-item tuple () is a unique object, and since we are using the comparison default is _no_default only that exact object will trigger that LookupError.

Once you understand the pattern, this is easy enough to read. But when you use help() or other automatic documentation generation, it is a little confusing, because the default value just appears as (). You could also use object() or [] or anything else, but the automatically generated documentation still won’t look that nice. So for a bit more polish I suggest:

class _NoDefault(object):
    def __repr__(self):
        return '(no default)'
NoDefault = _NoDefault()
del _NoDefault

def getuser(username, default=NoDefault):

You might then think "hey, why isn’t there one NoDefault that everyone can share?" If you do share that sentinel you run the risk of accidentally passing in that value even though you didn’t intend to. The value "NoDefault" will become overloaded with meaning, just as None is. By having a more private sentinel object you avoid that. A single nice sentinel factory (like _NoDefault in this example) would be nice, though. Though really PEP 3102 will probably make sentinels like this unnecessary for Python 3.0.
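Such a factory could be as small as this (a sketch; make_sentinel and the stub user table are invented for the example):

```python
def make_sentinel(name):
    """Return a unique object whose repr() reads nicely in generated docs."""
    class _Sentinel(object):
        def __repr__(self):
            return name
    return _Sentinel()

NoDefault = make_sentinel('(no default)')

# A stub user table so the example runs on its own
_users = {'alice': 'Alice'}

def getuser(username, default=NoDefault):
    if username not in _users:
        if default is NoDefault:
            raise LookupError("No user with the username %r" % username)
        return default
    return _users[username]

print(repr(NoDefault))
print(getuser('bob', default=None))
```

Each call to make_sentinel() produces a distinct object, so two modules creating their own sentinels can never collide.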

Note that you can also implement arguments with no default via *args and **kwargs, e.g.:

def getuser(username, *args):
    if not user_exists(username):
        if not args:
            raise LookupError(...)
        return args[0]

But to do this right you should test that len(args) <= 1, raise appropriate errors, maybe consider keyword arguments, and so on. It’s a pain in the butt, and when you’re finished the signature displayed by help() will be wrong anyway.


Comments (10)