We can answer on question what is VPS? and what is cheap dedicated servers?

HTML

Throw out your frameworks! (forms included)

No, I should say forms particularly.

I have lots of things to blog about, but nothing makes me want to blog like code. Ideas are hard, code is easy. So when I saw Jacob’s writeup about dynamic Django form generation I felt a desire to respond. I didn’t see the form panel at PyCon (I intended to but I hardly saw any talks at PyCon, and yet still didn’t even see a good number of the people I wanted to see), but as the author of an ungenerator and as a general form library skeptic I have a somewhat different perspective on the topic.

The example created for the panel might display that perspective. You should go read Jacob’s description; but basically it’s a simple registration form with a dynamic set of questions to ask.

I have created a complete example, because I wanted to be sure I wasn’t skipping anything, but I’ll present a trimmed-down version.

First, the basic control logic:


from webob.dec import wsgify
from webob import exc
from formencode import htmlfill

@wsgify
def questioner(req):
    questions = get_questions(req) # This is provided as part of the example
    if req.method == 'POST':
        errors = validate(req, questions)
        if not errors:
            ... save response ...
            return exc.HTTPFound(location='/thanks')
    else:
        errors = {}
    ## Here's the "form generation":
    page = page_template.substitute(
        action=req.url,
        questions=questions)
    page = htmlfill.render(
        page,
        defaults=req.POST,
        errors=errors)
    return Response(page)

def validate(req, questions):
    # All manual, but do it however you want:
    errors = {}
    form = req.POST
    if (form.get('password')
        and form['password'] != form.get('password_confirm')):
        errors['password_confirm'] = 'Passwords do not match'
    fields = questions + ['username', 'password']
    for field in fields:
        if not form.get(field):
            errors[field] = 'Please enter a value'
    return errors
 

I’ve just manually handled validation here. I don’t feel like doing it with FormEncode. Manual validation isn’t that big a deal; FormEncode would just produce the same errors dictionary anyway. In this case (as in many form validation cases) you can’t do better than hand-written validation code: it’s shorter, more self-contained, and easier to tweak.

After validation the template is rendered:


page = page_template.substitute(
    action=req.url,
    questions=questions)
 

I’m using Tempita, but it really doesn’t matter. The template looks like this:


<form action="{{action}}" method="POST">
New Username: <input type="text" name="username"><br />
Password: <input type="password" name="password"><br />
Repeat Password:
  <input type="password" name="password_confirm"><br />
{{for question in questions}}
  {{question}}: <input type="text" name="{{question}}"><br />
{{endfor}}
<input type="submit">
</form>
 

Note that the only "logic" here is to render the form to include fields for all the questions. Obviously this produces an ugly form, but it’s very obvious how you make this form pretty, and how to tweak it in any way you might want. Also if you have deeper dynamicism (e.g., get_questions start returning the type of response required, or weird validation, or whatever) it’s very obvious where that change would go: display logic goes in the form, validation logic goes in that validate function.

This just gives you the raw form. You wouldn’t need a template at all if it wasn’t for the dynamicism. Everything else is added when the form is "filled":


page = htmlfill.render(
    page,
    defaults=req.POST,
    errors=errors)
 

How exactly you want to calculate defaults is up to the application; you might want query string variables to be able to pre-fill the form (use req.params), you might want the form bare to start (like here with req.POST), you can easily implement wizards by stuffing req.POST into the session to repeat a form, you might read the defaults out of a user object to make this an edit form. And errors are just handled automatically, inserted into the HTML with appropriate CSS classes.

A great aspect of this pattern if you use it (I’m not even sure it deserves the moniker library): when HTML 5 Forms finally come around and we can all stop doing this stupid server-side overthought nonsense, you won’t have overthought your forms. Your mind will be free and ready to accept that the world has actually become simpler, not more complicated, and that there is knowledge worth forgetting (forms are so freakin’ stupid!) If at all possible, dodging complexity is far better than cleverly responding to complexity.

HTML
Programming
Python
Web

Comments (27)

Permalink

Weave: valuable client-side data

I’ve been looking at Weave some lately. The large-print summary on the page is Synchronize Your Firefox Experience Across Desktop and Mobile. Straight forward enough.

Years and years ago I stopped really using bookmarks. You lose them moving from machine to machine (which Weave could help), but mostly I stopped using them because it was too hard to (a) identify interesting content and place it into a taxonomy and (b) predict what I would later be interested in. If I wanted to refer to something I’d seen before there’s a good chance I wouldn’t have saved it, while my bookmarks would be flooded with things that time would show were of transient interest.

So… synchronizing bookmarks, eh. Saved form data and logins? Sure, that’s handy. It would make browsing on multiple machines nicer. But it feels more like a handy tweak.

All my really useful data is kept on servers, categorized and protected by a user account. Why is that? Well, of course, where else would you keep it? In cookies? Ha!

Why not in cookies? So many reasons… because cookies are opaque and can’t hold much data, can’t be exchanged, and probably worst of all they just disappear randomly.

What if cookies weren’t so impossibly lame for holding important data? Suddenly sync seems much more interesting. Instead of storing documents and data on a website, the website could put all that data right into your browser. And conveniently HTML 5 has an API for that. Everyone thinks about that API as a way of handling off-line caching, because while it handles many problems with cookies it doesn’t handle the problem of data disappearing as you move between computers and browser. That’s where Weave synchronization could change things. I don’t think this technique is something appropriate for every app (maybe not most apps), but it could allow a new class of applications.

Advantages: web development and scaling becomes easy. If you store data in the browser scaling is almost free; serving static pages is a Solved Problem. Development is easier because development and deployment of HTML and Javascript is pretty easy. Forking is easy — just copy all the resources. So long as you don’t hardcode absolute links into your Javascript, you can even just save from the browser and get a working local copy of the application.

Disadvantages: another silo. You might complain about Facebook keeping everyone’s data, but the data in Facebook is still more transparent than data held in files or locally with a browser. Let’s say you create a word processor that uses local storage for all its documents. If you stored that document online sharing and collaboration would be really easy; but with it stored locally the act of sharing is not as automatic, and collaboration is very hard. Sure, the "user" is in "control" of their data, but that would be more true on paper than in practice. Building collaboration on top of local storage is hard, and without that… maybe it’s not that interesting?

Anyway, there is an interesting (but maybe Much Too Hard) problem in there. (DVCS in the browser?)

Update: in this video Aza talks about just what I talk about here. A few months ago. The Weave APIs also allude to things like this, including collaboration. So… they are on it!

HTML
Web

Comments (2)

Permalink

Modern Web Design, I Renounce Thee!

I’m not a designer, but I spend as much time looking at web pages as the next guy. So I took interest when I came upon this post on font size by Wilson Miner, which in turn is inspired by the 100e2r (100% easy to read) standard by Oliver Reichenstein.

The basic idea is simple: we should have fonts at the "default" size, about 16px, no smaller. This is about the size of text in print, read at a reasonable distance (typically closer up than a screen):

http://blog.ianbicking.org/wp-content/uploads/images/typesize_comparison2.jpg

Also it calls out low-contrast color schemes, which I think are mostly passe, and I will not insult you, my reader, by suggesting you don’t entirely agree. Because if you don’t agree, well, I’m afraid I’d have to use some strong words.

I think small fonts, low contrast, huge amounts of whitespace, are a side effect of the audience designers create for.

This makes me think of Modern Architecture:

http://blog.ianbicking.org/wp-content/uploads/images/300px-seagram.jpg

This is a form of architecture popular for skyscapers and other dramatic structures, with their soaring heights and other such dramatic adjectives. These are buildings designed for someone looking at the building from five hundred feet away. They are not designed for occupants. But that’s okay, because the design isn’t sold to occupants, it is sold to people who look at the sketches and want to feel very dramatic.

Similarly, I think the design pattern of small fonts is something meant to appeal to shallow observation. By deemphasizing the text itself, the design is accentuated. Low-contrast text is even more obviously the domination of design over content. And it may very well look more professional and visually pleasing. But web design isn’t for making sites visually pleasing, it is for making the experience of the content more pleasing. Sites exist for their content, not their design.

In 100e2r he also says let your text breathe. You need whitespace. If you view my site directly, you’ll notice I don’t have big white margins around my text. When you come to my site, it’s to see my words, and that’s what I’m going to give you! When I want to let my text breathe with lots of whitespace this is what I do:

http://blog.ianbicking.org/wp-content/uploads/images/500px-my-white-desktop.jpg

Is a huge block of text hard to read? It is. And yeah, I’ve written articles like that. But the solution?

WRITE BETTER

Similarly, it’s hard to read text if you don’t use paragraphs, but the solution isn’t to increase your line height until every line is like a paragraph of its own.

The solution to the drudgery of large swathes of text is:

  1. Make your blocks of text smaller.
  2. Use something other than paragraphs of text.

Throw in a list. Do some indentation. Toss in even a stupid picture. Personally I try to throw in code examples, because that’s how we roll on this blog.

That’s good writing, that’s content that is easy to read. It’s not easy to write, and I’m sure I miss the mark more often than not. But you can’t design your way to good content. If you want to write like this, if you want to let the flow of your text reflect the flow of your ideas, you need room. Huge margins don’t give you room. They are a crutch for poor writing, and not even a good crutch.

So in conclusion: modern design be damned!

HTML
Non-technical
Web

Comments (21)

Permalink

Atompub as an alternative to WebDAV

I’ve been thinking about an import/export API for PickyWiki; I want something that’s sensible, and works well enough that it can be the basic for things like creating restorable snapshots, integration with version control systems, and being good at self-hosting documentation.

So far I’ve made a simple import/export system based on Atom. You can export the entire site as an Atom feed, and you can import Atom feeds. But whole-site import/export isn’t enough for the tools I’d like to write on top of the API.

WebDAV would seem like a logical choice, as it lets you get and put resources. But it’s not a great choice for a few reasons:

  • It’s really hard to implement on the server.
  • Even clients are hard to implement.
  • It uses GET to get resources. This is probably its most fatal flaw. There is no CMS that I know of (except maybe one) where the thing you view the browser is the thing that you’d actually edit. To work around this CMSes use User-Agent sniffing or an alternate URL space.
  • WebDAV is worried about "collections" (i.e., directories). The web basically doesn’t know what "collections" are, it only knows paths, and paths are strings.
  • (In summary) WebDAV uses HTTP, but it is not of the web.

I don’t want to invent something new though. So I started thinking of Atom some more, and Atompub.

The first thought is how to fix the GET problem in WebDAV. A web page isn’t an editable representation, but it’s pretty reasonable to put an editable representation into an Atom entry. Clients won’t necessarily understand extensions and properties you might add to those entries, but I don’t see any way around that. An entry might look like:


<entry>
  <content type="html">QUOTED HTML</content>
  ... other normal metadata (title etc) ...
  <privateprop:myproperty xmlns:privateprop="URL" name="foo" value="bar" />
</entry>
 

While there is special support for HTML, XHTML, and plain text in Atom, you can put any type of content in <content>, encoded in base64.

To find the editable representation, the browser page can point to it. I imagine something like this:


<link rel="alternate" type="application/atom+xml; type=entry"
 href="this-url?format=atom">
 

The actual URL (in this example this-url?format=atom) can be pretty much anything. My one worry is that this could be confused with feed detection, which looks like:


<link rel="alternate" type="application/atom+xml"
 href="/atom.xml">
 

The only difference is "; type=entry", which I’m betting a lot of clients don’t pay attention to.

The Atom entries then can have an element:


<link rel="edit" href="this-url" />
 

This is a location where you can PUT a new entry to update the resource. You could allow the client to PUT directly over the old page, or use this-url?format=atom or whatever is convenient on the server-side. Additionally, DELETE to the same URL would delete.

This handles updates and deletes, and single-page reads. The next issue is creating pages.

Atompub makes creation fairly simple. First you have to get the Atompub service document. This is a document with the type application/atomsvc+xml and it gives the collection URL. It’s suggested you make this document discoverable like:


<link rel="service" type="application/atomsvc+xml"
 href="/atomsvc.xml">
 

This document then points to the "collection" URL, which for our purposes is where you create documents. The service document would look like:


<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>SITE TITLE</atom:title>
    <collection href="/atomapi">
      <atom:title>SITE TITLE</atom:title>
      <accept>*/*</accept>
      <accept>application/atom+xml;type=entry</accept>
    </collection>
  </workspace>
</service>
 

Basically this indicates that you can POST any media to /atomapi (both Atom entries, and things like images).

To create a page, a client then does a POST like:


POST /atomapi
Content-Type: application/atom+xml; type=entry
Slug: /page/path

<entry xmlns="...">...</entry>
 

There’s an awkwardness here, that you can suggest (via the Slug header) what the URL for the new page is. The client can find the actual URL of the new page from the Location header in the response. But the client can’t demand that the slug be respected (getting an error back if it is not), and there’s lots of use cases where the client doesn’t just want to suggest a path (for instance, other documents that are being created might rely on that path for links).

Also, "slug" implies… well, a slug. That is, some path segment probably derived from the title. There’s nothing stopping the client from putting a complete path in there, but it’s very likely to be misinterpreted (e.g. translating /page/path to /2009/01/pagepath).

Bug I digress. Anyway, you can post every resource as an entry, base64-encoding the resource body, but Atompub also allows POSTing media directly. When you do that, the server puts the media somewhere and creates a simple Atom entry for the media. If you wanted to add properties to that entry, you’d edit the entry after creating it.

The last missing piece is how to get a list of all the pages on a site. Atompub does have an answer for this: just GET /atomapi will give you an Atom feed, and for our purposes we can demand that the feed is complete (using paging so that any one page of the feed doesn’t get too big). But this doesn’t seem like a good solution to me. GData specifies a useful set of queries to for feeds, but I’m not sure that this is very useful here; the kind of queries a client needs to do for this use case aren’t things GData was designed for.

The queries that seem most important to me are queries by page path (which allows some sense of "collections" without being formal) and by content type. Also to allow incremental updates on the client side, filtering these queries by last-modified time (i.e., all pages created since I last looked). Reporting queries (date of creation, update, author, last editor, and custom properties) of course could be useful, but don’t seem as directly applicable.

Also, often the client won’t want the complete Atom entry for the pages, but only a list of pages (maybe with minimal metadata). I’m unsure about the validity of abbreviated Atom entries, but it seems like one solution. Any Atom entry can have something like:


<link rel="self" type="application/atom+xml; type=entry"
 href="url?format=atom" />
 

This indicates where the entry exists, though it doesn’t suggest very forcefully that the actual entry is abbreviated. Anyway, I could then imagine a feed like:


<feed>
  <entry>

    <content type="some/content-type" />
    <link rel="self" href="..." />
    <updated>YYYYMMDDTHH:MM:SSZ</updated>
  <entry>
  ...
</feed>
 

This isn’t entirely valid, however — you can’t just have an empty <content> tag. You can use a src attribute to use indirection for the content, and then add Yet Another URL for each page that points to its raw content. But that’s just jumping through hoops. This also seems like an opportunity to suggest that the entry is incomplete.

To actually construct these feeds, you need some way of getting the feed. I suggest that another entry be added to the Atompub service document, something like:


<cmsapi:feed href="URI-TEMPLATE" />
 

That would be a URI Template that accepted several known variables (though frustratingly, URI Templates aren’t properly standardized yet). Things like:

  • content-type: the content type of the resource (allowing wildcards like image/*)
  • container: a path to a container, i.e., /2007 would match all pages in /2007/...
  • path-regex: some regular expression to match the paths
  • last-modified: return all pages modified at the given date or later

All parameters would be ANDed together.

So, open issues:

  • How to strongly suggest a path when creating a resource (better than Slug)
  • How to rename (move) or copy a page (it’s easy enough to punt on copy, but I’d rather move by a little more formal than just recreating a resource in a new location and deleting the original)
  • How to represent abbreviated Atom entries

With these resolved I think it’d be possible to create a much simpler API than WebDAV, and one that can be applied to existing applications much more easily. (If you think there’s more missing, please comment.)

HTML
Programming
Web

Comments (26)

Permalink

Avoiding Silos: “link” as a first-class object

One of the constant annoyances to me in web applications is the self-proclaimed need for those applications to know about everything and do everything, and only spotty ad hoc techniques for including things from other applications.

An example might be blog navigation or search, where you can only include data from the application itself. Or "Recent Posts" which can only show locally-produce posts. What if I post something elsewhere? I have to create some shoddy placeholder post to refer to it. Bah! Underlying this the data is usually structured in a specific way, with the HTML being a sort of artifact of the database, the markup transient and a slave to the database’s structure.

An example of this might be a recent post listing like:


<ul>
  for post in recent_posts:
    <li>
      <a href="/post/{{post.year}}/{{post.month}}/{{post.slug}}">
        {{post.title}}</a>
    </li>
</ul>
 

There’s clearly no room for exceptions in this code. I am thus proposing that any system like this should have the notion of a "link" as a first-class object. The code should look like this:


<ul>
  for post in recent_posts:
    <li>
      {{post.link()}}
    </li>
</ul>
 

Just like with changing IDs to links in service documents, the template doesn’t actually look any more complicated than it did before (simpler, even). But now we can use simple object-oriented techniques to create first-class links. The code might look like:


class Post(SomeORM):
    def url(self):
        if self.type == 'link':
            return self.body
        else:
            base = get_request().application_url
            return '%s/%s/%s/%s' % (
                base, self.year, self.month, self.slug)

    def link(self):
        return html('<a href="%s">%s</a>') % (
            self.url(), self.title)
 

The addition of the .url() method has the obvious effect of making these offsite links work. Using a .link() method has the added advantage of allowing things like HTML snippets to be inserted into the system (even though that is not implemented here). By allowing arbitrary HTML in certain places you make it possible for people to extend the site in little ways — possibly adding markup to a title, or allowing an item in the list that actually contains two URLs (e.g., <a href="url1">Some Item</a> (<a href="url2">via</a>)).

In the context of Python I recommend making these into methods, not properties, because it allows you to later add keyword arguments to specialize the markup (like post.link(abbreviated=True)).

One negative aspect of this is that you cannot affect all the markup through the template alone, you may have to go into the Python code to change things. Anyone have ideas for handling this problem?

HTML
Programming
Python
Web

Comments (13)

Permalink

lxml: an underappreciated web scraping library

When people think about web scraping in Python, they usually think BeautifulSoup. That’s okay, but I would encourage you to also consider lxml.

First, people think BeautifulSoup is better at parsing broken HTML. This is not correct. lxml parses broken HTML quite nicely. I haven’t done any thorough testing, but at least the BeautifulSoup broken HTML example is parsed better by lxml (which knows that <td> elements should go inside <table> elements).

Second, people feel lxml is harder to install. This is correct. BUT, lxml 2.2alpha1 includes an option to compile static versions of the underlying C libraries, which should improve the installation experience, especially on Macs. To install this new way, try:


$ STATIC_DEPS=true easy_install 'lxml>=2.2alpha1'
 

One you have lxml installed, you have a great parser (which happens to be super-fast and that is not a tradeoff). You get a fairly familiar API based on ElementTree, which though a little strange feeling at first, offers a compact and canonical representation of a document tree, compared to more traditional representations. But there’s more…

One of the features that should be appealing to many people doing screen scraping is that you get CSS selectors. You can use XPath as well, but usually that’s more complicated (for example). Here’s an example I found getting links from a menu in a page in BeautifulSoup:


from BeautifulSoup import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('http://java.sun.com').read())
menu = soup.findAll('div',attrs={'class':'pad'})
for subMenu in menu:
    links = subMenu.findAll('a')
    for link in links:
        print "%s : %s" % (link.string, link['href'])
 

Here’s the same example in lxml:


from lxml.html import parse
doc = parse('http://java.sun.com').getroot()
for link in doc.cssselect('div.pad a'):
    print '%s: %s' % (link.text_content(), link.get('href'))
 

lxml generally knows more about HTML than BeautifulSoup. Also I think it does well with the small details; for instance, the lxml example will match elements in <div class="pad menu"> (space-separated classes), which the BeautifulSoup example does not do (obviously there are other ways to search, but the obvious and documented technique doesn’t pay attention to HTML semantics).

One feature that I think is really useful is .make_links_absolute(). This takes the base URL of the page (doc.base) and uses it to make all the links absolute. This makes it possible to relocate snippets of HTML or whole sets of documents (as with this program). This isn’t just <a href> links, but stylesheets, inline CSS with @import statements, background attributes, etc. It doesn’t see quite all links (for instance, links in Javascript) but it sees most of them, and works well for most sites. So if you want to make a local copy of a site:


from lxml.html import parse, open_in_browser
doc = parse('http://wiki.python.org/moin/').getroot()
doc.make_links_absolute()
open_in_browser(doc)
 

open_in_browser serializes the document to a temporary file and then opens a web browser (using webbrowser).

Here’s an example that compares two pages using lxml.html.diff:


from lxml.html.diff import htmldiff
from lxml.html import parse, tostring, open_in_browser, fromstring

def get_page(url):
    doc = parse(url).getroot()
    doc.make_links_absolute()
    return tostring(doc)

def compare_pages(url1, url2, selector='body div'):
    basis = parse(url1).getroot()
    basis.make_links_absolute()
    other = parse(url2).getroot()
    other.make_links_absolute()
    el1 = basis.cssselect(selector)[0]
    el2 = other.cssselect(selector)[0]
    diff_content = htmldiff(tostring(el1), tostring(el2))
    diff_el = fromstring(diff_content)
    el1.getparent().insert(el1.getparent().index(el1), diff_el)
    el1.getparent().remove(el1)
    return basis

if __name__ == '__main__':
    import sys
    doc = compare_pages(sys.argv[1], sys.argv[2], sys.argv[3])
    open_in_browser(doc)
 

You can use it like:


$ python lxmldiff.py \
'http://wiki.python.org/moin/BeginnersGuide?action=recall&amp;#038;rev=70' \
'http://wiki.python.org/moin/BeginnersGuide?action=recall&amp;#038;rev=81' \
'div#content'
 

Another feature lxml has is form handling. All the cool sexy new sites use minimal forms, but searching for "registration forms" I get this nice complex form. Let’s look at it:


>>> from lxml.html import parse, tostring
>>> doc = parse('http://www.actuaryjobs.com/cform.html').getroot()
>>> doc.forms
[<Element form at -48232164>]
>>> form = doc.forms[0]
>>> form.inputs.keys()
['thank_you_title', 'City', 'Zip', ... ]
 

Now we have a form object. There’s two ways to get to the fields: form.inputs, which gives us a dictionary of all the actual <input> elements (and textarea and select). There’s also form.fields, which is a dictionary-like object. The dictionary-like object is convenient, for instance:


>>> form.fields['cEmail'] = 'me&#64;example.com'
 

This actually updates the input element itself:


>>> tostring(form.inputs['cEmail'])
'<input type="input" name="cEmail" size="30" value="test2">'
 

I think it’s actually a nicer API than htmlfill and can serve the same purpose on the server side.

But then you can also use the same interface for scraping, by filling fields and getting the submission. That looks like:


>>> import urllib
>>> action = form.action
>>> data = urllib.urlencode(form.form_values())
>>> if form.method == 'GET':
...     if '?' in action:
...         action += '&amp;#038;' + data
...     else:
...         action += '?' + data
...     data = None
>>> resp = urllib.urlopen(action, data)
>>> resp_doc = parse(resp).getroot()
 

Lastly, there’s HTML cleaning. I think all these features work together well, do useful things, and it’s based on an actual understanding HTML instead of just treating tags and attributes as arbitrary. (Also if you really like jQuery, you might want to look at pyquery, which is a jQuery-like API on top of lxml).

HTML
Programming
Python

Comments (51)

Permalink

The Philosophy of Deliverance

I’ll be attending PloneConf this year again, giving a talk about Deliverance. I’ve been working on Deliverance lately for work, but the hard part about it is that it’s not obviously useful. To help explain it I wrote the philosophy of Deliverance, which I will copy here, to give you an idea of what I’ve been doing:

Why is Deliverance? Why was it made, what purpose does it serve, why should you use it, how can it change the way you do web development?

On the Subject of Platforms

Right now we live in an age of platforms. Developers (or management or coincidence) decides on a platform, and that serves as the basis for all future development. Usually there’s some old things from a previous platform (or a primordial pre-platform age: I’m looking at you formmail.pl!) The goal is always to eliminate all of these old pieces, rewriting them for the new platform. That goal is seldom attained in a timely manner, and even before it is accomplished you may be moving to the next platform.

Why do you have to port everything forward to the newest platform? Well, presumably it is better engineered. The newest platform is presumably what people are most familiar with. But if those were the only reasons it would be hard to justify a rewrite of working software. Often the real push comes because your systems don’t work together. It’s hard to keep templates in sync across all the platforms. Multiple logins may be required. Navigation is inconsistent and incomplete. Functionality that cross-cuts pages — comments, login status, shopping cart status, etc — isn’t universally available.

A similar conflict arises when you consider how to add new functionality to a site. For example, you may want to add a blog. Do you:

  1. Use the best blogging software available?
  2. Use something native to your platform?
  3. Write something yourself?

The answer is probably 2 or 3, because it would be too hard to integrate something foreign to your platform. This form of choice means that every platform has some kind of "blog", but the users of that blog are likely to only be a subset of the users of the parent platform. This makes it difficult for winners to emerge, or for a well-developed piece of software to really be successful. Platform-based software is limited by the adoption of the platform.

Not all software has a platform. These tend to be the most successful web applications, things like Trac, WordPress, etc.

"Aha!" you think "I’ll just use those best-of-breed applications!" But no! Those applications themselves turn into platforms. WordPress is practically a CMS. Trac too. Extensible applications, if successful, become their own platform. This is not to place blame, they aren’t necessarily any worse than any other platform, just an acknowledgment that this move to platform can happen anywhere.

Beyond Platforms, or A Better Platform

One of the major goals of Deliverance is to move beyond platforms. It is an integration tool, to allow applications from different frameworks or languages to be integrated gracefully.

There are only a few core reasons that people use platforms:

  1. A common look-and-feel across the site.
  2. Cohesive navigation.
  3. Indexing of the entire site.
  4. Shared authentication and user accounts.
  5. Cross-cutting functionality (e.g., commenting).

Deliverance specifically addresses 1, providing a common look-and-feel across a site. It can provide some help with 2, by allowing navigation to be more centrally managed, without relying purely on per-application navigation (though per-application navigation is still essential to navigating the individual applications). 3, 4, and 5 are not addressed by Deliverance (at least not yet).

Deliverance applies a common theme across all the applications in your site. It’s basic unit of abstraction is HTML. It doesn’t use a particular templating language. It doesn’t know what an object is. HTML is something every web application produces. Deliverance’s means of communication is HTTP. It doesn’t call functions or create request objects [*]. Again, everything speaks HTTP.

Deliverance also allows you to include output from multiple locations. In all cases there’s the theme, a plain HTML page, and the content, whatever the underlying application returns. You can also include output from other parts of the site, most commonly navigation content that you can manage separately. All of these pieces can be dynamic — again, Deliverance only cares about HTML and HTTP, it doesn’t worry about what produces the response.

This is all very similar to systems built on XSLT transforms, except without the XSLT [†], and without XML. Strictly speaking you can apply XSLT to any parseable markup, even HTML, but the most common (or at least most talked about) way to apply XSLT is using "semantic" XML output that is transformed into HTML. Deliverance does not try to understand the semantics of applications, and instead expects them to provide appropriate presentation of whatever semantics the underlying application possesses. Presentation is more universal than semantics.

While Deliverance does its best to work with applications as-they-exist, without making particular demands on those applications, it is not perfect. Conflicting CSS can be a serious problem. Some applications don’t have very good structure to work with. You can’t generate any content in Deliverance, you can only manipulate existing content, and often that means finding new ways to generate content, or making sure you have a place to store your content (as in the case of navigation). This is why arguably Deliverance does not remove the need for a platform, but is just its own platform. In so far as this is true, Deliverance tries to be a better platform, where "better" is "more universal" rather than "more powerful". Most templating systems are more powerful than Deliverance transformations. It can be useful to have access to the underlying objects used to procude the markup. But Deliverance doesn’t give you these things, because it only implements things that can be applied to any source of content. Static files are entirely workable in Deliverance, just as any application written in Python, PHP, or even an application hosted on an entirely separate service is usable through Deliverance.

The Missing Parts

As mentioned before, two important benefits of a platform are missing from Deliverance. I’ll try to describe what I believe are the essential aspects. I hope at some time that Deliverance or some complementary application will be able to satisfy these needs. Also, I suggest some lines of development that might be easier than others.

Indexing The Entire Site

Typically each application has a notion of what all the interesting pages in that application are. Most applications have a set of uninteresting pages, or transient pages. A search result is transient, as an example. An application also knows when new pages appear, and when other pages disappear. A site-wide index of these pages would allow things like site maps, cross-application search, and cross-application reporting to be done.

An interesting exception to the knowledge an application has of itself: search results are generally boring. But a search result based on a category might still be interesting. The difference between a "search" and a "report" is largely in the eye of the beholder. An important feature is that the application shouldn’t be the sole entity allowed to mark interesting pages. Manually-managed lists of resources that may point to specific applications can allow people to usefully and easily tweak the site. Ideally even fully external resources could be included, such as a resource on an entirely different site.

To do indexing you need both events (to signal the creation, update, or deletion of an entity/page), and a list of entities (so the index can be completely regenerated). A simple way of giving a list of entities would be the Google Site Map XML resource. Signaling events is much more complex, so I won’t go into it in any greater depth here, but we’re working on a product called Cabochon to handle events.

One thing that indexing can provide is a way to use microformats. Right now microformats are interesting, but for most sites they are largely useless. You can mark up your content, but no one will do anything interesting with that markup. If you could easily code up an indexer that could keep up-to-date on all the content on your site, you could produce interesting results like cross-application mapping.

Shared Authentication And User Accounts

Authentication is one of the most common and annoying integration tasks when crossing platform boundaries. Systems like Open ID offer the ability to unify cross-site authentication, but they don’t actually solve the problem of a single site with multiple applications.

There is a basic protocol in HTTP for authentication, one that is workable for a system like Deliverance, and there are already several existing products (like repoze.who) that work this way. It works like this:

  • The logged-in username is sent in some header, e.g., X-Remote-User. Some kind of signing is necessary to really trust this header (Deliverance could filter out that header in incoming requests, but if you removed Deliverance from the stack you’d have a security hole).
  • If the user isn’t logged in, and the application wants them to log in, the application response with a 401 Unauthorized response. It is supposed to set the WWW-Authenticate header, probably to some value indicating that the intermediary should determine the authentication type. In some cases a kind of HTTP authentication is required (typically Basic or Digest) because cookie-based logins are too stateful (e.g., in APIs, or for WebDAV access).
  • The intermediary catches the 401 and initiates the login process. This might mean a redirect to a login page, and setting a cookie on successful login. The login page and setting the cookie could potentially be done by an application outside of the intermediary; the intermediary only has to do the appropriate redirects and setting of headers.
  • In the case when a user is logged in but isn’t permitted, the application simply sends a 403 Forbidden response. The intermediary shouldn’t actually do anything in this case (though maybe it could usefully add a logout link to that message). I only mention this because some systems use 401 for Forbidden, which causes no end of problems.

While some applications allow for this kind of authentication scheme, many do not. However, the scheme is general enough that I think it is justifiable that applications could be patched to work like this.

This handles shared authentication, but the only information handed around is a username. Information about the user — the real name, email, homepage, permission roles, etc — are not shared in this model.

You could add something like an internal location to the username. E.g.: X-Remote-User: bob; info_url=http://mysite.com/users/bob.xml. It would be the application’s responsibility to make a subrequest to fetch that information. This can be somewhat inefficient, though with appropriate caching perhaps it would be fine. But many applications want very much to have a complete record of all users. Changing this is likely to be much harder than changing the authentication scheme. A more feasible system might be something on the order of what is described in Indexing the Entire Site: provide a complete listing of the site as well as events when users are created, updated, or deleted, and allow applications to maintain their own private but synced databases of users.

A common permission system is another level of integration. One way of handling this would be if applications had a published set of actions that could be performed, and the person integrating the application could map actions to roles/groups on the system.

Cross-cutting Functionality

This item requires a bit of explanation. This is functionality that cuts across multiple parts of the site. An example might be comments, where you want a commenting system to be applicable to a variety of entities (though probably not all entities). Or you might want page-update notification, or to provide a feed of changes to the entity.

You might also want to include some request logger like Google Analytics to all pages, but this is already handled well by Deliverance theming. Deliverance’s aggregation handles universal content well, but it doesn’t handle content (or subrequests) that should only be present in a portion of pages.

One possible way to address this is transclusion, where a page can specifically request some other resource to be included in the page. A simple subrequest could accomplish this, but many applications make it relatively easy to include some extra markup (e.g., by editing their templates) but not so easy to do something like a subrequest. We’ve written a product Transcluder to use an HTML format to indicate transclusion.

It’s also possible using Deliverance that you could implement this functionality without any application modification, though it means added configuration — an application written to be inserted into a page via Deliverance, and a Deliverance rule that plugs everything together (but if written incorrectly would have to be debugged).

Other Conventions

In addition to this, other platform-like conventions would make the life of the integrator much easier.

Template Customization

While Deliverance handles the look-and-feel of a page, it leaves the inner chunk of content to the application. If you want to tweak something small you will still need to customize the template of the application.

It would be wonderful if applications could report on what files were used in the construction of a request, and used a common search path so you could easily override those files.

Backups and Other Maintenance

Process management can be handled by something like Supervisor, and maybe in the future Deliverance will even embed Supervisor.

But even then, regular backups of the system are important. Typically each application has its own way of producing a backup. Conventions for producing backups would be ideal. Additional conventions for restoring backups would be even better.

Many systems also require periodic maintenance — compacting databases, checking for any integrity problems, etc. Some unified cron-like system might be handy, though it’s also workable for applications to handle this internally in whatever ad hoc way seems appropriate.

Common Error Reporting

With a system where one of many components can fail, it’s important to keep track of these problems. If errors just end up in one of 10 log files, it’s unlikely anyone is closely tracking them.

One product we’re working on to help with this is ErrorEater, which works along with Supervisor. Applications have to be modified to emit errors in a specific format that Supervisor understands, but this is generally not too difficult.

Farming

Application farming is when one instance of an application can support many "sites". These might be sites with their own domains, or just distinct projects. Examples are Trac, which supports multiple projects in one instance, or WordPress MU which supports many WordPress instances running off a single database and code base.

It would be nice if you could add a simple header to a request, like X-Project-Name: foo and that would be used by all these products to select the site (or sub-site or project or any other organization unit). Then mapping domain names, paths, or other aspects of a request to the project could be handled once and the applications could all consistently consume it.

(Internally for openplans.org we’re using X-OpenPlans-Project and custom patches to several projects to support this, but it’s all ad hoc.)

Footnotes

[*] This isn’t entirely true, Deliverance internally uses WSGI which is a Python-level abstraction of HTTP calls.
[†] At different times in the past, in an experimental branch right now, and potentially integrated in the future, Deliverance has been compiled down to XSLT rules. So Deliverance could be seen even as an simple transformation language that compiles down to XSLT.

HTML
Programming
Python
Web

Comments (2)

Permalink

Inverted Partials

I was talking with a coworker some time ago about his project, and he needed to update a piece of the page in-place when you go back to the page, and setting the page as uncacheable didn’t really work. Which probably makes sense; I think at one time browsers did respect those cache controls, but as a result going back in history through a page could cause some intermediate page to be refreshed and needlessly slow down your progress.

Anyway, Rails uses partials to facilitate this kind of stuff in a general way. Bigger chunks of your page are defined in their own template, and instead of rendering the full page you can ask just for a chunk of the page. Then you do something like document.getElementById('some_block').innerHTML = req.responseText. Mike Bayer just described how to do this in Mako too, using template functions.

When asked, another technique also occurred to me, using just HTML. Just add a general way of fetching an element by ID. At any time you say "refresh the element with id X", and it asks the server for the current version of that element (using a query string variable document_id=X) and replaces the content of that element in the browser.

The client side looks like this (it would be much simpler if you used a Javascript library):


function refreshId(id) {
    var el = document.getElementById(id);
    if (! el) {
        throw("No element by id '" + id + "'");
    }
    function handler(data) {
        if (this.readyState == 4) {
            if (this.status == 200) {
                el.innerHTML = this.responseText;
            } else {
                throw("Bad response getting " + idURL + ": "
                      + this.status);
            }
        }
    }
    var req = new XMLHttpRequest();
    req.onreadystatechange = handler;
    var idURL = location.href + '';
    if (idURL.indexOf('?') == -1) {
        idURL += '?';
    } else {
        idURL += '&amp;#038;';
    }
    idURL += 'document_id='+escape(id);
    req.open("GET", idURL);
    req.send();
}
 

Then you need the server-side component. Here’s something written for Pylons (using lxml.html, and Pylons 0.9.7 which is configured to use WebOb):


from pylons import request, response
from lxml import html

def get_id(response, id):
    if (response.content_type == 'text/html'
        and response.status_int == 200):
        doc = html.fromstring(response.body)
        try:
            el = doc.get_element_by_id(id)
        except KeyError:
            pass
        else:
            response.body = html.tostring(el)
    return response

class BaseController(WSGIController):
    def __after__(self):
        id = req.GET.get('document_id')
        if id:
            get_id(response, id)
 

Though I’m not sure this is appropriate for middleware, you could do it as middleware too:


from webob import Request
class DocumentIdMiddleware(object):
    def __init__(self, app):
        self.app = app
    def __call__(self, environ, start_response):
        req = Request(environ)
        id = req.GET.get('document_id')
        if not id:
            return self.app(environ, start_response)
        resp = req.get_response(self.app)
        resp = get_id(resp, id)
        return resp(environ, start_response)
 

HTML
Programming
Python

Comments (6)

Permalink

Python HTML Parser Performance

In preparation for my PyCon talk on HTML I thought I’d do a performance comparison of several parsers and document models.

The situation is a little complex because there’s different steps in handling HTML:

  1. Parse the HTML
  2. Parse it into something (a document object)
  3. Serialize it

Some libraries handle 1, some handle 2, some handle 1, 2, 3, etc. For instance, ElementSoup uses ElementTree as a document, but BeautifulSoup as the parser. BeautifulSoup itself has a document object included. HTMLParser only parses, while html5lib includes tree builders for several kinds of trees. There is also XML and HTML serialization.

So I’ve taken several combinations and made benchmarks. The combinations are:

  • lxml: a parser, document, and HTML serializer. Also can use BeautifulSoup and html5lib for parsing.
  • BeautifulSoup: a parser, document, and HTML serializer.
  • html5lib: a parser. It has a serializer, but I didn’t use it. It has a built-in document object (simpletree), but I don’t think it’s meant for much more than self-testing.
  • ElementTree: a document object, and XML serializer (I think newer versions might include an HTML serializer, but I didn’t use it). It doesn’t have a parser, but I used html5lib to parse to it. (I didn’t use the ElementSoup.)
  • cElementTree: a document object implemented as a C extension. I didn’t find any serializer.
  • HTMLParser: a parser. It didn’t parse to anything. It also doesn’t parse lots of normal (but maybe invalid) HTML. When using it, I just ran documents through the parser, not constructing any tree.
  • htmlfill: this library uses HTMLParser, but at least pays a little attention to the elements as they are parsed.
  • Genshi: includes a parser, document, and HTML serializer.
  • xml.dom.minidom: a document model built into the standard library, which html5lib can parse to. (I do not recommend using minidom for anything — some reasons will become apparent in this post, but there are many other reasons not covered why you shouldn’t use it.)

I expected lxml to perform well, as it is based on the C library libxml2. But it performed better than I realized, far better than any other library. As a result, if it wasn’t for some persistent installation problems (especially on Macs) I would recommend lxml for just about any HTML task.

You can try the code out here. I’ve included all the sample data, and the commands I ran for these graphs are here. These tests use a fairly random selection of HTML files (355 total) taken from python.org.

Parsing

lxml:0.6; BeautifulSoup:10.6; html5lib ElementTree:30.2; html5lib minidom:35.2; Genshi:7.3; HTMLParser:2.9; htmlfill:4.5

The first test parses the documents. Things to note: lxml is 6x faster than even HTMLParser, even though HTMLParser isn’t doing anything (lxml is building a tree in memory). I didn’t include all the things html5lib can parse to, because they all take about the same amount of time. xml.dom.minidom is only included because it is so noticeably slow. Genshi is fairly fast, but it’s the most fragile of the parsers. html5lib, lxml, and BeautifulSoup are all fairly similarly robust. html5lib has the benefit of (at least in theory) being the correct parsing of HTML.

While I don’t really believe it matters often, lxml releases the GIL during parsing.

Serialization

lxml:0.3; BeautifulSoup:2.0; html5lib ElementTree:1.9; html5lib minidom:3.8; Genshi:4.4

Serialization is pretty fast across all the libraries, though again lxml leads the pack by a long distance. ElementTree and minidom are only doing XML serialization, but there’s no reason that the HTML equivalent would be any faster. That Genshi is slower than minidom is surprising. That anything is worse than minidom is generally surprising.

Memory

lxml:26; BeautifulSoup:82; BeautifulSoup lxml:104; html5lib cElementTree:54; html5lib ElementTree:64; html5lib simpletree:98; html5lib minidom:192; Genshi:64; htmlfill:5.5; HTMLParser:4.4

The last test is of memory. I don’t have a lot of confidence in the way I made this test, but I’m sure it means something. This was done by parsing all the documents and holding the documents in memory, and using the RSS size reported by ps to see how much the process had grown. All the libraries should be imported when calculating the baseline, so only the documents and parsing should cause the memory increase.

HTMLParser is a baseline, as it just keeps the documents in memory as a string, and creates some intermediate strings. The intermediate strings don’t end up accounting for anything, since the memory used is almost exactly the combined size of all the files.

A tricky part of this measurement is that the Python allocator doesn’t let go of memory that it requests, so if a parser creates lots of intermediate strings and then releases them the process will still hang onto all that memory. To detect this I tried allocating new strings until the process size grew (trying to detect allocated but unused memory), but this didn’t reveal much — only the BeautifulSoup parser, serialized to an lxml tree, showed much extra memory.

This is one of the only places where html5lib with cElementTree was noticeably different than html5lib with ElementTree. Not that surprising, I guess, since I didn’t find a coded-in-C serializer, and I imagine the tree building is only going to be a lot faster for cElementTree if you are building the tree from C code (as its native XML parser would do).

lxml is probably memory efficient because it uses native libxml2 data structures, and only creates Python objects on demand.

In Conclusion

I knew lxml was fast before I started these benchmarks, but I didn’t expect it to be quite this fast.

So in conclusion: lxml kicks ass. You can use it in ways you couldn’t use other systems. You can parse, serialize, parse, serialize, and repeat the process a couple times with your HTML before the performance will hurt you. With high-level constructs many constructs can happen in very fast C code without calling out to Python. As an example, if you do an XPath query, the query string is compiled into something native and traverses the native libxml2 objects, only creating Python objects to wrap the query results. In addition, things like the modest memory use make me more confident that lxml will act reliably even under unexpected load.

I also am more confident about using a document model instead of stream parsing. It is sometimes felt that streamed parsing is better: you don’t keep the entire document in memory, and your work generally scales linearly with your document size. HTMLParser is a stream-based parser, emitting events for each kind of token (open tag, close tag, data, etc). Genshi also uses this model, with higher-level stuff like filters to make it feel a bit more natural. But the stream model is not the natural way to process a document, it’s actually a really awkward way to handle a document that is better seen as a single thing. If you are processing gigabyte files of XML it can make sense (and both the normally document-oriented lxml and ElementTree offer options when this happens). This doesn’t make any sense for HTML. And these tests make me believe that even really big HTML documents can be handled quite well by lxml, so a huge outlying document won’t break a system that is appropriately optimized for handling normal sized documents.

HTML
Programming
Python

Comments (39)

Permalink

HTML Accessibility

So I gave a presentation at PyCon about HTML, which I ended up turning into an XML-sucks HTML-rocks talk. Well that’s a trivialization, but I have the privilege of trivializing my arguments all I want.

Somewhat to my surprise this got me a heckler (of sorts). I think it came up when I was making my <em> lies and <i> is truth argument. That is, presentation and intention are the same. There are those people who feel they can separate the two, creating semantic markup that represents their intent, but they are so few that the reader can never trust that the distinction is intentional, and so <i> and <em> must be treated as equivalent.

Someone then yelled out something like "what about blind people?" The argument being that screen readers would like to distinguish between the two, as not all things we render as italic would be read with emphasis.

It’s not surprising to me that the first time I’ve gotten an actively negative reaction to a talk it was about accessibility. When having technical discussions it’s hard to get that heated up. Is Python or Ruby better? We can talk shit on the web, where all emotions get mixed up and weirded, but in person these discussions tend to be quite calm and reasonable.

Discussions about accessibility, however, have strong moral undertones. This isn’t just What Tool Is Right For The Job. There is a kind of moral certainty to the argument that we should be making a world that is accessible to all people.

I fear this moral certainty has led people self-righteously down unwise paths. They believe — with of course some justification — that the world must be made right. And so many boil-the-ocean proposals are made, and even become codified by standards, but markup standards are useless unless embodied in actual content, and this is where accessibility falls down.

There are two posts that together have greatly eroded my trust in accessibility advocates, so that I feel like I am left adrift, unwilling to jump through the hoops accessibility advocates put up as I strongly suspect they are pointless.

The first post is about the longdesc attribute, an obscure attribute intended to tell the story of a picture. Where alt is typically used as a placeholder for the image, and a short description, longdesc can point to a document that describes the image in length. Empirically they (Ian Hickson in particular) found that the attribute was almost never used in a useful or correct way, rendering it effectively useless. If the discussion had clearly ended at this point, I would have deducted points for those people use advocated longdesc based on bad judgement, but it would not have effected my trust because anyone can mispredict. But the comments just seemed to reinforce the belief that because it should work, that it would work.

The second post was Ian Hickson’s description of using a popular screen reader (JAWS) — you’ll have to dig into the article some, as it’s embedded in other wandering thoughts. In summary, JAWS is a horrible experience, and as an example it didn’t even understand paragraph breaks (where the reader would be expected to pause). What’s the point of semantic markup for accessibility when the most basic markup that is both presentation and semantic (<p>) is ignored? Ian’s brief summary is that if you want to make your page readable in JAWS you’d do better by paying attention to punctuation (which does get read) than to markup. And if you want to help improve accessibility, blind people need a screen reader that isn’t crap.

Months later we started talking a bit about the accessibility of openplans.org. Everyone wants to do the right thing, no? With my trust eroded, I argued strongly that we should only implement accessibility empirically, not based on "best practices". Well, barring some patterns that seem very logical to me, like putting navigation textually at the bottom of the page, and other stuff that any self-respecting web developer does these days. But except for that, if we want to really consider accessibility we should get a tool and use it. But I don’t really know what that tool should be; JAWS is around $1000, all for what sounds like a piece of crap product. We could buy that, even though of course most web developers couldn’t possibly justify the purchase. But is that really the right choice? I don’t know. If we could detect something in the User-Agent string we could see what our users actually use. But I don’t think there’s information there. And I don’t know what people are using. Optimizing for screen magnifiers is much different that optimizing for screen readers.

Another shortcut for accessibility — a shortcut I also distrust — is that to make a site accessible you make sure it works without Javascript. But don’t many screen readers work directly off browsers? Browsers implement Javascript. Do blind users turn Javascript off? I don’t know. If you use no-Javascript as a hint to make the site more accessible, you might just be wasting your effort.

There’s also some weird perspective problems with accessibility. Blind users will always be a small portion of the population. It’s just unreasonable to expect sighted users to write to this small population. Relying on hidden hints in content to provide accessibility just can’t work. Hidden content will be broken, only visible content can be trusted. Admitting this does not mean giving up. As a sighted reader I do not expect the written and spoken word to be equivalent. I don’t think blind listeners lose anything by hearing something that is more a dialect specific to the computer translation of written text to spoken text. (Maybe treating text-to-speech as a translation effort would be more successful anyway?)

A freely available screen reader probably would help a lot as well. I write my markup to render in browsers, not to render to a spec. Anything else is just bad practice. I can’t seriously write my markup for readers based on a spec.

HTML
Programming
Web

Comments (9)

Permalink