Thursday, August 2nd, 2007

Atom Models

I’ve been doing a bit more with Atom lately.

First, I started writing a library to manipulate Atom feeds and entries. For the moment this is located in atom.py. It uses lxml, as does everything markup-related I do these days.

I came upon a revelation of sorts when I was writing the library. I first started with a library that looked like this:

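from lxml import etree

atom_ns = 'http://www.w3.org/2005/Atom'

class Feed(object):
    def __init__(self, title, ...):
        self.title = title
        ...
    @classmethod
    def parse(cls, xml):
        # Accept either a string of markup or an already-parsed element
        if isinstance(xml, basestring):
            xml = etree.XML(xml)
        # findtext() returns the text of the first matching child.
        # (xpath() returns a list, which has no .text, and an
        # unqualified '//title' would not match Atom's namespace.)
        title = xml.findtext('{%s}title' % atom_ns)
        ...
        return cls(title, ...)
    def serialize(self):
        # Rebuild the markup by hand from the object's attributes
        el = etree.Element('{%s}feed' % atom_ns)
        title = etree.Element('{%s}title' % atom_ns)
        title.text = self.title
        el.append(title)
        ...
        return el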

Obviously there are ways to improve this and make it less verbose, and I went down that path for a while. But then I decided the whole path was wrong. Atom is XML. It’s not the representation of some object I’m creating. If I have something that can’t be represented in XML, it isn’t Atom, and it doesn’t belong in my Atom-related objects.

So instead I started making lxml more convenient when using Atom. I don’t keep any information except what is in the markup; I just make it more convenient to access that information.

I used lots of descriptors to do this, as the same patterns happened over and over. For instance, the Feed object is fairly simple:

class Feed(AtomElement):
    entries = _findall_property('entry')
    author = _element_property('author')

This basically means that feed.entries returns all <entry> elements, and feed.author returns the single <author> element.
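A minimal sketch of how descriptors like these might be written. This is not the actual atom.py code: the descriptor bodies, the AtomElement base class deriving from lxml's etree.ElementBase, and the namespace class lookup are all assumptions made to give the Feed class above something to run on.

```python
from lxml import etree

ATOM_NS = 'http://www.w3.org/2005/Atom'


class _findall_property(object):
    """Descriptor returning every child element with the given Atom tag."""

    def __init__(self, tag):
        self.tag = '{%s}%s' % (ATOM_NS, tag)

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # accessed on the class rather than an element
        return obj.findall(self.tag)


class _element_property(object):
    """Descriptor returning the first matching child element, or None."""

    def __init__(self, tag):
        self.tag = '{%s}%s' % (ATOM_NS, tag)

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.find(self.tag)


class AtomElement(etree.ElementBase):
    """Assumed base class: an ordinary lxml element with no extra state."""


class Feed(AtomElement):
    entries = _findall_property('entry')
    author = _element_property('author')


# Tell lxml which Python class to use for which Atom tag.
lookup = etree.ElementNamespaceClassLookup()
atom = lookup.get_namespace(ATOM_NS)
atom['feed'] = Feed
atom[None] = AtomElement  # default for every other Atom element

parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

feed = etree.XML(
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b'<author><name>Ian</name></author>'
    b'<entry><id>urn:x:1</id></entry>'
    b'<entry><id>urn:x:2</id></entry>'
    b'</feed>', parser)

print(type(feed).__name__, len(feed.entries))
print(feed.author.findtext('{%s}name' % ATOM_NS))
```

The descriptors store nothing themselves. Each access is a fresh find() or findall() against the tree, so there is no second copy of the data to fall out of sync with the markup.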

There are also accessors for text elements (like <id>), for date-containing elements (like <updated>), and for exposing XML attributes as Python attributes.
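Those three kinds of accessor could look roughly like the following. The names _text_property, _date_property and _attr_property, and the Entry and Link classes, are hypothetical and chosen only to match the style of the Feed example. The date handling is a deliberate simplification that only understands UTC timestamps ending in "Z", not every date form Atom allows.

```python
from datetime import datetime
from lxml import etree

ATOM_NS = 'http://www.w3.org/2005/Atom'
DATE_FORMAT = '%Y-%m-%dT%H:%M:%SZ'  # simplification: UTC timestamps only


class _text_property(object):
    """Read and write the text of a single child element."""

    def __init__(self, tag):
        self.tag = '{%s}%s' % (ATOM_NS, tag)

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.findtext(self.tag)

    def __set__(self, obj, value):
        el = obj.find(self.tag)
        if el is None:
            el = etree.SubElement(obj, self.tag)  # create the child on demand
        el.text = value


class _date_property(_text_property):
    """Like _text_property, but converts to and from datetime."""

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        text = _text_property.__get__(self, obj, objtype)
        return None if text is None else datetime.strptime(text, DATE_FORMAT)

    def __set__(self, obj, value):
        _text_property.__set__(self, obj, value.strftime(DATE_FORMAT))


class _attr_property(object):
    """Expose an XML attribute as a Python attribute."""

    def __init__(self, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.get(self.name)

    def __set__(self, obj, value):
        obj.set(self.name, value)


class Entry(etree.ElementBase):
    id = _text_property('id')
    updated = _date_property('updated')


class Link(etree.ElementBase):
    href = _attr_property('href')


lookup = etree.ElementNamespaceClassLookup()
atom = lookup.get_namespace(ATOM_NS)
atom['entry'] = Entry
atom['link'] = Link
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

entry = etree.XML(
    b'<entry xmlns="http://www.w3.org/2005/Atom">'
    b'<id>urn:x:1</id>'
    b'<updated>2007-08-02T12:00:00Z</updated>'
    b'<link href="http://example.org/1"/>'
    b'</entry>', parser)

entry.id = 'urn:x:2'  # rewrites the text of <id> in the tree itself
link = entry.find('{%s}link' % ATOM_NS)
print(entry.id, entry.updated, link.href)
print(etree.tostring(entry))
```

Because the setters write straight into the element, serializing the entry afterwards shows the new <id> without any separate "save" step.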

There are a number of advantages:

  • No hidden state.
  • No deferred errors, since everything is always represented in the XML infoset.
  • All XML extensions work, even though my classes don’t know anything in particular about them. There’s a full API for manipulating the XML that you can use; you don’t have to use my APIs.
  • Even more obscure kinds of extensions work fine, like a custom attribute on an element. There’s absolutely zero normalization that happens.
  • I only have to write the parts where the normal XML (lxml) APIs are inconvenient, so the implementation stays simple.
  • There’s no confusion over which object I might be talking about in my code. There’s no distinction between the XML object and the domain object.
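To illustrate the points about extensions, here is a small sketch. The extension namespace http://example.org/ext, its rating element and its flag attribute are made up for the example, and the Entry class is a stand-in assumed to derive from lxml's etree.ElementBase. The class knows nothing about the extension, yet the extension data is reachable through the ordinary lxml API and survives serialization.

```python
from lxml import etree

ATOM_NS = 'http://www.w3.org/2005/Atom'
EXT_NS = 'http://example.org/ext'  # made-up extension namespace


class Entry(etree.ElementBase):
    """Stand-in Atom entry class; it has no knowledge of EXT_NS."""


lookup = etree.ElementNamespaceClassLookup()
lookup.get_namespace(ATOM_NS)['entry'] = Entry
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

entry = etree.XML(
    b'<entry xmlns="http://www.w3.org/2005/Atom"'
    b' xmlns:ex="http://example.org/ext" ex:flag="yes">'
    b'<id>urn:x:1</id>'
    b'<ex:rating>4</ex:rating>'
    b'</entry>', parser)

# Plain lxml calls reach the extension element and the custom attribute.
rating = entry.findtext('{%s}rating' % EXT_NS)
flag = entry.get('{%s}flag' % EXT_NS)

# Serialize and parse again: the extension data is still there.
reparsed = etree.XML(etree.tostring(entry), parser)
survived = (reparsed.findtext('{%s}rating' % EXT_NS) == rating
            and reparsed.get('{%s}flag' % EXT_NS) == flag)
print(rating, flag, survived)
```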

Since then I’ve been working on a Javascript library for handling Atom. It’s not as elegant. I am trying to keep to this same principle, but of course I can’t actually extend the DOM and so I can’t add convenience methods. So instead I’m making a class that lightly wraps the DOM objects, with explicit getters and setters that simply read and modify those DOM objects.

One thing that I have found very useful in my development on the Javascript side is doctest-style testing. You can see the test, but to run it you have to check it out (it uses some svn:externals which you don’t get through the direct svn access). After using that testing some more and being pleased with the result, I decided to package the Javascript doctest runner a bit better. I removed the framework dependencies, did a bit of renaming (now it is doctestjs or doctest.js instead of jsdoctest), wrote up fairly comprehensive docs, and uploaded it to JSAN (though at the moment the trunk from svn is probably better to use). I think it’s an excellent way of doing unit testing in Javascript, much better than any of the alternatives I’ve seen. It even has some notable advantages over Python’s doctest: if you are using Firebug (which you must if you do Javascript development), you get a console session that runs in the same namespace as your tests, so you can easily inspect the objects if there’s a failure.

I’m not sure about JSAN. It’s nice to have an index. But I think they copy stuff from CPAN a bit too much. Why should you have a text README file? That’s just silly; of course Javascript documentation should be HTML. They batch their processing, even though processing one package a day on the fly shouldn’t be overwhelming. They want a MANIFEST file. The standard metadata file is YAML, not JSON. This should all be a little more Javascripty in my opinion. But they also accept any kind of upload, so there’s nothing stopping you from ignoring what you don’t care about. I’ll probably improve the packaging of doctestjs a bit in the future, and still ignore the parts I think are silly.

This is the personal site of Ian Bicking. The opinions expressed here are my own.