Ian Bicking: the old part of his blog

My first bit of ElementTree

There was a discussion about htmlfill and FormEncode on the Subway list a while ago. One of the things that occurred to me during the discussion is that htmlfill would be a lot simpler and more reliable if it didn't use HTMLParser, and just worked with a nice DOM-ish tree.

And then I thought, well, if you are going to generate a form and then pass it to htmlfill (one of a couple options), wouldn't it be nice if you passed in the already-parsed tree, instead of reparsing? Saves a few cycles at least.

In FormEncode I made a little module, simplehtmlgen, to generate the HTML -- it's kind of like HTMLGen, but a little more isomorphic to HTML/XML. More like stan, really. Well, I could use stan (which also produces a DOM-like object), but I decided to try ElementTree instead, since I feel vaguely like it's becoming the standard for Pythonic XML. It's not perfect for my purposes -- it might be too XML, where I would prefer a more lax perspective that would better accommodate HTML.

Anyway, I wrote a module for ElementTree, htmlgen. You use it like:

html.textarea(name='entry', class_='big_field')(text_content)

And you get back subclasses of ElementTree's Elements (which you can continue to call to add more attributes or content). The subclass also adds a __str__ method which serializes the XML (using a default encoding -- I'm not 100% comfortable with a default encoding, but it seems like a good idea to my naive unicode mind). Anyway, about ElementTree...
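A rough sketch of how a callable Element subclass like this could be put together. This is not the actual htmlgen code: BuilderElement, Builder, and _add are hypothetical names, and the sketch assumes the ElementTree that ships in the standard library as xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


class BuilderElement(ET.Element):
    """Hypothetical Element subclass; calling it adds attributes or content."""

    def __call__(self, *content, **attrs):
        for name, value in attrs.items():
            # A trailing underscore lets callers write class_ for "class".
            self.set(name.rstrip('_'), str(value))
        for item in content:
            self._add(item)
        return self

    def _add(self, item):
        if isinstance(item, (list, tuple)):
            # Flatten nested sequences of nodes and strings.
            for sub in item:
                self._add(sub)
        elif isinstance(item, ET.Element):
            self.append(item)
        elif len(self):
            # Text that follows a child element goes in that child's tail.
            self[-1].tail = (self[-1].tail or '') + str(item)
        else:
            self.text = (self.text or '') + str(item)

    def __str__(self):
        # Serialize straight to text rather than to encoded bytes.
        return ET.tostring(self, encoding='unicode')


class Builder:
    """Hypothetical factory: html.textarea(...) creates a <textarea> element."""

    def __getattr__(self, tag):
        return lambda *content, **attrs: BuilderElement(tag)(*content, **attrs)


html = Builder()
field = html.textarea(name='entry', class_='big_field')('some text')
print(field)
```

Because each call returns the element itself, calls can be chained the same way as in the one-liner above.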

One of the odd parts of ElementTree is how it deals with text. Tags have a text attribute, which is the text immediately contained in the tag, and a tail attribute which contains the text immediately after the tag ends. There's no text node or text structure that is a child of another tag. There's also no object to represent a set of nodes (except a normal list) so I had to be careful to flatten lists (since I do want to handle sets of tags that aren't a valid XML tree). Anyway, I think this library simplifies some of that, things you'd mostly notice if you are building trees with ElementTree instead of parsing XML documents.
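A small illustration of text and tail, assuming the standard-library xml.etree.ElementTree; the sample markup is made up for the example.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p>Hello <b>big</b> world</p>')
b = p[0]

print(repr(p.text))  # 'Hello '  (text directly inside <p>, before <b>)
print(repr(b.text))  # 'big'     (text inside <b>)
print(repr(b.tail))  # ' world'  (text after </b>, stored on <b>, not on <p>)
```

The string ' world' is inside <p> in the document, but it lives on the <b> element, which is the part that takes some getting used to when building trees by hand.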

Another odd thing is that there's no way to serialize nodes to unicode -- to do that I had to serialize them to bytes and then decode to unicode. Seemed like a weird omission. And you can't put any kind of unparsed literal into the tree, you can only put real nodes in, so there's no way to make a literal class/function/builder. This makes sense from a parsing point of view (since you couldn't reparse the serialized output if it wasn't valid), but it is a common feature of HTML builders. Instead I guess you just have to parse XML strings before inserting them, which is easy enough.
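A minimal sketch of both workarounds, assuming the standard-library xml.etree.ElementTree; the markup and variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

div = ET.Element('div')

# There is no "literal" node, so parse the markup first and
# insert the resulting element into the tree.
fragment = ET.fromstring('<em>already <b>marked up</b></em>')
div.append(fragment)

# Serialize to bytes, then decode to get text.
as_bytes = ET.tostring(div, encoding='utf-8')
as_text = as_bytes.decode('utf-8')

# Later versions of the standard-library ElementTree (Python 3) also accept
# encoding='unicode', which returns text directly.
direct = ET.tostring(div, encoding='unicode')
```

The fragment has to be well-formed XML on its own, with a single root element, or fromstring will raise a parse error.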

One positive point (which from another perspective might be a negative): ElementTree doesn't seem very namespace-aware, so I can create tags and attributes with : in them (which means I can generate ZPT).
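For example, something like this goes through without any namespace being declared. This is a sketch against the standard-library xml.etree.ElementTree, and the TAL expression is just a placeholder.

```python
import xml.etree.ElementTree as ET

# The attribute name contains ":" but no namespace is declared anywhere.
span = ET.Element('span', {'tal:content': 'here/title'})
span.text = 'placeholder'

out = ET.tostring(span, encoding='unicode')
print(out)
```

The colon is treated as an ordinary character in the name, so the attribute is written out as given.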

I feel a little bad about subclassing Element (technically _ElementInterface), because it means there's a more-featureful class of nodes that can easily be mixed in with a less-featureful class (or vice versa). The builder syntax isn't a big deal -- there's no real reason to use that in lieu of the normal methods when manipulating a tree that is already created. Things like __str__ are likely to be useful, but they become limiting if you depend on them.

Created 24 Feb '05

Comments:

"Another odd thing is that there's no way to serialize nodes to unicode --"..

this could help..

from cElementTree import Element, tostring

html = Element('html')

tostring(html, encoding='utf-8') # -> returns html node serialized into a string using specified encoding

# daf

Right, which is why I implemented a __unicode__ method like:

def __unicode__(self):
    return tostring(self, 'utf-8').decode('utf-8')

Doesn't that seem really weird, though?

# Ian Bicking

Unfortunately, htmlgen is already taken in the Python HTML generator namespaces. It's a really old (but still fairly widely used) module available on the starship.

# David Ascher

Yes, but it's HTMLgen not htmlgen... ;) Probably mine should be called etgen or something. Anyway, for now it's part of a package, so there's no real name conflict.

# Ian Bicking

"htmlfill would be a lot simpler and more reliable if it didn't use HTMLParser"

Guess it's probably obvious to those who have already realised it, but it only recently occurred to me that parsing HTML could be a lot easier (particularly when it comes to preserving things like whitespace) with a specific parser that gets only what you need while regarding the rest of the document as "just text", vs. a generic HTML parser which is aware of the complete vocabulary.

In PHP there's an excellent lexing tool in SimpleTest which uses "parallel regular expressions" - http://cvs.sourceforge.net/viewcvs.py/simpletest/simpletest/parser.php?view=markup which would suit the job. I don't know so well what's available for lexing in Python, but it seems like SimpleParse might do the job.

# Harry Fuecks

That's kind of what htmlfill does -- it lets HTMLParser parse the tags, but it just echoes out all the parts in between the elements it cares about. There's a problem with it eating newlines, but otherwise it seems to work fine. BeautifulSoup is another HTML parser that works on a fairly low level.

# Ian Bicking

Sorry - explained myself badly.

Was referring to the process of lexing the raw text in the first place. Rather than using characters like > and < to find tokens, as is common in most HTML parsers (HTMLParser and sgmllib seem to do this), look for specific tags by name while treating all else as uninteresting plain text, although it may contain HTML tags we're not interested in. In this case it might amount to some fairly simple regular expressions.
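A rough sketch of that idea in Python, using only the re module. The pattern and the split_inputs name are hypothetical, and the pattern is deliberately naive: it would break on a > inside an attribute value.

```python
import re

# Hypothetical pattern: only <input ...> tags count as tokens.
# Everything else, including other tags, is passed through as plain text.
INPUT_TAG = re.compile(r'<input\b[^>]*>', re.IGNORECASE)


def split_inputs(html):
    """Yield ('text', s) and ('input', s) pieces that join back into html."""
    pos = 0
    for match in INPUT_TAG.finditer(html):
        if match.start() > pos:
            yield 'text', html[pos:match.start()]
        yield 'input', match.group(0)
        pos = match.end()
    if pos < len(html):
        yield 'text', html[pos:]


page = '<p>Name:\n  <input type="text" name="who">\n</p>'
pieces = list(split_inputs(page))
```

Since the text pieces are never reparsed, whitespace and newlines come back out exactly as they went in.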

# Harry Fuecks

I thought the 'text' attribute thing with ElementTree was a little odd at first as well. However, beyond this, I think ElementTree is Pythonic and quite handy. I have written dozens of parsers and generators with ElementTree and cannot complain. Also, I do use http://dustman.net/andy/python/HyperText which reminds me a bit of what you're doing here.

# Brian Ray