HTML

I've been doing a lot with HTML lately. HTML is great, because it's important. (Hmm... this statement sounds kind of weird. Because HTML is important, if you don't understand why it is great you need to think harder. Now the statement sounds weird, but in a deep way)

I've started working with microformats some. This initially started with an OLPC project, where I wrote an hReview parser (hReview is a microformat).
Then related to that, I wanted to download a bunch of pages in an isolated form, so all the internal links are fixed up. This became PageCollector. Since then I've added a script frontend. I think this could be a generally handy export system. It doesn't just make links relative, it also lets you completely remap filenames and is fairly careful about links (including links in CSS).
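The link fix-up idea can be sketched with nothing but the standard library. This is not PageCollector's code; `remap_link` and the `url_to_file` mapping are hypothetical names, and the sketch assumes you already decided which local filename each URL is saved under.

```python
import posixpath
from urllib.parse import urljoin, urldefrag


def remap_link(page_url, href, url_to_file):
    """Return a relative link to the saved copy of href, or href unchanged.

    url_to_file is a hypothetical mapping from absolute URLs to the local
    filenames they were saved under.
    """
    # Resolve the link against the page it appears on, and set the
    # fragment aside so it can be re-attached afterwards.
    absolute, fragment = urldefrag(urljoin(page_url, href))
    if absolute not in url_to_file or page_url not in url_to_file:
        return href  # not part of the collection: leave it alone
    page_dir = posixpath.dirname(url_to_file[page_url]) or "."
    relative = posixpath.relpath(url_to_file[absolute], page_dir)
    return relative + ("#" + fragment if fragment else "")


# Filenames are completely remapped, not just made relative.
files = {
    "http://example.com/": "index.html",
    "http://example.com/docs/intro": "docs/intro.html",
    "http://example.com/style.css": "static/style.css",
}
css_link = remap_link("http://example.com/docs/intro", "/style.css", files)
```

Links in CSS (`url(...)` references) would need the same treatment, which this sketch does not attempt.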
Doing all this HTML-related stuff, I've been using lxml a lot (from which you can infer that I also like libxml2 a lot). I like lxml a lot; probably the biggest reason is that it has a fast, permissive HTML parser. But the ElementTree API that lxml uses can be a bit awkward for HTML -- it works well for XML, but in HTML there's lots of mixing of text and tags, and that feels really weird in the ElementTree API. So I've started lxml.html in a branch with lots of feedback from Stefan Behnel. This adds methods that are useful in HTML. With just a few smallish methods, I think it makes HTML much easier to work with in lxml.
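To make the awkwardness concrete, here is a small illustration using the standard library's `xml.etree.ElementTree`, which has the same text/tail model. The `text_content` helper is my own name for the kind of small convenience function I mean; it is only a sketch, not the code in the branch.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Hello <b>big</b> wide <i>world</i>!</p>")

# The paragraph's own .text only covers the text before the first child.
# " wide " is stored as the .tail of <b>, and "!" as the .tail of <i>,
# so the paragraph's words are scattered across three different elements.
before_first_child = p.text
after_bold = p[0].tail
after_italic = p[1].tail


def text_content(element):
    """Collect all the text inside an element, ignoring the markup."""
    return "".join(element.itertext())


flat = text_content(p)
```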
Though BeautifulSoup is nice, it's not a C-based HTML parser. With Deliverance we are literally parsing and rewriting every page as it comes through our site. I couldn't do this with BeautifulSoup. And we're not doing any crappy regex stuff; lxml lets us actually understand the HTML the way the browser understands the HTML. That's important. This is kind of the promise of XHTML -- the ability to create smart and useful intermediaries -- that XHTML never really fulfilled. The idea of smart intermediaries is a powerful one; XHTML just wasn't the right basis. Intermediaries are supposed to work with real endpoints, and those endpoints were and continue to be HTML, not XHTML. html5lib is cool too as an HTML parser, and important as a reference implementation, but I'll mostly be waiting for that work to filter back into something like libxml2.
I should note that I think the stream model of handling HTML (as HTMLParser uses) is lame. Typical HTML is not very big, and a document model (where the whole document is loaded into memory) is way easier to handle. HTML is best understood as a document. (Genshi would probably be rockin' fast if it used lxml)
As part of lxml.html I've been writing an HTML cleaning module. It's a bit messy, and incomplete, but I think it will be really useful. Because we fully parse and then serialize the HTML, a lot of XSS attacks are avoided really early. At work David Turner has been working on a Transclusion intermediary called Transcluder. It's a little frightening to fold together disparate pieces of HTML; this might make it less frightening.
I took the pattern from formencode.doctest_xml_compare and also ported it to lxml. Now you can do semantic HTML and XML diffs in your doctests, instead of literal string comparison. This means that things like the order of your attributes are ignored, as is whitespace. And the diffs help highlight problems in a more fine-grained way. Using the html branch you can do from lxml import usedoctest or from lxml.html import usedoctest to enable the comparison inside a doctest. And you don't need a custom doctest runner. I managed to do this using horrible horrible monkeypatching. Also from FormEncode I'm hoping to port htmlfill over to lxml.html sometime too.
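The comparison idea itself is simple. Below is a minimal sketch of it using the standard library's XML parser; it is not the usedoctest implementation, it only handles well-formed markup, and `same_markup` and `norm` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def norm(text):
    """Collapse runs of whitespace so layout differences don't matter."""
    return " ".join((text or "").split())


def same_markup(a, b):
    """Compare two documents by structure instead of by literal string."""
    if isinstance(a, str):
        a = ET.fromstring(a)
    if isinstance(b, str):
        b = ET.fromstring(b)
    return (
        a.tag == b.tag
        and a.attrib == b.attrib  # dict comparison: attribute order is ignored
        and norm(a.text) == norm(b.text)
        and norm(a.tail) == norm(b.tail)
        and len(a) == len(b)
        and all(same_markup(x, y) for x, y in zip(a, b))
    )


matches = same_markup(
    '<a href="/x" class="nav">Home</a>',
    '<a class="nav"   href="/x">\n  Home\n</a>',
)
```

A real version also has to report where the two trees differ, which is what makes the doctest failures readable.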
With Luke Tucker we wrote an html diff module, which I've also put into lxml.html. This shows textual differences while trying to avoid markup differences (since it's really hard to visually represent changes in markup, and also usually not what people care about). This is surprisingly difficult because you also need to preserve the markup -- you can't just smash all the words together. I think it's a pretty good implementation, and there aren't that many implementations of this out there. You can also do blame style annotations, showing who added what over the history of the page.
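A much-simplified sketch of the approach, assuming `difflib` from the standard library: split the page into tags and words, diff the token lists, and only ever wrap words in `<ins>` and `<del>`. The real module is considerably more careful; `word_diff` and `mark` are hypothetical names, and this version normalizes whitespace as a side effect.

```python
import difflib
import re

# A token is either a whole tag or a run of non-space, non-tag characters.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def mark(tokens, name, keep_tags):
    """Wrap each run of words in <name>...</name>; never wrap a tag."""
    out, words = [], []

    def flush():
        if words:
            out.append("<%s>%s</%s>" % (name, " ".join(words), name))
            del words[:]

    for token in tokens:
        if token.startswith("<"):
            flush()
            if keep_tags:
                out.append(token)
        else:
            words.append(token)
    flush()
    return out


def word_diff(old, new):
    a, b = TOKEN.findall(old), TOKEN.findall(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(b[j1:j2])
        else:
            # Old words are shown as deleted but old tags are dropped, so
            # the result keeps the markup of the new version only.
            out.extend(mark(a[i1:i2], "del", keep_tags=False))
            out.extend(mark(b[j1:j2], "ins", keep_tags=True))
    return " ".join(out)


diff = word_diff("<p>Hello big world</p>", "<p>Hello small world</p>")
```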

Open Plans

Stuff from work (which is a little vague, because work, private projects, HTML, OLPC, all overlap a lot).

Deliverance is deployed on the live site. We haven't really used its functionality much yet. But it's awesome functionality once we get around to using it.
Related to that deployment, I spent some time with paste.httpserver (trunk only) on the thread pool. It now adapts the thread pool to add threads when necessary, kill threads that seem to be wedged, and various tools around that. This recipe was handy, as now I can kill threads. It's surprisingly slow, though -- it frequently takes several minutes for the thread to be killed. But it shows up as a regular exception, which is great. You can read about the details in this document.
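For the curious, the general technique for killing a thread in CPython looks roughly like this. It is a sketch under assumptions: it relies on the CPython-specific `PyThreadState_SetAsyncExc` call through `ctypes`, and `async_raise` and `Killed` are hypothetical names, not what paste.httpserver uses. The exception is only delivered when the target thread next runs Python bytecode, which is why a thread blocked inside C code can take a long time to die.

```python
import ctypes
import threading
import time


class Killed(Exception):
    """Raised inside the thread we want to stop (hypothetical name)."""


def async_raise(thread_id, exc_type):
    """Ask CPython to raise exc_type inside the thread with this id."""
    affected = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_id), ctypes.py_object(exc_type))
    if affected == 0:
        raise ValueError("no thread with id %r" % thread_id)
    if affected > 1:
        # Something went wrong; undo the request.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(
            ctypes.c_ulong(thread_id), None)
        raise SystemError("PyThreadState_SetAsyncExc failed")


outcome = []


def worker():
    try:
        while True:
            time.sleep(0.01)  # stands in for a wedged request
    except Killed:
        # Shows up as a regular exception, so normal cleanup can run.
        outcome.append("killed")


thread = threading.Thread(target=worker, daemon=True)
thread.start()
time.sleep(0.05)
async_raise(thread.ident, Killed)
thread.join(timeout=5)
```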
I've been extracting some stuff into WSGIProxy for your non-browser-based proxying. This is mostly for handling HTTP-based dispatch between server applications.
Not really for work, but related to proxying I whipped together a little HTTP proxy using Paste and WSGI middleware. It kind of works. It was just a prototype, so I haven't worked on it more, but with just a little more work I think it should be easy to turn any WSGI middleware into an HTTP proxy. This is the kind of proxy you configure in your browser.
As a followup to the HTML WYSIWYG post, we decided to use Xinha. I feel pretty good about the decision, which is primarily motivated by Xinha's UI -- in small but important ways it feels like the best UI among the bunch. It's just more polished and higher quality than the other options. It's also a Real Open Source Project, which matters to us. But we haven't actually implemented this yet.
I wrote a total hack of doctest that lets you do from dtopt import ELLIPSIS, and then the ELLIPSIS option (that lets you use ... as a wildcard in output) will be enabled for the rest of your doctest. It's ugly, but I might add a few more similar hacks to that package and make it into something more real. But still ugly.
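If you haven't used the ELLIPSIS option, this is what it changes. The sketch below uses only the standard `doctest` module and calls its output checker directly; it shows the option itself, not how dtopt switches it on.

```python
import doctest

checker = doctest.OutputChecker()

# What the doctest says the output should be, and what actually came out.
want = "<Thing object at 0x...>\n"
got = "<Thing object at 0x7f3a2c10>\n"

# Without the option, "..." is compared literally and the check fails.
strict = checker.check_output(want, got, 0)

# With ELLIPSIS, "..." matches any substring of the actual output.
loose = checker.check_output(want, got, doctest.ELLIPSIS)
```

Without a hack like dtopt, the usual ways to get this are a `# doctest: +ELLIPSIS` directive on each example or passing `optionflags=doctest.ELLIPSIS` to the test runner.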
With Whit Morris I've been working on a Tagging application. It's just getting started, so nothing working yet. I like the atom module, which represents Atom feeds and entries as plain XML. This didn't occur to me at first -- I was going to create a Python representation of the Atom data model and then add parse and serialize methods to it. This is wrong, because the Atom data model is XML. If you add a funky custom element to your entry and I normalize it into oblivion, then I've made a crappy library; by making the XML the one and only source of information I avoid such things.
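Here is a small sketch of what "the XML is the data model" means in practice, using the standard library's ElementTree. It is not the atom module; the `Entry` class and the `x:funky` element are made up for illustration. The wrapper only reads from and writes to the parsed tree, so anything it doesn't know about survives untouched.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


class Entry:
    """A thin view over an Atom entry; the element tree is the only state."""

    def __init__(self, xml_text):
        self.element = ET.fromstring(xml_text)

    @property
    def title(self):
        return self.element.findtext("{%s}title" % ATOM)

    @title.setter
    def title(self, value):
        el = self.element.find("{%s}title" % ATOM)
        if el is None:
            el = ET.SubElement(self.element, "{%s}title" % ATOM)
        el.text = value

    def serialize(self):
        return ET.tostring(self.element, encoding="unicode")


entry = Entry(
    '<entry xmlns="http://www.w3.org/2005/Atom" xmlns:x="http://example.com/ns">'
    '<title>Old</title><x:funky level="3">custom</x:funky></entry>'
)
entry.title = "New"
output = entry.serialize()  # the funky element is still in here
```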
I wrote a little proof of concept for web application help called WebClippy (svn here). It's inspired by Hackety Hack (a neat Ruby educational environment) and how tutorials are written there -- just a very simple bit of persistent help. Writing it made me realize that S5 should totally be rewritten, because JavaScript-based slideshow scripts are easy to write and S5 is annoying. Anyway, I need to actually try to document something with WebClippy, because my test slides are lame, and I've decided realistic examples are essential to good development.

OLPC

Stuff from One Laptop Per Child (besides the HTML experiments)...

I went to Boston for OLPC, and we talked a lot about Web-Based Annotation. We want there to be a way for children to add notes (shared or private) to any web page. There seem to be people from several directions working towards this, outside of OLPC. Annotation is something that seems to pop up every few years, then die down; maybe this time we can actually make it work. Maybe Web 2.0 should just be "good ideas that aren't new, but this time they might actually work".
I didn't do any of it, but PyXPCOM is in the Sugar browser now (Sugar is the OLPC UI). This excites me, except then I looked at the code and I couldn't figure out how to get access to any interesting XPCOM objects (the DOM in particular) -- XPCOM seems very indirect. Oh well, at least now it's possible and I just have to figure out the details. I hope/want/am-optimistic that OLPC will have a great browser experience. It doesn't yet.
We had a little OLPC sprint for Chicago people interested in the project. Again, getting a working Sugar environment was a bit of a problem. Afterward I played around with some VMware techniques. These need some polish, but I think it's the right direction and it's not that hard. If someone could get this working in a scripted way, I'm sure it could be added to the standard build process at OLPC.

Miscellaneous

I wrote WaitForIt, a little WSGI middleware that you can put in front of slow WSGI applications. If the application is too slow, it gives the user a page asking them to wait. This is particularly handy for things like administrative areas where you realize you are doing some really slow blocking operation and you don't want it to time out but you really don't feel like setting up a fancy queue for it. Just pop this in front and your problem is pretty much solved. This is part of a secret framework I'm developing.
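The basic trick can be sketched in a few lines: run the wrapped application in a thread, and if it hasn't finished within a time limit, send back a self-refreshing "please wait" page instead. This is a simplified illustration, not WaitForIt's code; `WaitMiddleware`, the `time_limit` parameter, and keying pending requests by path are all assumptions made for the example.

```python
import threading
import time


class WaitMiddleware:
    def __init__(self, app, time_limit=0.1):
        self.app = app
        self.time_limit = time_limit
        self.pending = {}  # path -> (thread, captured response)

    def __call__(self, environ, start_response):
        key = environ.get("PATH_INFO", "/")
        if key not in self.pending:
            self.pending[key] = self._start(environ)
        thread, result = self.pending[key]
        thread.join(self.time_limit)
        if thread.is_alive():
            # Still working: ask the browser to check back in a second.
            start_response("200 OK", [("Content-Type", "text/html")])
            return [b'<meta http-equiv="refresh" content="1">'
                    b"<p>Please wait...</p>"]
        del self.pending[key]
        start_response(result["status"], result["headers"])
        return [result["body"]]

    def _start(self, environ):
        result = {}

        def capture(status, headers, exc_info=None):
            result["status"], result["headers"] = status, headers

        def run():
            result["body"] = b"".join(self.app(dict(environ), capture))

        thread = threading.Thread(target=run, daemon=True)
        thread.start()
        return thread, result


def slow_app(environ, start_response):
    time.sleep(0.3)  # the really slow blocking operation
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done"]


statuses = []


def record(status, headers, exc_info=None):
    statuses.append(status)


wrapped = WaitMiddleware(slow_app, time_limit=0.05)
first = wrapped({"PATH_INFO": "/report"}, record)   # too slow: wait page
time.sleep(0.5)
second = wrapped({"PATH_INFO": "/report"}, record)  # finished by now
```

A real version has to tell different users and different requests to the same path apart, and deal with errors in the background thread.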
I started revisiting buildutils and cleaning some stuff up. Mostly I've added some commands to make it easier to add and make new commands for setup.py. This relates to this post. I added a command to WebHelpers for JavaScript compression.
It's a little old now, but I added a templating language to Paste. Using string.Template for the little internal templating things just wasn't good. Cheetah has been too hard to install. So this is a simple, single-file templating language based on string substitution. I should probably extract it and do a few more things with it. I feel a little dumb about writing Yet Another Templating Language, but I couldn't find a templating language that I liked for small text-related tasks. For actual web programming a more complete language is good.
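To give a sense of how little code a substitution-only template needs, here is a toy version. The `{{name}}` syntax and the `sub` function are assumptions for the sake of the example, not a description of the language that went into Paste.

```python
import re

# Matches {{name}} with optional spaces inside the braces.
_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def sub(template, **values):
    """Replace each {{name}} with str(values[name])."""
    def replace(match):
        # A missing name raises KeyError instead of silently rendering
        # an empty string.
        return str(values[match.group(1)])
    return _VARIABLE.sub(replace, template)


greeting = sub("Hello {{ name }}, you have {{count}} items", name="Ian", count=3)
```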
I made a little package called CmdUtils to make it a bit easier to build command-line scripts, with the patterns that I personally like.
I kind of got a new encoding working. It encodes text to spaces, newlines, and tabs. Now whitespace can be even more significant! You can also put non-whitespace in the encoded text as a form of steganography.
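One way such an encoding could work, as a sketch: turn each byte into eight bits, write 0 as a space and 1 as a tab, and put a newline between bytes. This scheme is my assumption for illustration and may not match the actual encoding. Because the decoder ignores everything except spaces and tabs, other characters can be mixed in freely.

```python
def encode(text):
    """Encode text as spaces (0 bits), tabs (1 bits) and newlines."""
    rows = []
    for byte in text.encode("utf-8"):
        bits = format(byte, "08b")
        rows.append(bits.replace("0", " ").replace("1", "\t"))
    return "\n".join(rows)  # one byte per line


def decode(blob):
    """Recover the text, ignoring every character that isn't a space or tab."""
    bits = "".join("0" if ch == " " else "1" for ch in blob if ch in " \t")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


hidden = encode("hi")

# Steganography: interleave visible, non-whitespace characters.
cover = "".join(ch + "x" for ch in hidden)
```

The cover characters must not contain spaces or tabs of their own, or the decoder would count them as data.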
Among my little recipes I added propertyclass for making properties. I also added a little script for pasting text to a pastebin (only for X users). And also one for creating easy_install links from a subversion repository.
I wrote about my desktop for a new blog that Steve Holden created.

Hmm... more stuff than I thought. Working at The Open Planning Project is great, because I spend all my time working on open source projects in public repositories. And also working on Stuff That Interests Me. If you've gotten an email from me recently, you may notice I link to our careers page in my sig, simply because I think everyone cool should work at TOPP. We're doing other interesting things too that I'm not directly involved with; I'll try to write about other people's projects before too long.

After vacation I should probably start looking more into getting WordPress on our site. TOPP is not technically tribalist! We do accept PHP, even if we don't really trust PHP. Once that's up, I'll move my own blog from this crappy piece of spam-ridden crap (that I wrote, so I have no one but myself to blame), and then maybe I'll post more because I'll be less embarrassed about the software.

Ian Bicking: the old part of his blog

What I've Been Up To
