What I've Been Up To
I'm off for a bit of vacation, but before then I thought I'd throw up
a bunch of links to stuff I've been doing in the last couple months:
I've been doing a lot with HTML lately. HTML is great, because it's
important. (Hmm... this statement sounds kind of weird. Because HTML is important,
if you don't understand why it is great you need to think harder. Now the statement
sounds weird, but in a deep way.)
- I've started working a bit with microformats. This started
with an OLPC project, where I wrote an
(hReview is a
- Then related to that, I wanted to download a bunch of pages in an
isolated form, so all the internal links are fixed up. This became
then I've added a script frontend. I think this could be a
generally handy export system. It doesn't just make links relative,
it also lets you completely remap filenames and is fairly careful
about links (including links in CSS).
- Doing all this HTML-related stuff, I've been using lxml a lot (and you can infer that I
therefore also like libxml2 a lot). I like
lxml a lot; probably the biggest reason is that it has a fast,
permissive HTML parser. But the ElementTree API that lxml uses can
be a bit awkward for HTML -- it works well for XML, but in HTML
there's lots of mixing of text and tags, and that feels really weird
in the ElementTree API. So I've started lxml.html in a branch with lots
of feedback from Stefan Behnel. This adds methods that are useful
in HTML. With just a few smallish methods, I think it makes HTML
much easier to work with in lxml.
- Though BeautifulSoup is nice, it's not
a C-based HTML parser. With Deliverance we are literally
parsing and rewriting every page as it comes through our site. I
couldn't do this with BeautifulSoup. And we're not doing any crappy
regex stuff; lxml lets us actually understand the HTML the way the
browser understands the HTML. That's important. This is kind of
the promise of XHTML -- the ability to create smart and useful
intermediaries -- that XHTML never really fulfilled. The idea of
smart intermediaries is a powerful one; XHTML just wasn't the right
basis. Intermediaries are supposed to work with real endpoints,
and those endpoints were and continue to be HTML, not XHTML.
html5lib is cool too as an
HTML parser, and important as a reference implementation, but I'll
mostly be waiting for that work to filter back into something like libxml2.
- I should note that I think the stream model of handling HTML (as
is lame. Typical HTML is not very big, and a document model (where
the whole document is loaded into memory) is way easier to handle.
HTML is best understood as a document. (Genshi would probably be rockin' fast if it
- As part of lxml.html I've been writing an HTML cleaning module.
It's a bit messy, and incomplete, but I think it will be really
useful. Because we fully parse and then serialize the HTML, a lot
of XSS attacks are avoided really early. At work David Turner has
been working on a Transclusion intermediary called
It's a little frightening to fold together disparate pieces of HTML;
this might make it less frightening.
- I took the pattern from formencode.doctest_xml_compare and also
ported it to lxml. Now you can do semantic HTML and XML diffs in
your doctests, instead of literal string comparison. This means
that things like the order of your attributes are ignored, as is
whitespace. And the diffs help highlight problems in a more
fine-grained way. Using the html branch you can do
from lxml import usedoctest or from lxml.html import usedoctest to
enable the comparison inside a doctest. And you don't need a custom
doctest runner. I managed to do this using horrible horrible
monkeypatching. Also from FormEncode I'm hoping to port htmlfill over to lxml.html sometime
- With Luke Tucker we wrote an html diff module,
which I've also put into lxml.html. This shows textual differences
while trying to avoid markup differences (since it's really hard to
visually represent changes in markup, and also usually not what
people care about). This is surprisingly difficult because you also
need to preserve the markup -- you can't just smash all the words
together. I think it's a pretty good implementation, and there
aren't that many implementations of this out there. You can also do
blame-style annotations, showing who added what over the history of
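To make the semantic-comparison idea from the doctest item above a bit more concrete, here is a minimal sketch using only the standard library's xml.etree.ElementTree. This is not lxml's comparison code; the names xml_equivalent, _same, and _norm are made up for illustration, and it only handles well-formed XML (no permissive HTML parsing, no pretty diff output).

```python
import xml.etree.ElementTree as ET


def _norm(text):
    # Collapse runs of whitespace so "a  b\n" and "a b" compare equal;
    # None (no text at all) is treated like an empty string.
    return " ".join((text or "").split())


def _same(x, y):
    # Attributes are plain dicts, so their order never matters here.
    if x.tag != y.tag or x.attrib != y.attrib:
        return False
    if _norm(x.text) != _norm(y.text) or _norm(x.tail) != _norm(y.tail):
        return False
    if len(x) != len(y):
        return False
    return all(_same(cx, cy) for cx, cy in zip(x, y))


def xml_equivalent(a, b):
    """Hypothetical helper: True if two XML strings match, ignoring
    attribute order and whitespace differences."""
    return _same(ET.fromstring(a), ET.fromstring(b))
```

Something like xml_equivalent('<a href="/x" class="y">hi</a>', '<a class="y" href="/x">hi</a>') comes out true even though a literal string comparison would fail.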
Stuff from work (which is a little
vague, because work, private projects, HTML, OLPC, all overlap a lot).
- Deliverance is
deployed on the live site. We haven't really used its
functionality much yet. But it's awesome functionality once we get
around to using it.
- Related to that deployment, I spent some time with
paste.httpserver (trunk only) on the thread pool. It now adapts
the thread pool to add threads when necessary and kill threads that
seem to be wedged, with various tools around that. This recipe was handy, as now I
can kill threads. It's surprisingly slow, though -- it frequently
takes several minutes for the thread to be killed. But it shows up
as a regular exception, which is great. You can read about the
details in this document.
- I've been extracting some stuff into WSGIProxy for your non-browser-based
proxying. This is mostly for handling HTTP-based dispatch between
- Not really for work, but related to proxying I whipped together a
little HTTP proxy using Paste
and WSGI middleware. It kind of works. It was just a prototype, so
I haven't worked on it more, but with just a little more work I
think it should be easy to turn any WSGI middleware into an HTTP
proxy. This is the kind of proxy you configure in your browser.
- Following up on the HTML WYSIWYG post, we
decided to use Xinha. I feel
pretty good about the decision, which is primarily motivated by
Xinha's UI -- in small but important ways it feels like the best UI
among the bunch. It's just more polished and higher quality than
the other options. It's also a Real Open Source Project, which
matters to us. But we haven't actually implemented this yet.
- I wrote a total hack of doctest that lets you do
from dtopt import ELLIPSIS, and then the ELLIPSIS option (that
lets you use ... as a wildcard in output) will be enabled for
the rest of your doctest. It's ugly, but I might add a few more
similar hacks to that package and make it into something more real.
But still ugly.
- With Whit Morris I've been working on a Tagging application. It's just
getting started, so nothing working yet. I like the atom module,
which represents Atom feeds and entries as plain XML. This didn't
occur to me at first -- I was going to create a Python
representation of the Atom data model and then add parse and
serialize methods to it. This is wrong, because the Atom data model
is XML. If you add a funky custom element to your entry and I
normalize it into oblivion, then I've made a crappy library; by
making the XML the one and only source of information I avoid such
- I wrote a little proof of concept for web application help called
It's inspired by Hackety Hack (a neat
Ruby educational environment) and how tutorials are written there --
just a very simple bit of persistent help. Writing it made me
realize that S5 should
easy to write and S5 is annoying. Anyway, I need to actually try to
document something with WebClippy, because my test slides are lame,
and I've decided realistic examples are essential to good
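For anyone who hasn't used it, the ELLIPSIS option mentioned in the doctest item above is a standard doctest flag. The sketch below does not use dtopt at all; it just shows the stock way to turn the flag on for every example in a docstring. The names describe and run_with_ellipsis are hypothetical, picked only for this example.

```python
import doctest


def describe(obj):
    """Return the repr of obj, which for a bare object includes an address.

    >>> describe(object())
    '<object object at 0x...>'
    """
    return repr(obj)


def run_with_ellipsis(func):
    """Run func's docstring examples with ELLIPSIS enabled for all of them."""
    parser = doctest.DocTestParser()
    # The globals dict only needs the function itself for this example.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS, verbose=False)
    # Returns a TestResults tuple with .failed and .attempted counts.
    return runner.run(test)
```

Without the flag, the memory address in the output would make the example fail on every run; with it, the ... matches whatever address shows up.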
Stuff from One Laptop Per Child (besides the
- I went to Boston for OLPC, and we talked a lot about Web-Based
Annotation. We want there
to be a way for children to add notes (shared or private) to any web
page. There seem to be people from several directions working
towards this, outside of OLPC. Annotation is something that seems
to pop up every few years, then die down; maybe this time we can
actually make it work. Maybe Web 2.0 should just be "good ideas
that aren't new, but this time they might actually work".
- I didn't do any of it, but PyXPCOM is in the Sugar
browser now (Sugar is the OLPC UI). This excites me, except then I
looked at the code and I couldn't figure out how to get access to
any interesting XPCOM objects (the DOM in particular) -- XPCOM seems
very indirect. Oh well, at least now it's possible and I just
have to figure out the details. I hope/want/am-optimistic that OLPC
will have a great browser experience. It doesn't yet.
- We had a little OLPC sprint for Chicago people interested in the
getting a working Sugar environment was a bit of a problem.
Afterward I played around with some VMware techniques.
These need some polish, but I think it's the right direction and
it's not that hard. If someone could get this working in a scripted
way, I'm sure it could be added to the standard build process at
- I wrote WaitForIt, a little
WSGI middleware that you can put in front of slow WSGI applications.
If the application is too slow, it gives the user a page asking them
to wait. This is particularly handy for things like administrative
areas where you realize you are doing some really slow blocking
operation and you don't want it to time out but you really don't
feel like setting up a fancy queue for it. Just pop this in front
and your problem is pretty much solved. This is part of a secret
framework I'm developing.
- I started revisiting buildutils and cleaning some
stuff up. Mostly I've added some commands to make it easier to create
and add new commands for setup.py. This relates to this post. I added a
command to WebHelpers
- It's a little old now, but I added a templating language to
Paste. Using string.Template for the little internal templating
things just wasn't good. Cheetah has been too hard to install. So
this is a simple, single-file templating language based on string
substitution. I should probably extract it and do a few more things
with it. I feel a little dumb about writing Yet Another Templating
Language, but I couldn't find a templating language that I liked for
small text-related tasks. For actual web programming a more
complete language is good.
- I made a little package called CmdUtils to make it a bit
easier to build command-line scripts, with the patterns that I
- I kind of got a new encoding working. It encodes
text to spaces, newlines, and tabs. Now whitespace can be even more
significant! You can also put non-whitespace in the encoded text as
a form of steganography.
- Among my little recipes I added
making properties. I also added a little script for pasting
text to a pastebin (only for X users). And also one for creating
easy_install links from a Subversion repository.
- I wrote about my desktop for a
new blog that Steve Holden created.
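As a rough illustration of the whitespace encoding idea in the list above, here is a small sketch. The mapping is an assumption chosen for this example (space for a 0 bit, tab for a 1 bit, newline after each byte) and is not necessarily the scheme the real encoding uses; encode and decode are hypothetical names.

```python
# Assumed mapping for illustration: "0" -> space, "1" -> tab.
BITS = {"0": " ", "1": "\t"}
REVERSE = {" ": "0", "\t": "1"}


def encode(text):
    """Encode text as whitespace: 8 space/tab bits per UTF-8 byte, then a newline."""
    return "".join(
        "".join(BITS[bit] for bit in format(byte, "08b")) + "\n"
        for byte in text.encode("utf-8")
    )


def decode(data):
    """Decode output of encode(), ignoring everything except spaces and tabs."""
    bits = "".join(REVERSE[ch] for ch in data if ch in REVERSE)
    usable = len(bits) - len(bits) % 8  # drop any trailing partial byte
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return raw.decode("utf-8")
```

Because decode skips anything that isn't a space or tab, visible characters can be mixed into the encoded text, which is the steganography trick. In this sketch the visible cover text must not contain spaces or tabs of its own.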
Hmm... more stuff than I thought. Working at The Open Planning
Project is great, because I spend all my
time working on open source projects in public repositories. And also
working on Stuff That Interests Me. If you've gotten an email from me
recently, you may notice I link to our careers page in my sig, simply because I
think everyone cool should work at TOPP. We're doing other
interesting things too that I'm not directly involved with; I'll try
to write about other people's projects before too long.
After vacation I should probably start looking more into getting
WordPress on our site. TOPP is not technically tribalist! We do
accept PHP, even if we don't really trust PHP. Once that's up, I'll
move my own blog from this crappy piece of spam-ridden crap (that I
wrote, so no one but myself to blame), and then maybe I'll post more
because I'll be less embarrassed about the software.
Created 07 Jun
Modified 08 Jun
html5lib is cool too as an HTML parser, and important as a reference implementation, but I'll mostly be waiting for that work to filter back into something like libxml2.
I should stress that html5lib is not a reference implementation in the traditional sense of the word. It's the first public implementation of the HTML 5 spec and has a pretty valuable set of tests, but it has no other status at all; any disagreements with the spec are bugs in html5lib. In particular, testing implementations against html5lib output is not recommended.
There is also some work under way on a fast implementation of the HTML5 spec parsing algorithm in C; I will be first in line to make Python bindings when that work is complete (and no doubt others will come up with other bindings too).