I'm off for a bit of vacation, but before then I thought I'd
throw up a bunch of links to stuff I've been doing in the last
couple months:
I've been doing a lot with HTML lately. HTML is great,
because it's important. (Hmm... this statement sounds kind
of weird. Because HTML is important, if you don't
understand why it is great you need to think harder. Now the
statement sounds weird, but in a deep way.)
-
I've started working with microformats some. This
initially started with an
OLPC project, where I wrote an
hReview parser
(hReview
is a microformat).
-
Then related to that, I wanted to download a bunch of
pages in an isolated form, so all the internal links are
fixed up. This became
PageCollector. Since then I've added a script frontend. I think this
could be a generally handy export system. It doesn't
just make links relative, it also lets you completely
remap filenames and is fairly careful about links
(including links in CSS).
-
Doing all this HTML-related stuff, I've been using
lxml
a lot (and it should be inferred that I therefore also like
libxml2
a lot). I like lxml a lot; probably the biggest reason
is that it has a fast permissive HTML parser. But the
ElementTree API that lxml uses can be a bit awkward for
HTML -- it works well for XML, but in HTML there's lots
of mixing of text and tags, and that feels really weird
in the ElementTree API. So I've started
lxml.html
in a branch with lots of feedback from Stefan Behnel.
This adds methods that are useful in HTML. With just a
few smallish methods, I think it makes HTML much easier
to work with in lxml.
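To make the awkwardness concrete, here is a small sketch using the standard library's xml.etree.ElementTree, which has the same API shape (it is not lxml, and the sample markup and the text_content helper name are made up for illustration):

```python
import xml.etree.ElementTree as ET

# A made-up fragment with text mixed between tags, as is common in HTML.
p = ET.fromstring("<p>Hello <b>world</b>, how are <i>you</i>?</p>")
b = p.find("b")
i = p.find("i")

# The running text ends up scattered over .text and .tail attributes.
print(repr(p.text))   # text before the first child
print(repr(b.text))   # text inside <b>
print(repr(b.tail))   # text after </b>, stored on <b> rather than on <p>
print(repr(i.tail))   # text after </i>, stored on <i>


def text_content(el):
    """Return all text inside el, ignoring markup (illustrative helper)."""
    return "".join(el.itertext())


print(text_content(p))
```

A few small helpers of this kind are what make mixed content bearable; the exact set of methods in the branch is not shown here.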
-
Though
BeautifulSoup
is nice, it's not a C-based HTML parser. With
Deliverance
we are literally parsing and rewriting every page as it
comes through our site. I couldn't do this with
BeautifulSoup. And we're not doing any crappy regex
stuff; lxml lets us actually understand the HTML the way
the browser understands the HTML. That's important. This
is kind of the promise of XHTML -- the ability to create
smart and useful intermediaries -- that XHTML never
really fulfilled. The idea of smart intermediaries is a
powerful one, but XHTML just wasn't the right basis:
intermediaries are supposed to work with
real endpoints, and those endpoints were and
continue to be HTML, not XHTML.
html5lib
is cool too as an HTML parser, and important as a
reference implementation, but I'll mostly be waiting for
that work to filter back into something like libxml2.
-
I should note that I think the stream model of handling
HTML (as
HTMLParser
uses) is lame. Typical HTML is not very big, and a
document model (where the whole document is loaded into
memory) is way easier to handle. HTML is best understood
as a document. (Genshi
would probably be rockin' fast if it used lxml)
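For a feel of what the stream model costs you, here is a rough sketch with the standard library's html.parser (the modern home of HTMLParser). The LinkCollector class and the sample markup are invented for this example. Even something as simple as "give me each link and its text" means tracking state by hand across events:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from a stream of parser events."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text chunks seen since that <a> opened

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


collector = LinkCollector()
collector.feed(
    '<p>See <a href="/docs">the <b>docs</b></a> and <a href="/faq">FAQ</a>.</p>'
)
print(collector.links)
```

With a document model the same job is a loop over the link elements, with no bookkeeping about where in the stream you are.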
-
As part of lxml.html I've been writing an
HTML cleaning module. It's a bit messy, and incomplete, but I think it will
be really useful. Because we fully parse and then
serialize the HTML, a lot of XSS attacks are avoided
really early. At work David Turner has been working on a
Transclusion
intermediary called
Transcluder. It's a little frightening to fold together disparate
pieces of HTML; this might make it less frightening.
-
I took the pattern from
formencode.doctest_xml_compare
and also ported it to lxml. Now you can do
semantic HTML and XML diffs in your doctests,
instead of literal string comparison. This means that
things like the order of your attributes are ignored, as
is whitespace. And the diffs help highlight problems in
a more fine-grained way. Using the html branch you can
do
    from lxml import usedoctest
or
    from lxml.html import usedoctest
to enable the comparison inside a doctest. And you don't
need a custom doctest runner. I managed to do this using
horrible horrible monkeypatching. Also from FormEncode
I'm hoping to port
htmlfill
over to lxml.html sometime too.
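The comparison idea itself is simple. This is only a sketch of it with the standard library's ElementTree, not how the lxml hook is actually wired into doctest; xml_equal and norm are names made up here:

```python
import xml.etree.ElementTree as ET


def norm(text):
    """Collapse runs of whitespace so formatting differences are ignored."""
    return " ".join((text or "").split())


def xml_equal(a, b):
    """Compare two elements by structure rather than by literal string."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib   # dict comparison: attribute order is irrelevant
        and norm(a.text) == norm(b.text)
        and norm(a.tail) == norm(b.tail)
        and len(a) == len(b)
        and all(xml_equal(x, y) for x, y in zip(a, b))
    )


want = ET.fromstring('<a href="/x" class="nav">Home</a>')
got = ET.fromstring('<a class="nav"   href="/x">\n  Home\n</a>')
other = ET.fromstring('<a href="/y" class="nav">Home</a>')

print(xml_equal(want, got))     # same link, different attribute order and whitespace
print(xml_equal(want, other))   # different href
```

A literal string comparison would reject the first pair; the structural one only complains about the second.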
-
With Luke Tucker we wrote an
html diff module, which I've also put into lxml.html. This shows
textual differences while trying to avoid markup
differences (since it's really hard to visually
represent changes in markup, and also usually not what
people care about). This is surprisingly difficult
because you also need to preserve the markup --
you can't just smash all the words together. I think
it's a pretty good implementation, and there aren't that
many implementations of this out there. You can also do
blame style annotations, showing who added what over the
history of the page.
Stuff from
work
(which is a little vague, because work, private projects,
HTML, OLPC, all overlap a lot).
-
Deliverance
is deployed on the live site. We haven't really
used its functionality much yet. But it's
awesome functionality once we get around to using it.
-
Related to that deployment, I spent some time with
paste.httpserver
(trunk only) on the thread pool. It now adapts the
thread pool to add threads when necessary and kill threads
that seem to be wedged, with various tools around that.
This recipe
was handy, as now I can kill threads. It's surprisingly
slow, though -- it frequently takes several minutes for
the thread to be killed. But it shows up as a regular
exception, which is great. You can read about the
details in
this document.
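The linked recipe isn't reproduced here, but one well-known way to do this in CPython is the C-level PyThreadState_SetAsyncExc call, reached through ctypes. The sketch below assumes that approach; KillThread, async_raise and worker are names invented for the example:

```python
import ctypes
import threading
import time


class KillThread(Exception):
    """Hypothetical exception type raised inside the target thread."""


def async_raise(thread, exc_type):
    """Ask CPython to raise exc_type in another thread."""
    count = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread.ident), ctypes.py_object(exc_type)
    )
    if count != 1:
        raise RuntimeError("could not find exactly one thread to interrupt")


result = []


def worker():
    try:
        while True:
            time.sleep(0.01)   # stand-in for a wedged request
    except KillThread:
        # The kill arrives as an ordinary exception, so normal cleanup can run.
        result.append("killed")


t = threading.Thread(target=worker, daemon=True)
t.start()
time.sleep(0.05)
async_raise(t, KillThread)
t.join(timeout=5)
print(result, t.is_alive())
```

The exception is only delivered when the target thread next runs Python bytecode, so a thread blocked inside a long C call will not see it until that call returns. That may be part of why killing can take a while.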
-
I've been extracting some stuff into
WSGIProxy
for your non-browser-based proxying. This is mostly for
handling HTTP-based dispatch between server
applications.
-
Not really for work, but related to proxying I whipped
together a
little HTTP proxy
using Paste and WSGI middleware. It kind of works. It
was just a prototype, so I haven't worked on it more,
but with just a little more work I think it
should be easy to turn any WSGI middleware into an HTTP
proxy. This is the kind of proxy you configure in your
browser.
-
As a followup to the
HTML WYSIWYG post, we decided to use
Xinha. I feel pretty good about the decision, which is
primarily motivated by Xinha's UI -- in small but
important ways it feels like the best UI among the
bunch. It's just more polished and higher quality than
the other options. It's also a Real Open Source Project,
which matters to us. But we haven't actually implemented
this yet.
-
I wrote a
total hack of doctest
that lets you do
    from dtopt import ELLIPSIS
and then the ELLIPSIS option (which lets you use ... as a
wildcard in output) will be enabled for the rest of
your doctest. It's ugly, but I might add a few more
similar hacks to that package and make it into something
more real. But still ugly.
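For reference, this is what the ELLIPSIS option changes, shown with stock doctest only (no dtopt involved). EXAMPLE and count_failures are names made up for the sketch:

```python
import doctest

# The address in the repr differs on every run, so a literal match cannot work.
EXAMPLE = """
>>> object()
<object object at 0x...>
"""


def count_failures(optionflags):
    """Run EXAMPLE with the given doctest flags; return the number of failures."""
    test = doctest.DocTestParser().get_doctest(EXAMPLE, {}, "example", None, 0)
    runner = doctest.DocTestRunner(optionflags=optionflags, verbose=False)
    results = runner.run(test, out=lambda text: None)   # swallow failure reports
    return results.failed


print(count_failures(0))                 # "..." is treated as literal text
print(count_failures(doctest.ELLIPSIS))  # "..." matches any substring
```

Without a hack like the one above, the stock ways to get this are passing optionflags to the runner, as here, or adding a "# doctest: +ELLIPSIS" directive to each example that needs it.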
-
With Whit Morris I've been working on a
Tagging application. It's just getting started, so nothing is working yet. I
like the
atom module, which represents Atom feeds and entries as plain XML.
This didn't occur to me at first -- I was going to
create a Python representation of the Atom data model
and then add parse and serialize methods to it. This is
wrong, because the Atom data model is XML. If
you add a funky custom element to your entry and I
normalize it into oblivion, then I've made a crappy
library; by making the XML the one and only source of
information I avoid such things.
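A minimal sketch of that idea, using the standard library's ElementTree rather than the atom module itself (whose API is not shown here). The Entry class, the x:funky element and the example.com namespace are all made up:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


class Entry:
    """Thin view over an Atom <entry> element; the element is the only state."""

    def __init__(self, element):
        self.element = element

    @property
    def title(self):
        return self.element.findtext("{%s}title" % ATOM)

    @title.setter
    def title(self, value):
        node = self.element.find("{%s}title" % ATOM)
        if node is None:
            node = ET.SubElement(self.element, "{%s}title" % ATOM)
        node.text = value

    def serialize(self):
        return ET.tostring(self.element, encoding="unicode")


source = (
    '<entry xmlns="http://www.w3.org/2005/Atom" xmlns:x="http://example.com/ns">'
    "<title>Old title</title><x:funky>keep me</x:funky></entry>"
)
entry = Entry(ET.fromstring(source))
entry.title = "New title"

# The custom element was never modelled by Entry, yet it survives the round trip.
reparsed = ET.fromstring(entry.serialize())
print(reparsed.findtext("{http://example.com/ns}funky"))
```

Because the properties read and write the tree directly, there is no separate Python data model that could drop elements it does not know about.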
-
I wrote a
little proof of concept for web application help
called WebClippy
(svn
here). It's inspired by
Hackety Hack
(a neat Ruby educational environment) and how tutorials
are written there -- just a very simple bit of
persistent help. Writing it made me realize that
S5
should totally be rewritten, because Javascript-based
slideshow scripts are easy to write and S5 is annoying.
Anyway, I need to actually try to document something
with WebClippy, because my test slides are lame, and
I've decided realistic examples are essential to good
development.
Stuff from
One Laptop Per Child
(besides the HTML experiments)...
-
I went to Boston for OLPC, and we talked a lot about
Web-Based Annotation. We want there to be a way for children to add notes
(shared or private) to any web page. There seem to be
people from several directions working towards this,
outside of OLPC. Annotation is something that seems to
pop up every few years, then die down; maybe this time
we can actually make it work. Maybe Web 2.0 should just
be "good ideas that aren't new, but this time they
might actually work".
-
I didn't do any of it, but
PyXPCOM
is in the Sugar browser now (Sugar is the OLPC UI). This
excites me, except then I looked at the code and I
couldn't figure out how to get access to any interesting
XPCOM objects (the DOM in particular) -- XPCOM seems
very indirect. Oh well, at least now it's
possible and I just have to figure out the
details. I hope/want/am-optimistic that OLPC will have a
great browser experience. It doesn't yet.
-
We had a little OLPC sprint for
Chicago people interested in the project. Again, getting a working Sugar environment was a bit
of a problem. Afterward I
played around with some VMWare techniques. These need some polish, but I think it's the right
direction and it's not that hard. If someone could get
this working in a scripted way, I'm sure it could be
added to the standard build process at OLPC.
-
I wrote
WaitForIt, a little WSGI middleware that you can put in front of
slow WSGI applications. If the application is too slow,
it gives the user a page asking them to wait. This is
particularly handy for things like administrative areas
where you realize you are doing some really slow
blocking operation and you don't want it to time out but
you really don't feel like setting up a fancy queue for
it. Just pop this in front and your problem is pretty
much solved. This is part of a secret framework I'm
developing.
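The general shape of such a middleware can be sketched in a few dozen lines. This is not WaitForIt's actual code: WaitMiddleware, the waitjob query parameter and the two-second refresh are assumptions made for illustration, and error handling is left out (an exception in the wrapped app is not reported).

```python
import itertools
import threading
import time
from urllib.parse import parse_qs


class WaitMiddleware:
    """Serve a 'please wait' page when the wrapped WSGI app is slow."""

    def __init__(self, app, time_limit=1.0):
        self.app = app
        self.time_limit = time_limit
        self.jobs = {}                  # job id -> (thread, result dict)
        self.ids = itertools.count(1)

    def __call__(self, environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        job_id = query.get("waitjob", [None])[0]
        if job_id in self.jobs:
            # A returning visitor: look up the job started earlier.
            thread, result = self.jobs[job_id]
        else:
            job_id = str(next(self.ids))
            result = {}
            thread = threading.Thread(
                target=self._run, args=(dict(environ), result), daemon=True
            )
            self.jobs[job_id] = (thread, result)
            thread.start()
        thread.join(self.time_limit)
        if thread.is_alive():
            # Still running: ask the browser to check back shortly.
            page = (
                '<html><head><meta http-equiv="refresh" '
                'content="2; url=?waitjob=%s"></head>'
                "<body>Still working, please wait...</body></html>" % job_id
            )
            start_response("200 OK", [("Content-Type", "text/html")])
            return [page.encode("utf-8")]
        del self.jobs[job_id]
        start_response(result["status"], result["headers"])
        return [result["body"]]

    def _run(self, environ, result):
        """Run the wrapped app to completion and buffer its whole response."""
        def capture(status, headers, exc_info=None):
            result["status"], result["headers"] = status, headers
        result["body"] = b"".join(self.app(environ, capture))


def slow_app(environ, start_response):
    time.sleep(0.3)   # stand-in for a slow blocking operation
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done"]


statuses = []


def record(status, headers, exc_info=None):
    statuses.append(status)


wrapped = WaitMiddleware(slow_app, time_limit=0.05)
first = b"".join(wrapped({"QUERY_STRING": ""}, record))
time.sleep(0.5)
second = b"".join(wrapped({"QUERY_STRING": "waitjob=1"}, record))
print(first)
print(second)
```

The slow request keeps running in its own thread, so the browser's reload picks up the finished response instead of starting the work again.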
-
I started
revisiting buildutils
and cleaning some stuff up. Mostly I've added some
commands to make it easier to add and make new commands
for
setup.py. This relates to
this post. I added a
command to WebHelpers
for Javascript compression.
-
It's a little old now, but I added a
templating language
to Paste. Using
string.Template
for the little internal templating things just wasn't
good. Cheetah has been too hard to install. So this is a
simple, single-file templating language based on string
substitution. I should probably extract it and do a few
more things with it. I feel a little dumb about writing
Yet Another Templating Language, but I couldn't find a
templating language that I liked for small text-related
tasks. For actual web programming a more complete
language is good.
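To show how little it takes to get past string.Template (which substitutes plain $names but cannot evaluate expressions), here is a toy substitution function. The {{...}} syntax and the substitute name are assumptions for this sketch, not a description of the language that went into Paste:

```python
import re

_EXPR = re.compile(r"\{\{(.*?)\}\}")


def substitute(template, **namespace):
    """Replace each {{expression}} with the str() of its evaluated value."""
    def replace(match):
        # eval() is tolerable here only because the templates are trusted,
        # internal text; never do this with user-supplied templates.
        return str(eval(match.group(1).strip(), {}, namespace))
    return _EXPR.sub(replace, template)


print(substitute(
    "Hello {{name.title()}}, you have {{len(items)}} items.",
    name="world",
    items=[1, 2, 3],
))
```

Something this size fits in a single file and needs no installation step, which is the point for small text-generation tasks.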
-
I made a
little package called CmdUtils
to make it a bit easier to build command-line scripts,
with the patterns that I personally like.
-
I kind of got a
new encoding working. It encodes text to spaces, newlines, and tabs. Now
whitespace can be even more significant! You can also
put non-whitespace in the encoded text as a form of
steganography.
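The actual scheme isn't described here, so the following is an assumed one that has the same properties: each byte is written as six base-3 digits, with space, tab and newline as the digits, and the decoder skips every other character. SYMBOLS, encode and decode are names made up for the sketch.

```python
SYMBOLS = " \t\n"        # base-3 digits 0, 1, 2
DIGITS_PER_BYTE = 6      # 3**6 = 729 >= 256, so six digits cover one byte


def encode(text):
    """Encode text as a string made only of spaces, tabs and newlines."""
    out = []
    for byte in text.encode("utf-8"):
        digits = []
        for _ in range(DIGITS_PER_BYTE):
            byte, digit = divmod(byte, 3)
            digits.append(SYMBOLS[digit])
        out.extend(reversed(digits))   # most significant digit first
    return "".join(out)


def decode(encoded):
    """Decode, ignoring every character that is not one of the three symbols."""
    digits = [SYMBOLS.index(ch) for ch in encoded if ch in SYMBOLS]
    data = bytearray()
    for start in range(0, len(digits) - DIGITS_PER_BYTE + 1, DIGITS_PER_BYTE):
        value = 0
        for digit in digits[start:start + DIGITS_PER_BYTE]:
            value = value * 3 + digit
        data.append(value)
    return data.decode("utf-8")


secret = encode("hi")
# Interleave visible, non-whitespace cover characters; the decoder skips them.
hidden = "".join(ch + "x" for ch in secret)
print(repr(secret))
print(decode(hidden))
```

The cover text must not contain spaces, tabs or newlines of its own, since those would be read as digits.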
-
Among my
little recipes
I added
propertyclass
for making properties. I also added a
little script
for pasting text to a pastebin (only for X users). And
also one for creating
easy_install links from a subversion repository.
-
I wrote about
my desktop
for a new blog that Steve Holden created.
Hmm... more stuff than I thought. Working at
The Open Planning Project
is great, because I spend all my time working on open source
projects in public repositories. And also working on Stuff
That Interests Me. If you've gotten an email from me
recently, you may notice I link to
our careers page
in my sig, simply because I think everyone cool should work
at TOPP. We're doing other interesting things too that I'm
not directly involved with; I'll try to write about other
people's projects before too long.
After vacation I should probably start looking more into
getting WordPress on our site. TOPP is not technically
tribalist! We do accept PHP, even if we don't really
trust PHP. Once that's up, I'll move my own blog
from this crappy piece of spam-ridden crap (that I wrote, so
no one but myself to blame), and then maybe I'll post more
because I'll be less embarrassed about the software.