Ian Bicking: a blog :: 2007

{ Monthly Archives }

September 2007

lxml.html

Over the summer I did quite a bit of work on lxml.html. I’m pretty excited about it, because with just a little work HTML starts to be very usefully manipulatable. This isn’t how I’ve felt about HTML in the past, with all HTML emerging from templates and consumed only by browsers.

The ElementTree representation (which lxml copies) is a bit of a nuisance when representing HTML. A few methods improve it, but it is still awkward for content with mixed tags and text (common in HTML, uncommon in most other XML). Looking at Genshi Transforms there are some things I wish we could do, like simply "unwrap" text and then wrap it again. But once you remove a tag the text is thoroughly merged into its neighbors. Another little nuisance is that el.text and el.tail can be None, which means you have to guard a lot of code.
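To make the text/tail issue concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree, which has the same representation. The `unwrap` helper is hypothetical (it is not an ElementTree or lxml API); it removes a tag while keeping its content, and shows both the `or ''` guards and how the text ends up merged into its neighbors.

```python
import xml.etree.ElementTree as ET


def unwrap(parent, child):
    """Remove child's tag from parent, keeping its text, tail, and children.

    Hypothetical helper for illustration; not part of ElementTree or lxml.
    """
    index = list(parent).index(child)
    grandchildren = list(child)

    def append_text(text):
        # Text before the first element lives in parent.text; text after
        # any other element lives in that element's .tail. Both may be None.
        if index == 0:
            parent.text = (parent.text or '') + text
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or '') + text

    append_text(child.text or '')
    if grandchildren:
        last = grandchildren[-1]
        last.tail = (last.tail or '') + (child.tail or '')
    else:
        append_text(child.tail or '')

    parent.remove(child)
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(index + offset, grandchild)


body = ET.fromstring('<body>Some <em>body</em> text.</body>')
unwrap(body, body[0])
print(ET.tostring(body, encoding='unicode'))
```

After the call, "Some ", "body" and " text." are one string in body.text, and nothing records where the em element used to be, so a later "wrap" step has nothing to hold onto.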

That said, here’s the Genshi example:

>>> html = HTML('''<html>
... <head></head>
... <body>
... Some <em>body</em> text.
... </body>
... </html>''')
>>> print html | Transformer('body/em').map(unicode.upper, TEXT) \
...     .unwrap().wrap(tag.u).end() \
...     .select('body/u') \
...     .prepend('underlined ')

Here’s how you’d do it with lxml.html:

>>> from lxml.html import fromstring
>>> html = fromstring('''... same thing ...''')
>>> def transform(doc):
...     for el in doc.xpath('body/em'):
...         el.text = (el.text or '').upper()
...         el.tag = 'u'
...     for el in doc.xpath('body/u'):
...         el.text = 'underlined ' + (el.text or '')

I’m not sure if Genshi works in-place here, or makes a copy; otherwise these are pretty much equivalent. Which is better? Personally I prefer mine, and actually prefer it quite strongly, because it’s quite simple — it’s a function with loops and assignments. It’s practically pedestrian in comparison to the Genshi example, which uses methods to declaratively create a transformer.

Some of the things now in lxml.html include:

Link handling, which is particularly focused on rewriting links so you can put HTML fragments into a new context without breaking the relative links.
Smart doctest comparisons (attribute-order-neutral comparisons, with improved diffs, and also whitespace neutral, based loosely on formencode.doctest_xml_compare). Inside your doctest choose XML parsing with from lxml import usedoctest or HTML parsing with from lxml.html import usedoctest. I consider the import trick My Worst Monkeypatch Ever, but it kind of reads nicely. For testing it is very nice.
Cleaning code, to avoid XSS attacks, in lxml.html.clean. This is still pretty messy, because there’s lots of little things you may or may not want to protect against. E.g., I think I can mostly clean out style tags (at least of Javascript), but some people might want to remove all style. So there’s an option. There’s lots of options. Too many.
With the cleaning code there’s word-wrapping code and autolinking code. I think of these as clean-up-people’s-scrappy-HTML tools. Also important for putting untrusted HTML in a new context.
I rewrote htmlfill in lxml.html.formfill. It’s a bit simpler, and keeps error messages separate from actual value filling. They were really only combined because I didn’t want to do two passes with HTMLParser for the two steps, but that doesn’t matter when you load the document into memory. I also stopped using markup like <form:error> for placing error messages; it’s all automatic now, which I suppose is both good and bad.
After I wrote lxml.html.formfill I got it into my head to make smarter forms more natively. So now you can do:

>>> from lxml.html import parse
>>> page = parse('http://tripsweb.rtachicago.com/').getroot()
>>> form = page.forms[0]
>>> from pprint import pprint
>>> pprint(form.form_values())
[('action', 'entry'),
 ('resptype', 'U'),
 ('Arr', 'D'),
 ('f_month', '09'),
 ('f_day', '21'),
 ('f_year', '2007'),
 ('f_hours', '9'),
 ('f_minutes', '30'),
 ('f_ampm', 'AM'),
 ('Atr', 'N'),
 ('walk', '0.9999'),
 ('Min', 'T'),
 ('mode', 'A')]
>>> for key in sorted(form.fields.keys()):
...     print key
None
Arr
Atr
Dest
Min
Orig
action
dCity
endpoint
f_ampm
f_day
f_hours
f_minutes
f_month
f_year
mode
oCity
resptype
startpoint
walk
>>> form.fields['Orig'] = '1500 W Leland'
>>> form.fields['Dest'] = 'LINCOLN PARK ZOO'
>>> from lxml.html import submit_form
>>> result = parse(submit_form(form)).getroot()

From there I’d have to actually scrape the results to figure out what the best trip was, which isn’t as easy.
HTML diffing and something like svn blame for a series of documents, in lxml.html.diff. Someone noted a similarity between htmldiff and templatemaker, and they are conceptually similar, but with very different purposes. htmldiff goes to great trouble to ignore markup and focus only on changes to textual content. As such it is great for a history page. templatemaker focuses on the dissection of computer-generated HTML and extracting its human-generated components. Templatemaker is focused on screen scraping. It might be handy in that form example above…
There’s also a fairly complete implementation of CSS 3 selectors. It would be interesting to mix this with cssutils.

Though some people aren’t so enthusiastic about CSS namespaces (and I can’t really blame them), conveniently this CSS 3 feature makes CSS selectors applicable to all XML. I don’t know if anyone is actually going to use them instead of XPath on non-HTML documents, but you could. Because the implementation just compiles CSS to XPath, you could potentially use this module with other XML libraries that know XPath. Of those I only actually know one (or two <http://genshi.edgewall.org/>?), though compiling CSS to XPath, then having XPath parsed and interpreted in Python, is probably not a good idea. But if you are so inclined, there’s also a parser in there you could use.
lxml and BeautifulSoup are no longer exclusive choices: lxml.html.ElementSoup.parse() can parse pages with BeautifulSoup into lxml data structures. While the native lxml/libxml2 HTML parser works on pretty bad HTML, BeautifulSoup works on really bad HTML. It would be nice to have something similar with html5lib.
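The "compile CSS to XPath" idea from the selector item above can be shown with a toy translator. This is only a sketch under loud assumptions: `css_to_xpath` is a made-up function, not lxml's implementation, and it handles just tag names, #id, and the descendant combinator. It runs the result through the standard library's ElementTree, whose XPath support is limited (class selectors, for one, are left out because they need XPath functions ElementTree lacks).

```python
import re
import xml.etree.ElementTree as ET

# One compound selector: an optional tag name (or *), then an optional #id.
_COMPOUND = re.compile(r'([A-Za-z][\w-]*|\*)?(?:#([\w-]+))?')


def css_to_xpath(selector):
    """Translate a tiny CSS subset into an XPath string.

    Toy illustration only: supports tag names, #id, and the descendant
    combinator (whitespace). This is not lxml's translator.
    """
    steps = []
    for token in selector.split():
        match = _COMPOUND.fullmatch(token)
        if match is None:
            raise ValueError('unsupported selector: %r' % token)
        tag, ident = match.group(1) or '*', match.group(2)
        steps.append("%s[@id='%s']" % (tag, ident) if ident else tag)
    if not steps:
        raise ValueError('empty selector')
    # ".//" searches below the context node; "//" between steps means
    # "any descendant", which is what whitespace means in CSS.
    return './/' + '//'.join(steps)


doc = ET.fromstring(
    '<html><body><div id="main"><p>one</p><span><p>two</p></span></div>'
    '<p>three</p></body></html>')
xpath = css_to_xpath('div#main p')
print(xpath)
print([p.text for p in doc.findall(xpath)])
```

Because the output is just a string, the same translation could be handed to any library that evaluates XPath, which is the portability point made above.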

2007 09 24

HTML
Python

2 Python Environment Experiments

I’ve been working on two experiments in the Python environment. The first is virtualenv, which is a rethinking of virtual-python.py, and my attempt to move away from workingenv. It works mostly like virtual-python.py, and on systems where it works (not Windows, nor Framework Mac Python) I think it works considerably better than workingenv. No more fighting with setuptools' site.py, in particular. The basic technique is that it creates/copies a new python executable, and anything that uses that executable (including a script that references that executable with #!) will use that environment.

On the systems where it doesn’t work, I’m not quite sure what to do. The problem with the Mac is that sys.prefix is not determined by the location of the python executable, it’s hard-coded in some fashion. I asked about it on distutils-sig and got some response, but haven’t figured out any solution yet.

On Windows, similarly, sys.prefix is not determined by the executable location. What it’s determined by there I don’t know: the location of python25.dll, something in the registry? If I could figure it out, perhaps this could work there too, since the existence of symlinks isn’t as important as it was with virtual-python.py.

If I can get these figured out, I think this will be a much happier experience than workingenv, and a somewhat friendlier experience than virtual-python.py. On non-Mac posix systems it works well right now.
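The whole technique hinges on how sys.prefix relates to the location of the python executable, and that is easy to inspect. A minimal sketch in Python 3 syntax; the `describe_environment` name is made up for illustration, and the "under prefix" check is only a rough heuristic:

```python
import os
import sys


def describe_environment():
    """Report which interpreter is running and where it looks for libraries."""
    executable = os.path.abspath(sys.executable or '')
    prefix = os.path.abspath(sys.prefix)
    return {
        'executable': sys.executable,
        'prefix': sys.prefix,
        # Rough check: does the executable live somewhere under the prefix?
        # Platforms that hard-code the prefix can report False here.
        'executable_under_prefix': executable.startswith(prefix),
    }


for key, value in sorted(describe_environment().items()):
    print('%s: %s' % (key, value))
```

Run with an environment's own python, the prefix should point at the environment directory rather than the system installation; where the prefix is hard-coded, it will not.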

The other experiment is in buildutils (downloadable with Mercurial): a new command python setup.py bundle, run in the application package you want to bundle. This creates a directory with all the dependencies of the application, and scripts that load up the appropriate dependencies. You can then ship the entire thing in a zip file as a runnable application that doesn’t require any installation except for unpacking.

Actually creating the bundle can be a little finicky, because easy_install has a tendency to prefer things on the local machine even though it shouldn’t. Probably it would be best to run this inside a virtualenv; when you are done you can also feel more confident that you’ve actually included all the dependencies (if you use --no-site-packages when creating the virtualenv).

Anyway, while both of these are a little incomplete I’m feeling optimistic about them, and I’m hoping intrepid souls can give feedback on how they work.

Update: virtualenv 0.8.2 is out, featuring Better Error Messages (and nothing else). Still doesn’t work on Mac Framework Pythons, or Windows. You’ll have to keep using workingenv there — but patches extremely welcome! Contact me if you are interested in supporting these platforms. It will involve some digging, but maybe we can just do the digging once for everyone.

Update 2: virtualenv 0.8.3 is out, featuring Windows!

2007 09 16

Programming
Python

FlatAtompub

A little while ago I decided to whip up a small Atompub server to get my head around the Atom Publishing Protocol. I called it FlatAtomPub because it was just storing stuff in flat files. I’m not committing to that name. It was also a chance to kick the tires on WebOb.

What I take out of the process:

APE was very handy. I lazily did very little unit testing, instead relying on APE to do the work for me. This seemed to work quite well. It is fun doing test driven development when someone else develops the tests.
I wrote a little decorator that serves as a kind of framework. It worked pretty well, I think. This might be the prototype of what the PylonsController.__call__ method does in some WebOb-using future.
Stuff like conditional requests and responses is mostly implemented in WebOb itself, which works well. HTTP is clear about how conditional requests work, so if you can just set up the basic info (ETag, Last-Modified) you can let the library handle the rest. I could probably save a little work by paying closer attention to ETags and Last-Modified up-front, but since there’s no complicated template rendering the work saved doesn’t seem significant.
The atom library removed most concern about the XML itself.
I don’t know what to do with collections. I guess I could allow multiple collections via configuration. If the store wasn’t a dumb store (e.g., it was plugged directly into a blog, it didn’t just passively store things) it would be clearer what a collection would mean. As it is, collections are just a way of aggregating multiple Atompub servers into one service document, which doesn’t seem very useful.
Handling links and slugs is kind of annoying. I took the lazy way out for this, using relative links and treating the slug and link as the same thing. This isn’t a good long-term solution, as it can mess things up if you start handing entries around, or worse move a server, and I don’t even set xml:base on elements. In theory it should all work, but it makes the client do more effort than I would like. On the other hand, I suppose a client should do that extra work anyway, as it shouldn’t rely on the server not being lazy. So maybe I’m better off sticking with a lazy solution, and making sure I work with non-lazy clients.
I considered pluggable storage, but ultimately decided it didn’t matter. Storing entries in files is fine; files are easy, and they work. I put in pluggable indexing instead. Amplee is another Python Atompub server, and I looked at Amplee storage backends. It’s kind of clever to have things like svn or S3 backends. But I’m not sure what use I’d actually have for that.
I haven’t yet considered transactions; if something fails part way through, stuff will be inconsistent. Admittedly this is where files make things harder. Probably a clear way to re-index would be useful too, as at least there’s a clear location for the canonical data (the files).
The dependencies are still a little tangled; even though the library doesn’t use a great deal of stuff, there are enough pieces that some people have had a hard time getting it set up.
Ignoring authentication is nice. I should see what it takes to set up some authentication, but implementing it directly is out of scope for FlatAtomPub.
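As an illustration of the decorator and conditional-request points in the list above, here is a minimal standard-library sketch of the idea. It is not FlatAtomPub's decorator and not WebOb's implementation: `with_etag` and `entry` are hypothetical names, only If-None-Match is handled, and Last-Modified and weak validators are ignored.

```python
import hashlib


def with_etag(get_body):
    """Wrap a function returning bytes as a WSGI app with ETag handling.

    Hypothetical decorator for illustration only; a real library covers
    far more of HTTP's conditional-request rules than this does.
    """
    def app(environ, start_response):
        body = get_body(environ)
        etag = '"%s"' % hashlib.sha256(body).hexdigest()
        header = environ.get('HTTP_IF_NONE_MATCH', '')
        candidates = [value.strip() for value in header.split(',')]
        if etag in candidates or '*' in candidates:
            # The client's copy is current: send headers only.
            start_response('304 Not Modified', [('ETag', etag)])
            return []
        start_response('200 OK', [
            ('ETag', etag),
            ('Content-Type', 'application/atom+xml'),
            ('Content-Length', str(len(body))),
        ])
        return [body]
    return app


@with_etag
def entry(environ):
    return b'<entry xmlns="http://www.w3.org/2005/Atom"/>'
```

The handler only supplies the body; everything about validators and status codes lives in one place, which is the appeal of letting a library do it.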

Interested people can look at the svn repository.

This makes me wonder how hard WebDAV would be…

2007 09 12

Programming
Python
Web

Re-raising Exceptions

After reading Chris McDonough’s What Not To Do When Writing Python Software, it occurred to me that many people don’t actually know how to properly re-raise exceptions. So a little mini-tutorial for Python programmers, about exceptions…

First, this is bad:

try:
    some_code()
except:
    revert_stuff()
    raise Exception("some_code failed!")

It is bad because all the information about how some_code() failed is lost. The traceback, the error message itself. Maybe it was an expected error, maybe it wasn’t.

Here’s a modest improvement (but still not very good):

try:
    some_code()
except:
    import traceback
    traceback.print_exc()
    revert_stuff()
    raise Exception("some_code failed!")

traceback.print_exc() prints the original traceback to stderr. Sometimes that’s the best you can do, because you really want to recover from an unexpected error. But if you aren’t recovering, this is what you should do:

try:
    some_code()
except:
    revert_stuff()
    raise

Using raise with no arguments re-raises the last exception. Sometimes people give a blanket “never use except:” rule, but this particular form (except: + raise) is okay.
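A runnable version of the except: + raise form, written in Python 3 syntax, shows that the traceback really does survive. The `some_code`, `revert_stuff` and `run` functions are stand-ins invented for the demonstration.

```python
import traceback

reverted = []


def some_code():
    raise ValueError('bad input')


def revert_stuff():
    reverted.append(True)


def run():
    try:
        some_code()
    except:
        revert_stuff()
        raise  # re-raise the active exception with its original traceback


try:
    run()
except ValueError as exc:
    # List the function names recorded in the traceback, outermost first.
    frames = [frame.name for frame in traceback.extract_tb(exc.__traceback__)]
    print(frames)
```

The last name in the list is some_code, the place the error actually started, even though the exception passed through the handler in run() on the way out.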

There’s another form of raise that not many people know about, but can also be handy. Like raise with no arguments, it can be used to keep the traceback:

try:
    some_code()
except:
    import sys
    exc_info = sys.exc_info()
    maybe_raise(exc_info)

def maybe_raise(exc_info):
    # should_be_raised() is a placeholder for whatever test decides
    # that this exception should be raised after all
    if should_be_raised(exc_info):
        raise exc_info[0], exc_info[1], exc_info[2]

This can be handy if you need to handle the exception in some different part of the code from where the exception happened. But usually it’s not that handy; it’s an obscure feature for a reason.
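The three-argument raise is Python 2 syntax. For readers on Python 3, a rough equivalent is `raise exc.with_traceback(tb)`. The sketch below assumes a made-up `should_raise` flag in place of whatever decision the real code would make.

```python
import sys


def some_code():
    raise KeyError('missing')


def maybe_raise(exc_info, should_raise):
    """Re-raise a saved exception somewhere else, keeping its traceback."""
    exc_class, exc, tb = exc_info
    if should_raise:
        # Python 3 spelling of "raise exc_class, exc, tb".
        raise exc.with_traceback(tb)


try:
    some_code()
except:
    saved = sys.exc_info()

maybe_raise(saved, should_raise=False)  # decides not to raise; nothing happens
```

Calling `maybe_raise(saved, should_raise=True)` later raises the same KeyError, and its traceback still includes the some_code() frame along with the new maybe_raise() frame.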

Another case when people often clobber the traceback is when they want to add information to it, e.g.:

for lineno, line in enumerate(file):
    try:
        process_line(line)
    except Exception, exc:
        raise Exception("Error in line %s: %s" % (lineno, exc))

You keep the error message here, but lose the traceback. There’s a couple ways to keep that traceback. One I sometimes use is to retain the exception, but change the message:

except Exception, exc:
    args = exc.args
    if not args:
        arg0 = ''
    else:
        arg0 = args[0]
    arg0 += ' at line %s' % lineno
    # exc.args is a tuple, so the new message has to be wrapped in one
    exc.args = (arg0,) + args[1:]
    raise
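Put together with the loop, a self-contained version in Python 3 syntax might look like the following sketch. `process_line` is a stand-in that simply parses an integer, and line numbers start at 1 here, which is a choice made for the example.

```python
def process_line(line):
    return int(line)


def process_lines(lines):
    results = []
    for lineno, line in enumerate(lines, 1):
        try:
            results.append(process_line(line))
        except Exception as exc:
            args = exc.args
            arg0 = args[0] if args else ''
            # exc.args must stay a tuple, so wrap the new message in one.
            exc.args = ('%s at line %s' % (arg0, lineno),) + args[1:]
            raise
    return results
```

Feeding it `['1', '2', 'x']` raises the original ValueError from int(), with " at line 3" appended to its message and the traceback still pointing into process_line().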

It’s a little awkward. Technically (though it’s deprecated) you can raise anything as an exception. If you use except Exception: you won’t catch things like string exceptions or other weird types. It’s up to you to decide if you care about these cases; I generally ignore them. It’s also possible that an exception won’t have .args, or the string message for the exception won’t be derived from those arguments, or that it will be formatted in a funny way (KeyError formats its message differently, for instance). So this isn’t foolproof. To be a bit more robust, you can get the exception like this:

except:
    exc_class, exc, tb = sys.exc_info()

exc_class will be a string, if someone does something like raise "not found". There’s a reason why that style is deprecated. Anyway, if you really want to mess around with things, you can then do:

new_exc = Exception("Error in line %s: %s"
                    % (lineno, exc or exc_class))
raise new_exc.__class__, new_exc, tb

The confusing part is that you’ve changed the exception class around, but you have at least kept the traceback intact. It can look a little odd to see raise ValueError(...) in the traceback, and Exception in the error message.
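Python 3, which postdates this post, has a built-in way to get a similar effect: explicit chaining with `raise ... from`. A minimal sketch, using the same made-up `process_line` stand-in and RuntimeError as an arbitrary choice of wrapper class:

```python
def process_line(line):
    return int(line)


def process_lines(lines):
    for lineno, line in enumerate(lines, 1):
        try:
            process_line(line)
        except Exception as exc:
            # "from exc" stores the original exception, traceback included,
            # on the new exception's __cause__ attribute.
            raise RuntimeError(
                'Error in line %s: %s' % (lineno, exc)) from exc
```

The printed traceback then shows both exceptions, so the class mismatch is spelled out instead of looking odd.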

Anyway, a quick summary of proper ways to re-raise exceptions in Python. May your tracebacks prosper!

Update: Kumar notes the problem of errors in your error handler. Things get more long-winded, but here’s the simplest way I know of to deal with that:

import sys

try:
    code()
except:
    exc_info = sys.exc_info()
    try:
        revert_stuff()
    except:
        # If this happens, it clobbers exc_info, which is why we had
        # to save it above
        import traceback
        print >> sys.stderr, "Error in revert_stuff():"
        traceback.print_exc()
    raise exc_info[0], exc_info[1], exc_info[2]
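In Python 3 the saved exc_info is no longer needed for this case, because the interpreter restores the outer exception once the inner handler finishes. A sketch in Python 3 syntax, where `code`, `revert_stuff` and `run` are stand-ins that fail on purpose:

```python
import sys
import traceback


def code():
    raise ValueError('original failure')


def revert_stuff():
    raise RuntimeError('cleanup failed too')


def run():
    try:
        code()
    except:
        try:
            revert_stuff()
        except Exception:
            print('Error in revert_stuff():', file=sys.stderr)
            traceback.print_exc()
        # The inner handler has finished, so the ValueError is the active
        # exception again and a bare raise re-raises it.
        raise
```

Calling run() reports the cleanup failure on stderr and then propagates the original ValueError.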

2007 09 12

Python

9/11/2007

So, today is 9/11. I almost missed it. It’s not like it catches you by surprise, you’re not going to forget the date. But it’s just been slipping by for a few years now without much notice.

As an event it is still very important. History flowed from that day. But it doesn’t mean anything anymore.

Remember how everyone was saying, on those days after 9/11/2001, that they thought about life differently, about the things that really mattered and the things that didn’t? A couple years ago I felt frustrated by how quickly that seemed to disappear, how quickly genuine sentiment turned into empty rhetoric. A few years ago that transition was frustrating, now the whole thing seems laughable. The death of irony? No… after 9/11 our modern cynicism was down but it wasn’t out. It came back fighting, and a National Sense Of Grief was no match.

Whatever. I’m tired of it anyway. You win Whatever, you’re the champ.

2007 09 11

Non-technical
Politics
