lxml.html

Over the summer I did quite a bit of work on lxml.html. I’m pretty excited about it, because with just a little work HTML starts to be very usefully manipulable. This isn’t how I’ve felt about HTML in the past, with all HTML emerging from templates and consumed only by browsers.

The ElementTree representation (which lxml copies) is a bit of a nuisance when representing HTML. A few methods improve it, but it is still awkward for content with mixed tags and text (common in HTML, uncommon in most other XML). Looking at Genshi Transforms there are some things I wish we could do, like simply "unwrap" text and then wrap it again. But once you remove a tag the text is thoroughly merged into its neighbors. Another little nuisance is that el.text and el.tail can be None, which means you have to guard a lot of code.

That said, here’s the Genshi example:


>>> html = HTML('''<html>
...   <head></head>
...   <body>
...     Some <em>body</em> text.
...   </body>
... </html>''')
>>> print html | Transformer('body/em').map(unicode.upper, TEXT) \
...                                    .unwrap().wrap(tag.u).end() \
...                                    .select('body/u') \
...                                    .prepend('underlined ')
 

Here’s how you’d do it with lxml.html:


>>> from lxml.html import fromstring
>>> html = fromstring('''... same thing ...''')
>>> def transform(doc):
...     for el in doc.xpath('body/em'):
...         el.text = (el.text or '').upper()
...         el.tag = 'u'
...     for el in doc.xpath('body/u'):
...         el.text = 'underlined ' + (el.text or '')
 

I’m not sure if Genshi works in-place here, or makes a copy; otherwise these are pretty much equivalent. Which is better? Personally I prefer mine, and actually prefer it quite strongly, because it’s quite simple — it’s a function with loops and assignments. It’s practically pedestrian in comparison to the Genshi example, which uses methods to declaratively create a transformer.

Some of the things now in lxml.html include:

  • Link handling, which is particularly focused on rewriting links so you can put HTML fragments into a new context without breaking the relative links.
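
    For instance, a rough sketch of what I mean (the method names make_links_absolute, rewrite_links and iterlinks are the ones I have in mind; the URLs are just illustrative):

    >>> from lxml.html import fromstring
    >>> doc = fromstring('<div><a href="../foo.html">foo</a></div>')
    >>> doc.make_links_absolute('http://example.com/sub/page.html')
    >>> doc.rewrite_links(lambda link: link.replace('example.com', 'example.org'))
    >>> for el, attr, link, pos in doc.iterlinks():
    ...     print attr, link
    href http://example.org/foo.html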

  • Smart doctest comparisons (attribute-order-neutral and whitespace-neutral comparisons, with improved diffs, based loosely on formencode.doctest_xml_compare). Inside your doctest choose XML parsing with from lxml import usedoctest or HTML parsing with from lxml.html import usedoctest. I consider the import trick My Worst Monkeypatch Ever, but it kind of reads nicely, and for testing it is very handy.
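
    Inside a doctest it looks something like this (a sketch): the example passes even though the expected output lists the attributes in a different order, because both sides are parsed and compared as HTML.

    >>> from lxml.html import usedoctest
    >>> print '<p class="a" id="b">Hi</p>'
    <p id="b" class="a">Hi</p>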

  • Cleaning code, to avoid XSS attacks, in lxml.html.clean. This is still pretty messy, because there’s lots of little things you may or may not want to protect against. E.g., I think I can mostly clean out style tags (at least of Javascript), but some people might want to remove all style. So there’s an option. There’s lots of options. Too many.
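
    A minimal sketch (the Cleaner class and its style option are the kind of thing I mean by "there’s an option"):

    >>> from lxml.html.clean import Cleaner
    >>> cleaner = Cleaner(style=True)   # style=True also strips <style> tags and style="..." attributes
    >>> dirty = '<p style="x" onclick="evil()">Hi<script>evil()</script></p>'
    >>> cleaned = cleaner.clean_html(dirty)   # script, onclick and style are removed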

  • With the cleaning code there’s word-wrapping code and autolinking code. I think of these as clean-up-people’s-scrappy-HTML tools. Also important for putting untrusted HTML in a new context.
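
    Roughly (a sketch using autolink_html and word_break_html from lxml.html.clean; the input is made up):

    >>> from lxml.html.clean import autolink_html, word_break_html
    >>> linked = autolink_html('<p>See http://example.com for details</p>')   # wraps the bare URL in an <a>
    >>> wrapped = word_break_html(linked)   # inserts break points into very long words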

  • I rewrote htmlfill in lxml.html.formfill. It’s a bit simpler, and keeps error messages separate from actual value filling. They were really only combined because I didn’t want to do two passes with HTMLParser for the two steps, but that doesn’t matter when you load the document into memory. I also stopped using markup like <form:error> for placing error messages; it’s all automatic now, which I suppose is both good and bad.
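
    A sketch of the filling part (fill_form is the function; the form markup and values here are made up):

    >>> from lxml.html import fromstring
    >>> from lxml.html.formfill import fill_form
    >>> doc = fromstring('<form><input type="text" name="email"></form>')
    >>> fill_form(doc, {'email': 'joe@example.com'})   # fills the input values in place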

  • After I wrote lxml.html.formfill I got it into my head to make smarter forms more natively. So now you can do:


    >>> from lxml.html import parse
    >>> page = parse('http://tripsweb.rtachicago.com/').getroot()
    >>> form = page.forms[0]
    >>> from pprint import pprint
    >>> pprint(form.form_values())
    [('action', 'entry'),
     ('resptype', 'U'),
     ('Arr', 'D'),
     ('f_month', '09'),
     ('f_day', '21'),
     ('f_year', '2007'),
     ('f_hours', '9'),
     ('f_minutes', '30'),
     ('f_ampm', 'AM'),
     ('Atr', 'N'),
     ('walk', '0.9999'),
     ('Min', 'T'),
     ('mode', 'A')]
    >>> for key in sorted(form.fields.keys()):
    ...     print key
    None
    Arr
    Atr
    Dest
    Min
    Orig
    action
    dCity
    endpoint
    f_ampm
    f_day
    f_hours
    f_minutes
    f_month
    f_year
    mode
    oCity
    resptype
    startpoint
    walk
    >>> form.fields['Orig'] = '1500 W Leland'
    >>> form.fields['Dest'] = 'LINCOLN PARK ZOO'
    >>> from lxml.html import submit_form
    >>> result = parse(submit_form(form)).getroot()
     

    From there I’d have to actually scrape the results to figure out what the best trip was, which isn’t as easy.

  • HTML diffing and something like svn blame for a series of documents, in lxml.html.diff. Someone noted a similarity between htmldiff and templatemaker, and they are conceptually similar, but with very different purposes. htmldiff goes to great trouble to ignore markup and focus only on changes to textual content. As such it is great for a history page. templatemaker focuses on the dissection of computer-generated HTML and extracting its human-generated components. Templatemaker is focused on screen scraping. It might be handy in that form example above…
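
    For instance (a sketch; I’m not showing the exact output, which wraps the textual changes in <ins> and <del> tags):

    >>> from lxml.html.diff import htmldiff, html_annotate
    >>> old = '<p>Ride the <b>Brown</b> Line</p>'
    >>> new = '<p>Ride the <b>Red</b> Line</p>'
    >>> diff = htmldiff(old, new)   # HTML with the textual changes marked up
    >>> blame = html_annotate([(old, 'rev1'), (new, 'rev2')])   # the svn-blame-like part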

  • There’s also a fairly complete implementation of CSS 3 selectors. It would be interesting to mix this with cssutils.

    Though some people aren’t so enthusiastic about CSS namespaces (and I can’t really blame them), conveniently this CSS 3 feature makes CSS selectors applicable to all XML. I don’t know if anyone is actually going to use them instead of XPath on non-HTML documents, but you could. Because the implementation just compiles CSS to XPath, you could potentially use this module with other XML libraries that know XPath. Of which I only actually know one (or two <http://genshi.edgewall.org/>?) — though compiling CSS to XPath, then having XPath parsed and interpreted in Python, is probably not a good idea. But if you are so inclined, there’s also a parser in there you could use.
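
    Something like this (a sketch; CSSSelector lives in lxml.cssselect):

    >>> from lxml.cssselect import CSSSelector
    >>> sel = CSSSelector('div.content a[href]')
    >>> links = sel(page)        # page is any parsed document, e.g. the one from the form example
    >>> xpath_expr = sel.path    # the XPath expression the selector was compiled to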

  • lxml and BeautifulSoup are no longer exclusive choices: lxml.html.ElementSoup.parse() can parse pages with BeautifulSoup into lxml data structures. While the native lxml/libxml2 HTML parser works on pretty bad HTML, BeautifulSoup works on really bad HTML. It would be nice to have something similar with html5lib.
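
    E.g. (a sketch, assuming parse accepts an open file; the filename is made up):

    >>> from lxml.html import ElementSoup
    >>> root = ElementSoup.parse(open('some-really-bad-page.html'))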


Reflection and Description Of Meaning

After writing my last post I thought I might follow up with a bit of cognitive speculation. Since the first comment was exactly about the issue I was thinking about writing on, I might as well follow up quickly.

Jeff Snell replied:

You parse semantic markup in rich text all the time. When formatting changes, you apply a reason. RFC’s don’t capitalize MUST and SHOULD because the author is thinking in upper-case versus lower-case. They’re putting a strong emphasis on those words. As a reader, you take special notice of those words being formatted that way and immediately recognize that they contain a special importance. So I think that readers do parse writing into semantic markup inside their brains.

Emphasis not added. Wait, bold isn’t emphasis, it’s strong! So sorry, STRONG not added.

I think the reasoning here is flawed, in that it supposes that reflection on how we think is an accurate way of describing how we think.

A few years ago I got interested in cognition for a while and particularly some of the new theories on consciousness. One of the parts that really stuck with me was the difference in how we think about thinking, and how thinking really works (as revealed with timing experiments). That is, our conscious thought (the thinking-about-thinking) happens after the actual thought; we make up reasons for our actions when we’re challenged, but if we aren’t challenged to explain our actions there’s no consciousness at all (of course, you can challenge yourself to explain your reasoning — but you usually won’t). And then we revise history so that our reasoning precedes our decision, but that’s not always very accurate. This gets around the infinite-loop problem, where either there’s always another level of meta-consciousness reasoning about the lower level of consciousness, or there’s a potentially infinite sequence of whys that have to be answered for every decision. And of course sometimes we really do make rational decisions and there are several levels of why answered before we commit. But this is not the most common case, and there’s always a limit to how much reflection we can do. There are always decisions made without conscious consideration — if only to free ourselves to focus on the important decisions.

And so as both a reader and a writer, I think in terms of italic and bold. As a reader and a writer there is of course translation from one form to another. There’s some idea inside of me that I want to get out in my writing, there’s some idea outside of me that I want to understand as a reader. But just because I can describe some intermediate form of semantic meaning, it doesn’t mean that that meaning is actually there. Instead I invent things like "strong" and "emphasis" when I’m asked to decide why I chose a particular text style. But the real decision is intuitive — I map directly from my ideas to words on the page, or vice versa for reading.

Obviously this is not true for all markup. But my intuition as both a reader and a writer about bold and italic is strong enough that I feel confident there’s no intermediary representation. This is not unlike the fact I don’t consider the phonetics of most words (though admittedly I did when trying to spell "phonetics"); common words are opaque tokens that I read in their entirety without consideration of their component letters. And a good reader reads text words without consideration of their vocal equivalents (though as a writer I read my own writing out loud… is that typical? I’m guessing it is). A good reader can of course vocalize if asked, but that doesn’t mean the vocalization is an accurate representation of their original reading experience.

Though it’s kind of an aside, I think the use of MUST and SHOULD in RFCs fits with this theory. By using all caps they emphasize the word over the prose, they make the reader see the words as tokens unique from "must" and "should", with special meanings that are related to but also much more strict than their usual English meaning. The caps are a way of disturbing our natural way of determining meaning because they need a more exact language.
