Ian Bicking: a blog :: 2008

Environmental Guilt

I was offhandedly reading this post, which talked about Earth Hour, and about hating on SUVs:

Also thinking of a nice, simple mass-action for discouraging the SUV-ites. Simple, direct; when you see someone driving an SUV, slowly shake your head in disappointment and disgust at the stupidity of the driver. Throw in a disgusted sneer and snort if you like. It’s not necessarily the driver that you’re targeting, the people around you are probably more likely to affect purchasing decisions.

Well, a rather pedestrian level of hate as environmental discussions go.

I hate SUVs too. There’s a very small number of people who have good reason to own an SUV. Everyone else should own a normal car or a minivan (more practical in all the ways that matter, it just doesn’t look as cool). OK, the irony is that the minivan isn’t going to be much more efficient, it’s just that I’ll trust you have good reason to own a big vehicle, because you’ll have weighed the utility against the supreme uncoolness of a minivan. And anyway, four people barreling down the highway in a minivan is more efficient than one person in a Prius. With an SUV I’ll always suspect vanity. And what’s worse, I won’t think less of you just because of the resources you take up (not just carbon, but street space, visual impact, etc)… I’ll also think less of you because I’ll have you pegged as a dumb consumer. And don’t give me any bullshit about getting around in the snow: then I’ll just peg you as a lousy driver, because I’ve been driving out of snow drifts in crappy low-clearance underpowered cars all my life without much trouble.

But I digress. Yeah, SUVs are shit. So what does it matter if I think so? I can only not buy an SUV so many times. If I don’t buy a million SUVs will I have saved the world? No. So, like Mike I wish I could get other people not to buy SUVs. I’ve considered tagging SUVs with these bumper stickers, but I dunno. Will I do anything more than piss some people off? If I make some soccer-mom type feel guilty, will I have actually accomplished anything? I think she’s a stupid consumer, and probably self-centered in her choice, but do I actually want that person to feel bad, or mad, or unjustly accused? The only outcome I can think of is some negative reaction, and maybe that reaction could be productive. But probably not.

The idea really fell apart as I reflected on all the Hispanic people in their SUVs going to the Catholic church next door, and realized that if I tagged one of their SUVs it would probably be even more pointless. This was their symbol of success, and you certainly can’t fault them for getting a big vehicle if they are filling it up, even if their particular choice of SUV was just a reflection of cowboy dreams — but when they bought the SUV instead of the minivan they only wasted some money, they didn’t really do any worse for the world.

But getting beyond the particulars of SUVs, I feel environmentalism has a real problem. It is built on guilt. A NIMBY action, or maybe land conservation, can actually be explained as rational direct action. Personal effort can result in the improvement of your personal space. But global warming? Personal action doesn’t do anything, it can’t do anything. All we have is guilt, a sense of collective responsibility, fear over some collective doom.

Guilt is a crappy foundation for a movement. One thing our commercial and consumerist world has going for it: there’s no guilt. The salesman won’t question why you are buying something. It’s always "thank you sir, have a nice day!" And even though sophisticated people will mock the insincerity of the expression, we’re still human and a kind word and a smile still makes us feel better, no matter how our rational mind rejects it.

But environmentalism? The most common reactions to guilt are avoidance, procrastination, resentment. Guilt is a horrible way to achieve action. Judgment can be a way to build group identity, and environmentalism has achieved this. It means something to be an "environmentalist". But that’s hardly the goal, is it?

People want to do the right thing for the world. They want to stop global warming, they want to reduce pollution, save wildlife, all that stuff. All the surveys show this. We’re not going to get any closer to consensus (on goals) than we are already. If we, collectively and individually, are still not doing what we need to, then it is not for lack of a collective desire, or even a lack of education.

So how do we turn desire into action? I don’t think guilt is a good way to do it. I’m not sure I like that path anyway. Is it an irrational reaction to guilt that we try to avoid judgment? Is it irrational that people are drawn to an environment where they are told they are good, where they are accepted, where they can act to achieve clear goals (even if that is just a purchase), where they can succeed? Consumerism may only draw people to an unimpressive local maximum of happiness, but it always makes the pursuit clear, consumerism draws you forward, consumerism offers a clear path.

And even if you choose to accept and respond to the guilt of environmentalism, it won’t stop. First you turn the water off while you are brushing your teeth. Then you get rid of the SUV. You replace your bulbs with CFLs. Are you ready to get rid of your dryer? Put your thermostat at 60F? Eat organic? Stop eating meat? Join or start a co-op? Get a composting toilet? Go off the grid? There’s always more to be done, there’s always another thing to feel guilty about not doing. It’s disheartening.

Is there a way environmentalism can be less depressing? Less guilt-driven? Less accusing and judgmental? Can environmentalism be less dismal, more happy? Environmentalism is trying to drive a wedge between what people want and what they do. Putting aside moral arguments, is this an effective way to make change?

Considering my carbon footprint has only made this worse. Every action is negative. Everything I do has a cost. Pursuing carbon neutrality feels like a pursuit of non-existence. People are questioning the growth imperative, but at least growth has a certain excitement to it. Do we step into the future with confidence or fear? Do we take each step with trepidation and dread? What a horrible way to come into contact with our future selves! I want to meet all of our future selves with arms open. Buying shit is a poor substitute for that optimism. But dammit, I want to be optimistic. I don’t want to just be guilty.

2008 03 31

Non-technical
Politics


Python HTML Parser Performance

In preparation for my PyCon talk on HTML I thought I’d do a performance comparison of several parsers and document models.

The situation is a little complex because there are different steps in handling HTML:

1. Parse the HTML
2. Parse it into something (a document object)
3. Serialize it

Some libraries handle 1, some handle 2, some handle 1, 2, 3, etc. For instance, ElementSoup uses ElementTree as a document, but BeautifulSoup as the parser. BeautifulSoup itself has a document object included. HTMLParser only parses, while html5lib includes tree builders for several kinds of trees. There is also XML and HTML serialization.
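To make the three steps concrete, here is a minimal sketch using only the standard library (in Python 3 the HTMLParser module lives at html.parser). The names TreeBuildingParser, parse_html, and VOID are made up for this illustration, and it only copes with well-nested markup; it is not how any of the benchmarked libraries work internally.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag (a partial, illustrative list).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuildingParser(HTMLParser):
    """Step 1 (tokenizing) feeding step 2 (building a document object)."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; store them as "".
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID:
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def parse_html(text):
    """Steps 1 and 2: return an ElementTree element for well-nested HTML."""
    parser = TreeBuildingParser()
    parser.feed(text)
    parser.close()
    return parser.builder.close()


doc = parse_html("<html><body><p class='x'>Hello <b>world</b></p></body></html>")

# Step 3: serialize the document object back to markup.
serialized = ET.tostring(doc, encoding="unicode", method="html")
print(serialized)
```

A library that "handles 1, 2, 3" bundles all of this; a parser-only library stops before the TreeBuilder part.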

So I’ve taken several combinations and made benchmarks. The combinations are:

lxml: a parser, document, and HTML serializer. Also can use BeautifulSoup and html5lib for parsing.
BeautifulSoup: a parser, document, and HTML serializer.
html5lib: a parser. It has a serializer, but I didn’t use it. It has a built-in document object (simpletree), but I don’t think it’s meant for much more than self-testing.
ElementTree: a document object, and XML serializer (I think newer versions might include an HTML serializer, but I didn’t use it). It doesn’t have a parser, but I used html5lib to parse to it. (I didn’t use the ElementSoup.)
cElementTree: a document object implemented as a C extension. I didn’t find any serializer.
HTMLParser: a parser. It didn’t parse to anything. It also doesn’t parse lots of normal (but maybe invalid) HTML. When using it, I just ran documents through the parser, not constructing any tree.
htmlfill: this library uses HTMLParser, but at least pays a little attention to the elements as they are parsed.
Genshi: includes a parser, document, and HTML serializer.
xml.dom.minidom: a document model built into the standard library, which html5lib can parse to. (I do not recommend using minidom for anything — some reasons will become apparent in this post, but there are many other reasons not covered why you shouldn’t use it.)

I expected lxml to perform well, as it is based on the C library libxml2. But it performed better than I realized, far better than any other library. As a result, if it wasn’t for some persistent installation problems (especially on Macs) I would recommend lxml for just about any HTML task.

You can try the code out here. I’ve included all the sample data, and the commands I ran for these graphs are here. These tests use a fairly random selection of HTML files (355 total) taken from python.org.

Parsing

lxml:0.6; BeautifulSoup:10.6; html5lib ElementTree:30.2; html5lib minidom:35.2; Genshi:7.3; HTMLParser:2.9; htmlfill:4.5

The first test parses the documents. Things to note: lxml is roughly 5x faster than even HTMLParser (0.6 versus 2.9 in the figures above), even though HTMLParser isn’t doing anything (lxml is building a tree in memory). I didn’t include all the things html5lib can parse to, because they all take about the same amount of time. xml.dom.minidom is only included because it is so noticeably slow. Genshi is fairly fast, but it’s the most fragile of the parsers. html5lib, lxml, and BeautifulSoup are all fairly similarly robust. html5lib has the benefit of (at least in theory) producing the correct parse of HTML.
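The real benchmark scripts are the ones linked earlier. Purely as a rough idea of what a parse-timing measurement looks like, here is a sketch with hypothetical names (time_parser, parse_only) and a synthetic document set rather than the python.org files. It times the standard library tokenizer with no tree building, which is the "HTMLParser" case above.

```python
import time
from html.parser import HTMLParser


def time_parser(parse, documents, repeat=3):
    """Best-of-`repeat` wall-clock seconds to run `parse` over every document."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for text in documents:
            parse(text)
        best = min(best, time.perf_counter() - start)
    return best


def parse_only(text):
    """Run the tokenizer without building anything; the base handlers are no-ops."""
    parser = HTMLParser()
    parser.feed(text)
    parser.close()


# Synthetic stand-in for a directory of HTML files.
documents = ["<html><body>" + "<p>hello <b>world</b></p>" * 200 + "</body></html>"] * 20

seconds = time_parser(parse_only, documents)
print("parse-only: %.4f seconds" % seconds)
```

Swapping in another callable for parse_only (one per library) would give comparable numbers, as long as every library sees the same documents.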

While I don’t really believe it matters often, lxml releases the GIL during parsing.

Serialization

lxml:0.3; BeautifulSoup:2.0; html5lib ElementTree:1.9; html5lib minidom:3.8; Genshi:4.4

Serialization is pretty fast across all the libraries, though again lxml leads the pack by a long distance. ElementTree and minidom are only doing XML serialization, but there’s no reason that the HTML equivalent would be any faster. That Genshi is slower than minidom is surprising. That anything is worse than minidom is generally surprising.
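For anyone wondering what the difference between XML and HTML serialization amounts to, here is a small illustration with the standard library's ElementTree, whose tostring() accepts a method argument in current versions. The little tree is invented for the example.

```python
import xml.etree.ElementTree as ET

# Build <p>line one<br>line two<span></span></p> by hand.
p = ET.Element("p")
p.text = "line one"
br = ET.SubElement(p, "br")
br.tail = "line two"
ET.SubElement(p, "span")

as_xml = ET.tostring(p, encoding="unicode", method="xml")
as_html = ET.tostring(p, encoding="unicode", method="html")

print(as_xml)   # empty elements are self-closed, which browsers mishandle for <span>
print(as_html)  # <br> has no end tag, <span> gets an explicit one
```

The work involved is nearly identical, which fits the guess that an HTML serializer wouldn't be any faster than the XML one.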

Memory

lxml:26; BeautifulSoup:82; BeautifulSoup lxml:104; html5lib cElementTree:54; html5lib ElementTree:64; html5lib simpletree:98; html5lib minidom:192; Genshi:64; htmlfill:5.5; HTMLParser:4.4

The last test is of memory. I don’t have a lot of confidence in the way I made this test, but I’m sure it means something. This was done by parsing all the documents and holding the documents in memory, and using the RSS size reported by ps to see how much the process had grown. All the libraries should be imported when calculating the baseline, so only the documents and parsing should cause the memory increase.
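A sketch of that measurement, under some assumptions: a Unix-like system where `ps -o rss= -p PID` prints the resident set size in kilobytes, and hypothetical helper names (rss_kb, memory_growth) that are not from the benchmark code. The reader function is passed in so it can be swapped out.

```python
import os
import subprocess


def rss_kb(pid=None):
    """Resident set size in kilobytes as reported by `ps` (Unix only)."""
    pid = os.getpid() if pid is None else pid
    out = subprocess.check_output(["ps", "-o", "rss=", "-p", str(pid)])
    return int(out.decode().strip())


def memory_growth(parse, documents, read_rss=rss_kb):
    """Parse every document, keep all the results alive, and report RSS growth.

    Import the library under test before calling this, so the baseline
    already includes the cost of the import itself.
    """
    baseline = read_rss()
    held = [parse(text) for text in documents]  # hold references so nothing is freed
    growth = read_rss() - baseline
    return growth, held
```

Usage would look like `growth, docs = memory_growth(some_parse_function, list_of_html_strings)`, with the returned list kept around until after the reading is taken.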

HTMLParser is a baseline, as it just keeps the documents in memory as a string, and creates some intermediate strings. The intermediate strings don’t end up accounting for anything, since the memory used is almost exactly the combined size of all the files.

A tricky part of this measurement is that the Python allocator doesn’t let go of memory that it requests, so if a parser creates lots of intermediate strings and then releases them the process will still hang onto all that memory. To detect this I tried allocating new strings until the process size grew (trying to detect allocated but unused memory), but this didn’t reveal much — only the BeautifulSoup parser, serialized to an lxml tree, showed much extra memory.
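The "allocate until the process grows" probe can be sketched roughly like this. It is only an approximation of the idea: unused_allocated_bytes is a made-up name, read_rss is any callable returning the current process size (for instance one that shells out to ps), and the result counts payload bytes only, ignoring per-object overhead.

```python
def unused_allocated_bytes(read_rss, piece=256, check_every=1000, max_pieces=1_000_000):
    """Allocate small byte strings until the process size grows.

    Whatever could be allocated before the growth was presumably served from
    memory the process already held but was not using.
    """
    start = read_rss()
    hoard = []  # keep every piece alive so the allocator cannot reuse it
    while len(hoard) < max_pieces:
        hoard.append(bytes(piece))
        # Reading the process size is expensive, so only check now and then.
        if len(hoard) % check_every == 0 and read_rss() > start:
            break
    return len(hoard) * piece
```

The granularity is coarse (the process size moves in pages, and the check only happens every check_every allocations), so small amounts of slack won't show up.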

This is one of the only places where html5lib with cElementTree was noticeably different than html5lib with ElementTree. Not that surprising, I guess, since I didn’t find a coded-in-C serializer, and I imagine the tree building is only going to be a lot faster for cElementTree if you are building the tree from C code (as its native XML parser would do).

lxml is probably memory efficient because it uses native libxml2 data structures, and only creates Python objects on demand.

In Conclusion

I knew lxml was fast before I started these benchmarks, but I didn’t expect it to be quite this fast.

So in conclusion: lxml kicks ass. You can use it in ways you couldn’t use other systems. You can parse, serialize, parse, serialize, and repeat the process a couple times with your HTML before the performance will hurt you. With high-level constructs many operations can happen in very fast C code without calling out to Python. As an example, if you do an XPath query, the query string is compiled into something native and traverses the native libxml2 objects, only creating Python objects to wrap the query results. In addition, things like the modest memory use make me more confident that lxml will act reliably even under unexpected load.
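As a small illustration of the parse, query, serialize, parse-again cycle (this assumes lxml is installed; the sample markup is invented, with a deliberately unclosed paragraph):

```python
import lxml.html

page = (
    "<html><body><p>See <a href='/a'>A</a> and <a href='/b'>B</a>"
    "<p>unclosed</body></html>"
)

# Parse into a document; libxml2 repairs the missing </p> tags.
doc = lxml.html.fromstring(page)

# The XPath query runs in C; only the results become Python objects.
hrefs = doc.xpath("//a/@href")

# Serialize, then parse the output again to show the round trip is stable.
out = lxml.html.tostring(doc, encoding="unicode")
doc_again = lxml.html.fromstring(out)

print(hrefs)
print(out)
```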

I also am more confident about using a document model instead of stream parsing. It is sometimes felt that streamed parsing is better: you don’t keep the entire document in memory, and your work generally scales linearly with your document size. HTMLParser is a stream-based parser, emitting events for each kind of token (open tag, close tag, data, etc). Genshi also uses this model, with higher-level stuff like filters to make it feel a bit more natural. But the stream model is not the natural way to process a document, it’s actually a really awkward way to handle a document that is better seen as a single thing. If you are processing gigabyte files of XML it can make sense (and both the normally document-oriented lxml and ElementTree offer options when this happens). This doesn’t make any sense for HTML. And these tests make me believe that even really big HTML documents can be handled quite well by lxml, so a huge outlying document won’t break a system that is appropriately optimized for handling normal sized documents.
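To show what I mean by awkward, here is a sketch of collecting links and their text with a stream parser (standard library html.parser in Python 3; LinkCollector is a name invented for the example). All the bookkeeping about "am I inside a link right now?" has to be carried by hand between events.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from a stream of parse events."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # set while we are inside an <a> element
        self._text = []     # text fragments seen since the <a> opened

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


collector = LinkCollector()
collector.feed("<p>See <a href='/a'>A</a> and <a href='/b'>the <b>B</b> page</a>.</p>")
collector.close()
print(collector.links)
```

With a document model the same job is a query for the link elements followed by reading each one's href and text, with no state to track.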

2008 03 30

HTML
Programming
Python


HTML Accessibility

So I gave a presentation at PyCon about HTML, which I ended up turning into an XML-sucks HTML-rocks talk. Well that’s a trivialization, but I have the privilege of trivializing my arguments all I want.

Somewhat to my surprise this got me a heckler (of sorts). I think it came up when I was making my "<em> lies and <i> is truth" argument. That is, presentation and intention are the same. There are those people who feel they can separate the two, creating semantic markup that represents their intent, but they are so few that the reader can never trust that the distinction is intentional, and so <em> and <i> must be treated as equivalent.

Someone then yelled out something like "what about blind people?" The argument being that screen readers would like to distinguish between the two, as not all things we render as italic would be read with emphasis.

It’s not surprising to me that the first time I’ve gotten an actively negative reaction to a talk it was about accessibility. When having technical discussions it’s hard to get that heated up. Is Python or Ruby better? We can talk shit on the web, where all emotions get mixed up and weirded, but in person these discussions tend to be quite calm and reasonable.

Discussions about accessibility, however, have strong moral undertones. This isn’t just What Tool Is Right For The Job. There is a kind of moral certainty to the argument that we should be making a world that is accessible to all people.

I fear this moral certainty has led people self-righteously down unwise paths. They believe — with of course some justification — that the world must be made right. And so many boil-the-ocean proposals are made, and even become codified by standards, but markup standards are useless unless embodied in actual content, and this is where accessibility falls down.

There are two posts that together have greatly eroded my trust in accessibility advocates, so that I feel like I am left adrift, unwilling to jump through the hoops accessibility advocates put up as I strongly suspect they are pointless.

The first post is about the longdesc attribute, an obscure attribute intended to tell the story of a picture. Where alt is typically used as a placeholder for the image, and a short description, longdesc can point to a document that describes the image at length. Empirically they (Ian Hickson in particular) found that the attribute was almost never used in a useful or correct way, rendering it effectively useless. If the discussion had clearly ended at this point, I would have deducted points for those people who advocated longdesc based on bad judgement, but it would not have affected my trust because anyone can mispredict. But the comments just seemed to reinforce the belief that because it should work, it would work.

The second post was Ian Hickson’s description of using a popular screen reader (JAWS); you’ll have to dig into the article some, as it’s embedded in other wandering thoughts. In summary, JAWS is a horrible experience, and as an example it didn’t even understand paragraph breaks (where the reader would be expected to pause). What’s the point of semantic markup for accessibility when the most basic markup that is both presentation and semantic (<p>) is ignored? Ian’s brief summary is that if you want to make your page readable in JAWS you’d do better by paying attention to punctuation (which does get read) than to markup. And if you want to help improve accessibility, blind people need a screen reader that isn’t crap.

Months later we started talking a bit about the accessibility of openplans.org. Everyone wants to do the right thing, no? With my trust eroded, I argued strongly that we should only implement accessibility empirically, not based on "best practices". Well, barring some patterns that seem very logical to me, like putting navigation textually at the bottom of the page, and other stuff that any self-respecting web developer does these days. But except for that, if we want to really consider accessibility we should get a tool and use it. But I don’t really know what that tool should be; JAWS is around $1000, all for what sounds like a piece of crap product. We could buy that, even though of course most web developers couldn’t possibly justify the purchase. But is that really the right choice? I don’t know. If we could detect something in the User-Agent string we could see what our users actually use. But I don’t think there’s information there. And I don’t know what people are using. Optimizing for screen magnifiers is much different than optimizing for screen readers.

Another shortcut for accessibility — a shortcut I also distrust — is that to make a site accessible you make sure it works without Javascript. But don’t many screen readers work directly off browsers? Browsers implement Javascript. Do blind users turn Javascript off? I don’t know. If you use no-Javascript as a hint to make the site more accessible, you might just be wasting your effort.

There are also some weird perspective problems with accessibility. Blind users will always be a small portion of the population. It’s just unreasonable to expect sighted users to write to this small population. Relying on hidden hints in content to provide accessibility just can’t work. Hidden content will be broken, only visible content can be trusted. Admitting this does not mean giving up. As a sighted reader I do not expect the written and spoken word to be equivalent. I don’t think blind listeners lose anything by hearing something that is more a dialect specific to the computer translation of written text to spoken text. (Maybe treating text-to-speech as a translation effort would be more successful anyway?)

A freely available screen reader probably would help a lot as well. I write my markup to render in browsers, not to render to a spec. Anything else is just bad practice. I can’t seriously write my markup for readers based on a spec.

2008 03 23

HTML
Programming
Web


Monkeypatching and dead ends

Bill de hÓra and then Patrick Logan picked up on an old post of mine about monkeypatching.

Patrick’s reply:

I know next to nothing about the specific problems the Ruby and Python folks are encountering with "monkeypatching". However this capability is nothing new for dynamic languages. And it is a frequent desire for me when I program in C-like languages. If you become frustrated using static "utility" methods, for example in Java, that work with "closed" classes (say, String or Object), then you have at least some desire for these "monkeypatches".

See the thing is this capability is a cool feature in many Lisp and most Smalltalk systems. Sorry, dear readers who hate my Smug Lisp Weeniness. But it is true. Not only is it "cool," moreover it is pragmatic.

The truly good implementations of dynamic languages recognize the advantages of these kinds of extensions, and they’ve supported them with good tools for decades. Learn from it, don’t run from it.

Sharp tools are good. I would not want monkeypatching removed from Python. Still, it’s best not to leave sharp tools lying around. It’s best not to mix your butter knives with your steak knives. I don’t resent the safety guards on circular saws.

And sorry Lisp Weenies: your experiences are not so novel anymore. The Python community isn’t new to this dynamic typing thing. We’ve taken some hits and we’ve learned from them. And frankly the problems with runtime patching of methods can’t be specific to Python or Ruby. It only took the Ruby community a couple years to start catching on. Are you telling me Lisp and Smalltalk programmers still haven’t figured this out? Everything you value about modularity is at risk when you monkeypatch code. That risk can be worth it, of course! But do you really need me to explain the benefits of modularity? What’s next, a recap of the problems with GOTO?
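For anyone who hasn't run into the term, here is a toy illustration of what runtime patching does to modularity. Every name in it (Greeter, library_banner, shouting_greet) is invented for the example; Greeter stands in for somebody else's library code.

```python
class Greeter:
    """Stands in for a class defined in a third-party library."""

    def greet(self, name):
        return "Hello, " + name


def library_banner():
    """Stands in for unrelated code elsewhere that relies on Greeter."""
    return Greeter().greet("world")


before = library_banner()

# The monkeypatch: replace the method on the class itself, at runtime.
original_greet = Greeter.greet


def shouting_greet(self, name):
    return original_greet(self, name).upper() + "!"


Greeter.greet = shouting_greet

# Code that never asked for the change now behaves differently.
after = library_banner()

# Undoing the patch is possible, but only if someone remembers to do it.
Greeter.greet = original_greet
restored = library_banner()

print(before, after, restored)
```

Nothing in library_banner changed, yet its behavior did, and nothing at its call site hints at why. That is the modularity cost, whatever language the patch is written in.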

One of the things that I think distinguishes Python among the popular dynamically typed languages of the day, is that it’s built — languages and libraries — on a great deal of concrete experience. Experience about developing with Python. There was a time when people tended to define Python as a delta from Java or Perl or C. We don’t need to do that anymore. Sure, closed classes in Java suck. Python isn’t a reaction to Java’s suckiness. That we can do something Java can’t doesn’t get me excited. This feature of monkeypatching has to stand up on its own, and while sometimes its use is justified those cases are few and far between. That’s what we’ve learned: monkeypatching was not dismissed out of hand, it was not dismissed because of anything in Java, it was dismissed because people used it without acknowledging it as a hack, and it sucked.

Of course the use cases are still there. Which is why people are trying new things to address these problems. One benefit of experience is that you know some paths are dead ends. We still haven’t figured out The One Right Path (and we never will), and maybe we’ve only traced out the longest path in a very long dead end in this maze of ideas we are traversing. Since I doubt the maze has any exit (nirvana?) it’s a valid debate about where we are trying to get at all. That said, I suspect we’ve out-explored Lisp. Lisp has been a worthy mentor, an intrepid explorer in his time, but he’s old and doesn’t get out much and only tells stories of where he’s been in the past. There are still things to be learned there, wisdom to be dug out of that environment, but Lisp and Python are not peers.

2008 03 21

Programming
Python
Ruby


PyCon Talks

I gave my talk at PyCon this year on HTML processing in Python. It seemed to be well received, but I did see some comments on the web from people who wanted more technical content. The presentation I ended up giving was really more about HTML and its place and advantages compared with XML and XHTML. Basically I decided to talk about why you want to process HTML, instead of getting too much into how. I talked a little bit about how, but if you had hoped to hear much technical substance then certainly it would have been disappointing. (As to the slide question, I will get them up, but I want to assemble at least a little of the material that motivated it, since slides alone aren’t that useful.)

I’m not sure what to do for talks like these. In 30 minutes it is hard to go into much depth about a subject. And I’m not sure what the purpose would be. If you want to learn to use a tool you should read the documentation, sit down with a computer, and give it a try. You certainly shouldn’t come to a talk. So I’ve tried to avoid the technical details, and instead try to make people want to learn the "how" on their own. This was the goal of my WSGI talk last year as well.

That said, the whole talk format was unsatisfying to me at PyCon. Lightning Talks are great (or at least, can be great, and I thought on Sunday when all the sponsors were gone, they were great). But 30 minutes is too much time for just presenting an idea, and too little time (and a poor format) for presenting advanced content. And that slides are built into the format just makes it worse.

I wish the Open Spaces had been more functional at PyCon. I couldn’t find the ones I wanted and felt conflicted between them and talks (in retrospect of course there shouldn’t have been conflict). I didn’t have a single successful Open Space experience this year, despite trying a couple times.

While the organization of Open Spaces could be improved, I don’t think Just In Time Planning is going to work at PyCon. And I don’t think Open Spaces have to be JIT. What makes them most interesting isn’t that they are totally ad hoc, but that they aren’t just one person talking to a bunch of other people. I’d have been very happy to lead some more extensive "talk" about HTML and XML processing in Python, probably starting out with 10-15 minutes of introduction and survey and then going completely to discussion, Q/A, or wherever the attendees wanted to take it. I would expect a smaller group of people (I guess), as it would be a larger time commitment and attendance would feel less casual than just showing up at a normal talk. Maybe it could be clearly set up so that it was something like 10-15 minutes of introduction, 30-35 minutes of talking, and then a clearly planned structure for people to stop talking (moving on to another topic) or stick around to talk more in a smaller and more intimate space and group. Figuring out where to continue the discussion should not be left until the moment everyone has to choose whether to stay or leave. You’d have to be careful not to let it degenerate into people just asking questions of only personal interest. I’m pretty comfortable generalizing people’s overly specific questions. People who are less comfortable doing that might benefit from a moderator to help them guide the discussion.

This format seems like much less work for me as a speaker (and the stress of speaking greatly infringes on my enjoyment of the conference), it seems more likely to fit the interests of the attendees, and I think it can provide benefit for a much wider range of interests and experience levels.

Anyway, an idea. I’m not sure how things should be structured at PyCon next year, but I’m pretty sure the current talk format isn’t it. A few talks, sure: The State Of X talks can work well, as do talks about ideas and experience instead of talks about tools and libraries. But I think one track could be sufficient.

2008 03 21

Programming
Python

